Filesystems and case-insensitivity

By Jake Edge
November 28, 2018

A recurring topic in filesystem-developer circles is the handling of case-insensitive file names. Filesystems for other operating systems do so but, by and large, Linux filesystems do not. In the Kernel Summit track of the 2018 Linux Plumbers Conference (LPC), Gabriel Krisman Bertazi described his plans for making Linux filesystems encoding-aware as part of an effort to make ext4, and possibly other filesystems, interoperable with case-insensitivity in Android, Windows, and macOS.

Case-insensitive file names for Linux have been discussed for a long time. The oldest reference he could find was from 2002, but it has come up at several Linux Storage, Filesystem, and Memory-Management Summits (LSFMM), including in 2016 and in Krisman's presentation this year. It has languished so long without a real solution because the problem has many corner cases and it is "tricky to get it right".

An attendee asked about XFS and its handling of case-insensitive file names. Krisman said that when an XFS filesystem is created, it can be configured to handle them. It is ASCII-only, though a proposal from SGI in 2014 would have added full UTF-8 support for XFS and extended the case-handling to Unicode file names.

The traditional Unix approach is that file names are opaque byte sequences that cannot contain "/" characters. He is proposing to add encoding awareness to filesystems, but, he asked, what are the advantages of doing so? For one thing, Windows and macOS have encoding-aware filesystems; it is a feature that Linux lacks. There are "real world use cases" as well: porting from the Windows world, dealing with the case-insensitive tree that Android exposes, and, in general, providing better support for exported filesystems. Android has a user-space hack for case handling, but it is slow and has many race conditions. An encoding-aware filesystem is a better way to expose this functionality to users, he said.

Unicode can represent the "same" string in multiple different ways, via composition for example, but that is confusing. Multiple files with the same-appearing name in a directory, as he showed in his slides [PDF], will be difficult to deal with. That means some kind of normalization will need to be applied. Beyond that, "case" is really only defined in terms of an encoding—it is meaningless for a byte sequence. That is why he implemented encoding awareness before tackling case insensitivity.
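The composition problem can be illustrated with a short user-space sketch. It uses Python's standard unicodedata module rather than anything from the kernel patch set; the file name and the normalized_key() helper are made up for illustration.

```python
import unicodedata

# Two names that render identically: precomposed U+00E9 versus
# "e" followed by the combining acute accent U+0301.
composed = "caf\u00e9.txt"
decomposed = "cafe\u0301.txt"

def normalized_key(name: str, form: str = "NFKD") -> str:
    """Hypothetical helper: map a name to one normalization form."""
    return unicodedata.normalize(form, name)

# As opaque byte sequences the two names are different files.
same_bytes = composed.encode("utf-8") == decomposed.encode("utf-8")

# After normalization they are the same name.
same_after_nfkd = normalized_key(composed) == normalized_key(decomposed)

print(same_bytes, same_after_nfkd)
```

A filesystem that treats names as opaque bytes would happily store both names in one directory; one that normalizes before comparing would see a single name.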

The kernel has a Native Language Support (NLS) subsystem but it has multiple limitations. It has trouble dealing with invalid character sequences—in some situations it returns zero, in others something else. It can't deal with multi-byte sequences or code points; for example, to_upper() and to_lower() return a single byte. There is no support for dealing with the evolution of encodings, which is not really a problem for UTF-8 except for unmapped code points—case folding for unmapped points is not stable, he said. In addition, NLS is missing support for normalization and has only partly implemented case folding; the latter is "almost ASCII only".

Start with NLS

So he has been proposing improvements to NLS as part of his encoding and case-insensitive support patch set that has been posted to the ext4 mailing list. It provides a new load_nls_version() function that allows the caller to define the encoding and version that it wants to use. It has a flags argument that allows filesystems to specify the normalization type, case-fold type, and permissiveness mode they want. That version and behavior information would be stored in the superblock of the filesystem.

Krisman's changes would add support for multi-byte characters by adding a new API for comparisons, normalization, and case folding. It will support UTF-8 NFKD normalization that is based on code from the 2014 SGI patch set. It uses a decoding trie and the mechanism is extendable to other normalization types. For example, if support for the Apple filesystem was needed, NFD normalization could be added. The changes he is making are all backward compatible with existing NLS tables and users, Krisman said.
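The difference between NFKD and NFD can be seen from user space with Python's unicodedata module; this is only an illustration of the two Unicode normalization forms, not of the kernel's trie-based implementation, and the file name is invented.

```python
import unicodedata

# A name starting with U+FB01 LATIN SMALL LIGATURE FI.
ligature_name = "\ufb01le.txt"

# NFD applies canonical decompositions only, so the ligature survives.
nfd = unicodedata.normalize("NFD", ligature_name)

# NFKD also applies compatibility decompositions, so the ligature
# is replaced by the two letters "f" and "i".
nfkd = unicodedata.normalize("NFKD", ligature_name)

print(nfd == ligature_name, nfkd == "file.txt")
```

Which form a filesystem picks therefore changes which names are considered equivalent, which is presumably why the choice needs to be recorded per filesystem.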

He currently has patches for the kernel, e2fsprogs, and xfstests out for review. This effort is quite different from what he presented at LSFMM back in May.

There was some discussion among attendees about the changes. The original file name will be preserved when it is created, Krisman said, so that makes the filesystem "case preserving" like NTFS. Concern was expressed about containers sharing a filesystem with encoded file names, but having different user-space encodings. That is not a use case that is envisioned, he said; root filesystems will not normally be encoding aware. The most common use cases, Ted Ts'o said, are USB sticks with a FAT filesystem that does case folding or users of other operating systems accessing the filesystem through Samba. A storage appliance will be able to create a case-folding filesystem and Samba can turn off its expensive user-space case-handling solution.

Another use case that Krisman brought up was for SteamOS, which would have a separate partition for game data that would be encoding aware. Ts'o said that there are some inherent assumptions in this work. The primary users will be like the SteamOS or Samba appliance examples and that "all the world is UTF-8". It would be hugely complicated to support different directories with different encodings, he said. He invited those present to point out any problems they see with those assumptions.

James Bottomley asked if the user-space side had been consulted on these choices. He noted that European distributions typically use single-byte encodings and that the Chinese hate UTF-8 because all characters become four bytes in size. Ts'o said that the problem is essentially being handed off to the distributions. POSIX does not have a way for filesystems to communicate the encoding of their file names; if that existed, glibc could handle the differences.

There is no good solution for that problem, Ts'o continued. There will be information in the superblock, which should be exposed via statfs(). That will take some time to happen, so perhaps a sysfs field could be used in the interim.

Krisman said that his implementation tries to make good use of the directory entry (dentry) cache. Equivalent names do not create multiple dentries, there is just one per file. The d_hash() and d_compare() routines needed to be made encoding aware. For now, negative dentries (asserting the absence of a given file name) are not cached; it will require some work to carefully invalidate negative dentries during file creation.
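A toy model of that idea, written in Python rather than kernel C, might look like the following. The names name_key(), name_hash(), name_compare(), and ToyDentryCache are all hypothetical and only mimic the roles that d_hash() and d_compare() play; the real dentry cache is far more involved.

```python
import unicodedata

def name_key(name: str) -> str:
    """Hypothetical normalization step shared by hashing and comparison."""
    return unicodedata.normalize("NFKD", name)

def name_hash(name: str) -> int:
    # Equivalent names must hash identically, so hash the normalized form.
    return hash(name_key(name))

def name_compare(a: str, b: str) -> bool:
    # Equivalent names must also compare equal.
    return name_key(a) == name_key(b)

class ToyDentryCache:
    """One entry per file, keyed by normalized name; the created name is kept."""

    def __init__(self):
        self._entries = {}

    def add(self, name: str, inode: int) -> None:
        # A second, equivalent spelling does not create a second entry.
        self._entries.setdefault(name_key(name), (name, inode))

    def lookup(self, name: str):
        return self._entries.get(name_key(name))

    def __len__(self) -> int:
        return len(self._entries)
```

Looking up either spelling of an equivalent name returns the single cached entry, along with the name as it was originally created.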

On to case-insensitivity

Supporting case-insensitive file names requires the encoding-awareness changes in order to define what case folding means for a given character. A per-directory inode attribute can be set to turn on case-insensitivity, but that is only allowed on empty directories to avoid name collisions. Case-insensitivity is trivial to implement once the encoding support is available, he said; it is effectively just a special case of encoding.
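The per-directory behavior described here can be modeled in a few lines of Python. ToyDirectory and its methods are invented for this sketch, and applying str.casefold() after NFKD normalization is a simplification of whatever folding the patch set actually performs.

```python
import unicodedata

class ToyDirectory:
    """Toy model of a directory with an optional case-folding attribute."""

    def __init__(self):
        self.casefold = False
        self._entries = {}  # lookup key -> name as created

    def _key(self, name: str) -> str:
        key = unicodedata.normalize("NFKD", name)
        return key.casefold() if self.casefold else key

    def set_casefold(self, enabled: bool) -> None:
        # Only allowed on empty directories, to avoid name collisions
        # among entries that already exist.
        if self._entries:
            raise OSError("directory not empty")
        self.casefold = enabled

    def create(self, name: str) -> None:
        key = self._key(name)
        if key in self._entries:
            raise FileExistsError(name)
        self._entries[key] = name  # case-preserving: keep the original

    def lookup(self, name: str):
        return self._entries.get(self._key(name))
```

With the attribute set, creating "Makefile" and then looking up "MAKEFILE" finds the original name, while creating "makefile" fails because the folded keys collide.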

There are some limitations of the current implementation, starting with the lack of negative dentries in the cache. Directory encryption is not supported since the lookup is based on the hash of the name, but the same hash cannot be generated from two names that normalize to the same name. He proposed storing the file using the hash of the normalization, but was not sure if that would solve the problem.

Another problem area is how to deal with invalid byte sequences. He proposes falling back to the previous behavior, just treating the names as sequences of bytes, when a sequence is invalid for the encoding. There may be some user-space breakage due to normalization or case folding of file names that will need to be handled as well.
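A minimal sketch of that fallback, assuming UTF-8 as the encoding and a hypothetical lookup_key() helper (the kernel code is not structured this way), could be:

```python
import unicodedata

def lookup_key(raw: bytes) -> bytes:
    """Return the key used for comparing names.

    Valid UTF-8 is normalized and case-folded; anything else falls back
    to the old behavior and is compared as an opaque byte sequence.
    """
    try:
        text = raw.decode("utf-8")  # strict decoding rejects invalid input
    except UnicodeDecodeError:
        return raw
    return unicodedata.normalize("NFKD", text).casefold().encode("utf-8")
```

Under this scheme b"FOO" and b"foo" share a key, while b"\xffFOO" and b"\xfffoo" remain distinct names because neither decodes as UTF-8.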

The current implementation is for the ext4 filesystem, but the main part is the NLS changes. The ext4-specific changes give other filesystems a roadmap to adding encoding-awareness and case-insensitivity, Krisman said. Ts'o noted that there is no active NLS maintainer currently, so he will take Krisman's changes through the ext4 tree. He will try to test other users of NLS, but explicitly is not volunteering to take on NLS maintenance going forward.

Boaz Harrosh pointed out that Linus Torvalds called negative dentries important for performance reasons. He wondered if there were plans to add them for encoding-aware filesystems. Krisman said that invalidating negative dentries needs careful thought and code but that it should be doable. The path for file renames is particularly tricky. Bottomley asked why negative dentries needed to be handled differently than positive ones. The problem is that many people want case-preserving filesystems, so looking up FOO when foo exists should generate a negative dentry for FOO but that will interfere with case-insensitive lookups for Foo or even foo.

The reaction to this proposal was much more positive than to Krisman's earlier attempt. It would seem that we will soon have the ability to handle case-insensitive ext4 filesystems and the potential is there to add it for others.

[I would like to thank LWN's travel sponsor, The Linux Foundation, for assistance in traveling to Vancouver for LPC.]

Filesystems and case-insensitivity

Posted Nov 28, 2018 14:07 UTC (Wed) by sorokin (guest, #88478) [Link] (9 responses)

Does anyone have any measurements of how much faster or slower case-insensitive filesystems are?

The problem I see is that filesystems serve two purposes:
1. They are a place where users store their files.
2. They are a place where programs store some internal data. Kind of (key, value) storage with hierarchical key.

For the first usage I see the merit of having case-insensitive filesystem. It depends on personal preference though.

For the second usage case-insensitivity is a downside. When a program looks up one of its internal files or resources, case-insensitive comparison is both unnecessary and potentially incorrect. When I scan a directory with readdir/statat I don't want statat to be case-insensitive.

Filesystems and case-insensitivity

Posted Nov 28, 2018 14:28 UTC (Wed) by bandrami (guest, #94229) [Link] (3 responses)

statat(2) doesn't interact with the filename at all, does it? You need to have already opened the file, but it doesn't know or care by what name you opened it, or even if that name still exists.

Filesystems and case-insensitivity

Posted Nov 28, 2018 14:43 UTC (Wed) by sorokin (guest, #88478) [Link] (1 responses)

> statat(2) doesn't interact with the filename at all, does it?

Unfortunately it does. See "pathname" parameter: int fstatat(int dirfd, const char *pathname, struct stat *statbuf, int flags);

It stats the file "pathname" relative to directory "dirfd". Normally when readdir returns DT_UNKNOWN one has to statat the filename relative to the directory to figure out the real d_type.

Filesystems and case-insensitivity

Posted Nov 28, 2018 15:42 UTC (Wed) by bandrami (guest, #94229) [Link]

Gah, thanks. Looks like I was thinking of BSD.

Filesystems and case-insensitivity

Posted Nov 28, 2018 21:05 UTC (Wed) by madscientist (subscriber, #16861) [Link]

Are you thinking of fstat(2)?

Filesystems and case-insensitivity

Posted Nov 28, 2018 16:09 UTC (Wed) by sorokin (guest, #88478) [Link] (1 responses)

BTW, even if a user wants to work with files in a case-insensitive manner, that does not mean that the underlying filesystem must be case-insensitive.

For example, I can imagine that a save-file dialog could ask the following question: "You are trying to create file Foo.txt while foo.txt already exists. Do you want to create another file that differs only in letter case?"

Correspondingly, an open-file dialog would first look for an exact match and, if no file is found, search for the file case-insensitively. I would like to have a convenience feature like this even now.

My point is that this should be done only in a limited number of user-facing dialogs. Doing this for most existing system calls would be inefficient and can be incorrect if filenames are used as opaque keys.

Filesystems and case-insensitivity

Posted Nov 28, 2018 20:34 UTC (Wed) by saffroy (guest, #43999) [Link]

> Doing this for most existing system calls would be inefficient

Well, it really depends on the use case (pun intended). Once I added case-insensitivity support to a proprietary filesystem specifically to improve performance, with great success.

Consider a case-sensitive folder with 10,000 files (this is not rare at all), shared over Samba. Every time a Samba client requests creation of a new file, and because the client requires case insensitivity, Samba has to scan the entire folder to check if the new name collides with an existing name. Yes, that's for every new file.

If the filesystem is actually case-insensitive, Samba can skip these scans, which is a huge performance boost.
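The cost of that scan is easy to see in a sketch. This is not Samba's code; find_case_collision() is a hypothetical helper that shows the linear walk a user-space server has to do on a case-sensitive filesystem before it can create a file.

```python
import os
import tempfile

def find_case_collision(directory: str, new_name: str):
    """Return an existing entry whose name matches new_name ignoring case.

    Every call walks the whole directory, so the cost grows with the
    number of entries. Returns None when there is no collision.
    """
    wanted = new_name.casefold()
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.name.casefold() == wanted:
                return entry.name
    return None

# Usage: one existing file, then a creation request that differs only in case.
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "Report.txt"), "w").close()
    print(find_case_collision(d, "report.TXT"))
```

With a case-insensitive filesystem the same question is answered by a single lookup, which is where the performance gain comes from.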

Filesystems and case-insensitivity

Posted Nov 28, 2018 16:43 UTC (Wed) by smcv (subscriber, #53363) [Link] (2 responses)

> They are a place where programs store some internal data. Kind of (key, value) storage with hierarchical key.

For the SteamOS use case, it's desirable that this lookup can be case-insensitive: game developers typically do most of their testing and development on Windows, where opening "level3.MAP" will successfully find a file named "Level3.map". If the obvious port of that game fails to work on Linux, it makes Linux gaming look bad, and makes porting games to Linux less appealing.

Emulations of case-insensitive environments, like Wine and Samba, also need to match the case-insensitive behaviour of the environment they're emulating.

Filesystems and case-insensitivity

Posted Nov 28, 2018 20:28 UTC (Wed) by HenrikH (subscriber, #31152) [Link]

And it can be pointed out that this has actually happened (the Linux version does not start due to case inconsistencies) quite a few times on Steam so it's not just a theoretical problem.

Filesystems and case-insensitivity

Posted Nov 28, 2018 21:12 UTC (Wed) by madscientist (subscriber, #16861) [Link]

It can easily go the other way too: for example I've seen cases where someone created Git branches on Linux which differed only by case. That worked fine for them, but people on MacOS or Windows had difficult-to-understand problems (because Git branches exist as directories/files).

Filesystems and case-insensitivity

Posted Nov 28, 2018 14:41 UTC (Wed) by mgedmin (subscriber, #34497) [Link] (10 responses)

> He noted that European distributions typically use single-byte encodings

Am I living in a bubble? What are the European distributions that don't use UTF-8 by default in 2018?

Filesystems and case-insensitivity

Posted Nov 28, 2018 16:14 UTC (Wed) by niner (subscriber, #26151) [Link] (8 responses)

It's a bit disconcerting that someone working on text encoding support in the kernel has such grave misconceptions about encodings. I haven't seen anything but UTF-8 on a European Linux system in more than a decade. Chinese characters (I think he means CJK) are part of the basic multilingual plane and thus are encoded in 3 bytes by UTF-8.

The prevalent GBK encoding uses 2 bytes for such characters, so we're talking about a ~ 50 % increase in storage size. For text. I really wonder who cares about that in 2018. And even more I wonder, who'd care about the storage requirements for file names.
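The size difference can be checked with Python's built-in codecs. The file name below is just an example chosen for this sketch; note that the ASCII extension costs one byte per character in both encodings.

```python
# Four CJK characters plus an ASCII extension.
name = "中文文件.txt"

utf8_len = len(name.encode("utf-8"))  # 3 bytes per CJK character here
gbk_len = len(name.encode("gbk"))     # 2 bytes per CJK character here

print(utf8_len, gbk_len)
```

For this name the result is 16 bytes in UTF-8 against 12 in GBK, so the overhead on a realistic name is smaller than the 50% worst case.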

Filesystems and case-insensitivity

Posted Nov 28, 2018 16:58 UTC (Wed) by rahulsundaram (subscriber, #21946) [Link] (1 responses)

>It's a bit disconcerting that someone working on text encoding support in the kernel has such grave misconceptions about encodings

Careful there. This was not a comment from the developer working on text encoding support but from James Bottomley.

Filesystems and case-insensitivity

Posted Nov 28, 2018 17:02 UTC (Wed) by niner (subscriber, #26151) [Link]

Oh, thanks for the clarification! I misunderstood who the "He" in the sentence referred to.

Filesystems and case-insensitivity

Posted Nov 28, 2018 19:40 UTC (Wed) by roc (subscriber, #30627) [Link]

Also, how often would a filesystem have file names consisting *solely* of CJK?

For example, for Chinese Web pages UTF8 is a win over UTF16 because the majority of the text of a typical Chinese HTML document is actually ASCII.

Filesystems and case-insensitivity

Posted Nov 28, 2018 19:41 UTC (Wed) by roc (subscriber, #30627) [Link] (4 responses)

I think "the Chinese hate UTF-8" is "citation needed".

Filesystems and case-insensitivity

Posted Nov 29, 2018 0:36 UTC (Thu) by willy (subscriber, #9762) [Link] (3 responses)

The "Han unification" part of Unicode appears to have been controversial. https://en.m.wikipedia.org/wiki/Han_unification

But I don't think UTF-8 per se is controversial in China. More so in Russia where it is an evil tool of US oppression.

Filesystems and case-insensitivity

Posted Nov 29, 2018 2:04 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

> More so in Russia where it is an evil tool of US oppression.
Uhm, no.

UTF-8 has finally solved the problem with the veritable zoo of commonly used Russian encodings (KOI-8, Win-1251, GOST, GOST-ALT, ISO).

Filesystems and case-insensitivity

Posted Nov 29, 2018 11:04 UTC (Thu) by andrewsh (subscriber, #71043) [Link] (1 responses)

Well, nobody (citation needed) ever used GOST or ISO encodings.

Filesystems and case-insensitivity

Posted Nov 29, 2018 11:07 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

GOST was used quite a lot in the pre-Internet era and surfaced periodically afterwards, in random places like receipt printer encodings.

ISO was used sometimes in the Internet. It was rare but it existed.

Filesystems and case-insensitivity

Posted Nov 28, 2018 20:30 UTC (Wed) by HenrikH (subscriber, #31152) [Link]

Came here to ask the same thing, have not seen anything other than UTF-8 here in Scandinavia for the last 10-15 years.

Filesystems and case-insensitivity

Posted Nov 28, 2018 14:59 UTC (Wed) by gioele (subscriber, #61675) [Link] (3 responses)

> Beyond that, "case" is really only defined in terms of an encoding

> Supporting case-insensitive file names requires the encoding-awareness changes in order to define what case folding means for a given character.

"Case" is properly defined only in terms of locale, not of encoding. Knowing the encoding (say, UTF-8+NFD vs UTF-16+NFKC) is necessary, but not sufficient. The user locale is needed as well.

In English "istanBUL" matches case-insensitively "Istanbul", in Turkish it does not. (In Turkish the uppercase version of "i" is "İ".)
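Python's str.casefold() implements the locale-independent Unicode case folding, so it makes a convenient way to see what such an approach does with these letters. This only demonstrates Unicode's default folding, not what any particular filesystem implements.

```python
dotted_capital = "\u0130"  # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE
dotless_small = "\u0131"   # ı, LATIN SMALL LETTER DOTLESS I

# Default folding follows the English convention: "I" and "i" match.
english_match = "istanBUL".casefold() == "Istanbul".casefold()

# İ folds to "i" plus a combining dot above, not to plain "i".
dotted_folds_to = dotted_capital.casefold()

# ı has no default case folding and is left alone.
dotless_unchanged = dotless_small.casefold() == dotless_small

# So the Turkish spelling does not match the ASCII one.
turkish_match = "\u0130stanbul".casefold() == "istanbul".casefold()

print(english_match, dotless_unchanged, turkish_match)
```

A Turkish-aware comparison would instead pair "i" with "İ" and "ı" with "I", which is exactly the locale dependence being described.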

What the developers could do is a kind of case-insensitive look-up that also clusters together "similar" letters. Defining which characters are similar opens, however, another can of worms (see `confusables.txt` from Unicode or all the discussions around IDNA and its Nameprep algorithm).

Maybe we should come up with another technical name for these locale-independent imprecise implementations of case insensitiveness?

Filesystems and case-insensitivity

Posted Nov 28, 2018 16:03 UTC (Wed) by anselm (subscriber, #2796) [Link]

Here's an interesting post by James Bennett (Django core developer) on the topic of “case”: Truths programmers should know about case

Filesystems and case-insensitivity

Posted Dec 2, 2018 16:42 UTC (Sun) by epa (subscriber, #39769) [Link] (1 responses)

Is there any reason not to treat i, İ, I, and ı the same for case-folding purposes on the file system?

I am not asking whether they are the same in all uses. I know that in Turkish i and ı are different letters. What I'm suggesting is that for making a case-insensitive filesystem lookup -- where you have already waved goodbye to a strict 1-1 mapping between byte sequences and directory entries -- it surely doesn't matter that much to gloss over the distinction and treat all these four characters the same. Similarly I would consider it a feature, not a bug, if accented characters could be preserved in filenames, but ignored when matching. There are pairs of words in German that differ only in accent, but it's very unlikely an accent would be the only difference between two human-written document names.

Now, you may with some justice argue that loose matching like this belongs in user space, not the kernel. But in the end it's not my preferences or anyone else's that matter. What matters is to efficiently implement the existing (de facto or de jure) standards. What behaviour is Samba required to support with the Turkish uppercase and lowercase letters? The kernel should provide the semantics that Samba needs so it doesn't have to laboriously scan the whole directory to match a filename.

Filesystems and case-insensitivity

Posted Dec 2, 2018 17:19 UTC (Sun) by gioele (subscriber, #61675) [Link]

> Is there any reason not to treat i, İ, I, and ı the same for case-folding purposes on the file system?

Sure they could. But doing it is hard (and computationally expensive).

This is what I meant with

> What the developers could do is a kind of case-insensitive look-up that also clusters together "similar" letters. Defining which characters are similar opens, however, another can of worms (see `confusables.txt` from Unicode or all the discussions around IDNA and its Nameprep algorithm).

Filesystems and case-insensitivity

Posted Nov 28, 2018 15:34 UTC (Wed) by bokr (guest, #58369) [Link] (1 responses)

[15:59 ~/bs]$ lsblk -o NAME,TYPE,FSTYPE,MOUNTPOINT /dev/sda
NAME TYPE FSTYPE MOUNTPOINT
sda disk
├─sda1 part vfat /boot
├─sda2 part ext4 /
├─sda3 part
└─sda4 part ext4
[16:01 ~/bs]$ ls -ld /boot/ef*
ls: cannot access '/boot/ef*': No such file or directory
[16:02 ~/bs]$ ls -ld /boot/E*
drwxr-xr-x 7 root root 4096 May 6 2018 /boot/EFI
[16:04 ~/bs]$ ls -ld /boot/V*
ls: cannot access '/boot/V*': No such file or directory
[16:06 ~/bs]$ ls -ld /boot/v*
-rwxr-xr-x 1 root root 5838720 Nov 23 10:05 /boot/vmlinuz-linux
[16:07 ~/bs]$ ls -ld /boot/VMLINUZ-LINUX
-rwxr-xr-x 1 root root 5838720 Nov 23 10:05 /boot/VMLINUZ-LINUX

It appears that globbing is case-sensitive but a complete name is
case-insensitive.

BTW, I like a default case-insensitive search like you get from emacs,
whereas query-replace using the same regex works case-sensitively unless
you say otherwise.

Filesystems and case-insensitivity

Posted Nov 28, 2018 17:31 UTC (Wed) by mina86 (guest, #68442) [Link]

The difference is because globbing is done by the shell, which assumes a case-sensitive filesystem, while opening and statting are done by the kernel.

Filesystems and case-insensitivity

Posted Nov 28, 2018 16:11 UTC (Wed) by willy (subscriber, #9762) [Link] (1 responses)

Chinese characters use a 3-byte encoding, not 4. The CJK ideographs are U+4E00 to U+9FFF.

There are Extended blocks in U+20000 space which will use 4 bytes, but my understanding is that those are rare characters (the most common 27,000 characters are below FFFF).

The language groups who were worst affected by UTF-8 were Cyrillic and Greek who now need two bytes for every letter. But I don't see what better choice there was.
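These widths are easy to confirm with a few sample code points; the utf8_width() helper and the sample characters are chosen for this sketch only.

```python
def utf8_width(ch: str) -> int:
    """Number of bytes UTF-8 needs for a single code point."""
    return len(ch.encode("utf-8"))

samples = {
    "A": utf8_width("A"),                    # ASCII
    "я": utf8_width("я"),                    # Cyrillic
    "λ": utf8_width("λ"),                    # Greek
    "U+4E00": utf8_width("\u4e00"),          # first CJK unified ideograph
    "U+9FFF": utf8_width("\u9fff"),          # end of that block
    "U+20000": utf8_width("\U00020000"),     # CJK Extension B
}

print(samples)
```

For comparison, a single-byte encoding such as KOI8-R stores "я" in one byte, which is the doubling mentioned above.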

Filesystems and case-insensitivity

Posted Nov 29, 2018 11:17 UTC (Thu) by andrewsh (subscriber, #71043) [Link]

There wasn’t, since before UTF-8 there were at least three popular one-byte encodings with totally incompatible character layouts.

Historically, DOS systems had three encodings for Russian, of which two were almost never used since one of them wasn’t compatible with line/block graphics characters (so no Norton Commander for its users), and another one was developed after the third one gained widespread usage. Of that remaining one, which Microsoft branded as CP866, there were multiple versions differing in certain characters missing or present (e.g. ё vs ± or Ў vs ÷).

Next, Windows came with this wonderful distinction between ‘ANSI’, ‘OEM’ and ‘Unicode’. ‘OEM’, of course, was CP866 for Russian-localised Windows but something else in other versions, and ‘ANSI’ was CP1251 (again, in Russian Windows only). So historically a lot of documents have been created in those two encodings. Most importantly, most ZIP archives had file names encoded in CP866, but some of them in CP1251.

Since ‘ANSI’ was the default in Windows XP and that is still being used in lots of places, such documents are still being produced as we speak™.

Oh, and the third encoding an absolute minority still uses is KOI-8R/U, which only die-hard Unixoids use these days because it allows them to strip bit 7 and still be able to read text in Russian since that’s been the design decision when the charset was developed (they mapped most letters to their phonetic equivalents in inverse case but with bit 7 set: A → а, B → б, C → ц, but, for example, Q → я and V → ж). That encoding has traditionally been used on Unix and Linux systems in the past.
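The bit-7 trick can be reproduced with Python's koi8_r codec. The strip_high_bit() helper is written for this illustration; it only accepts text that KOI8-R can encode and whose stripped bytes are ASCII.

```python
def strip_high_bit(text: str) -> str:
    """Encode text as KOI8-R, clear bit 7 of every byte, and read it as ASCII."""
    raw = text.encode("koi8_r")
    return bytes(b & 0x7F for b in raw).decode("ascii")

# Lowercase Cyrillic letters come out as their uppercase Latin
# phonetic equivalents, as in the mapping listed above.
print(strip_high_bit("абц"))
print(strip_high_bit("яж"))
```

The first call yields "ABC" and the second "QV", matching the A → а, B → б, C → ц, Q → я and V → ж pairs.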

glibc?

Posted Nov 28, 2018 17:24 UTC (Wed) by smurf (subscriber, #17840) [Link]

> POSIX does not have a way for filesystems to communicate the encoding of their file names; if that existed, glibc could handle the differences.

Meh? Above, it was said that Android's userspace hack is … bad. So why should doing it in glibc be any different?

Filesystems and case-insensitivity

Posted Nov 28, 2018 17:38 UTC (Wed) by dgm (subscriber, #49227) [Link] (5 responses)

> Supporting case-insensitive file names requires the encoding-awareness changes in order to define what case folding means for a given character.

Does it? Why can't you use an attribute to store the case-folded name? Or use the case-folded name as the file name and add a name-as-provided attribute? This way you can move case interpretation to user space, where it belongs.

What am I missing?

Filesystems and case-insensitivity

Posted Nov 29, 2018 12:15 UTC (Thu) by alonz (subscriber, #815) [Link] (4 responses)

> use the case-folded name as the file name and add a name-as-provided attribute

This actually looks like one of the sanest proposals – I really wonder why the existing user-space solutions are not using this scheme.

Filesystems and case-insensitivity

Posted Nov 29, 2018 12:43 UTC (Thu) by smurf (subscriber, #17840) [Link] (1 responses)

They were written before random user-specified attributes were a thing, much less one you could rely on to be present (more or less). Even today you can easily create file systems without xattr support.

Filesystems and case-insensitivity

Posted Nov 29, 2018 15:11 UTC (Thu) by alonz (subscriber, #815) [Link]

Since kernel support for case insensitivity is currently even less of a thing, I still wonder...

(I may be tempted to write a PoC of this idea and see how it performs, just for curiosity's sake)

Filesystems and case-insensitivity

Posted Nov 29, 2018 15:57 UTC (Thu) by zdzichu (subscriber, #17118) [Link] (1 responses)

I guess name-as-provided is for displaying in user interface?

% getfattr --name=user.name-as-provided malware.sh 
# file: malware.sh
user.name-as-provided="Safe.pdf"

Filesystems and case-insensitivity

Posted Nov 29, 2018 18:34 UTC (Thu) by smurf (subscriber, #17840) [Link]

Well, that's easy enough to fix – verify that the xattr matches the filename before using it.

A non-malicious source of the same problem is somebody renaming the file with an old non-extended-filename-compatible tool (or libc).
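The check suggested here could be as small as the following sketch. It assumes the on-disk name is the case-folded form and that the attribute value has already been read (for example from an xattr such as the user.name-as-provided one shown earlier in the thread); display_name() itself is hypothetical.

```python
def display_name(stored_name: str, name_as_provided):
    """Pick the name to show to the user.

    The attribute is trusted only if folding it gives back the stored
    (case-folded) file name; otherwise it is stale or malicious and the
    stored name is shown instead.
    """
    if name_as_provided is not None and name_as_provided.casefold() == stored_name:
        return name_as_provided
    return stored_name

print(display_name("readme.txt", "ReadMe.TXT"))  # attribute accepted
print(display_name("malware.sh", "Safe.pdf"))    # attribute rejected
```

The same check also covers the non-malicious case: after a rename by a tool that does not know about the attribute, the stale value no longer folds to the file name and is ignored.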

Filesystems and case-insensitivity

Posted Nov 28, 2018 17:54 UTC (Wed) by tnoo (subscriber, #20427) [Link] (10 responses)

What a weird idea! How will a case-insensitive filesystem deal with filenames generated by a hash algorithm (as used e.g. by git and many other programs)? If this is really needed, it should be a user-space layer, but file systems should handle unique byte sequences uniquely.

Filesystems and case-insensitivity

Posted Nov 28, 2018 20:42 UTC (Wed) by saffroy (guest, #43999) [Link] (2 responses)

Consider hash values 0xabcdef and 0xABCDEF: well, they are the same value. :) Hash-based names in hex actually don't care about case.

Besides, see my other comment about when and why it is needed.

Filesystems and case-insensitivity

Posted Nov 29, 2018 17:05 UTC (Thu) by cesarb (subscriber, #6266) [Link] (1 responses)

Not if they're base64: abcdef and ABCDEF are different values in base64. Git can get away with base16 because it uses a short 160-bit hash, but other hash algorithms have much longer outputs (256 or even 512 bits).
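The difference between the two encodings is quick to verify with the standard library; the four-character base64 strings are arbitrary examples chosen so that no padding is needed.

```python
import base64

# Hex digits mean the same thing in either case.
hex_same = bytes.fromhex("abcdef") == bytes.fromhex("ABCDEF")

# In base64, upper- and lowercase letters are different symbols,
# so names differing only in case decode to different bytes.
b64_same = base64.b64decode("abcd") == base64.b64decode("ABCD")

print(hex_same, b64_same)
```

So a hex-named object store is unaffected by case folding, while a base64-named one could see two distinct objects collapse into one name.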

Filesystems and case-insensitivity

Posted Nov 29, 2018 17:29 UTC (Thu) by bfields (subscriber, #19510) [Link]

What's an example of an application that does that?

Filesystems and case-insensitivity

Posted Nov 28, 2018 21:00 UTC (Wed) by smurf (subscriber, #17840) [Link] (3 responses)

No, you can't do it in userspace. The kernel does not have directory locking, which you'd need for an atomic "create a file while making sure that there is no other file with the same name (case-insensitive)" operation. Thus if you create "makefile" you need to scan the whole directory for any other matching filename, and that *still* doesn't stop anybody from introducing consistency errors (you can race to create "Makefile" and "MakeFile" at the same time).

Filesystems and case-insensitivity

Posted Nov 29, 2018 9:06 UTC (Thu) by dgm (subscriber, #49227) [Link] (2 responses)

> No, you can't do it in userspace. The kernel does not have directory locking, which you'd need for an atomic "create a file while making sure that there is no other file with the same name (case-insensitive)" operation.

Yes, you can. You only need to use the canonical representation of the file name (the case-folded one) and check that file name.

Filesystems and case-insensitivity

Posted Nov 29, 2018 12:20 UTC (Thu) by smurf (subscriber, #17840) [Link] (1 responses)

You presume that there *is* a canonical representation of a file name. For case-preserving file systems, there is no such thing.

Filesystems and case-insensitivity

Posted Nov 29, 2018 15:23 UTC (Thu) by dgm (subscriber, #49227) [Link]

Well, nobody was talking about case preservation, but now that you mention it, you have three options:
- use the filesystem as case preserving (and give up case insensitivity)
- use the filesystem as case-insensitive (and forget about case preservation)
- use xattrs
And all that, without touching a single line of kernel code.

Filesystems and case-insensitivity

Posted Nov 29, 2018 1:02 UTC (Thu) by bfields (subscriber, #19510) [Link] (2 responses)

The filenames generated by git just use 0-9 and a-f, I don't see how case insensitivity would cause any problems.

Git, like most applications, already has to be prepared to work on case-insensitive filesystems.

Filesystems and case-insensitivity

Posted Nov 29, 2018 4:06 UTC (Thu) by smurf (subscriber, #17840) [Link] (1 responses)

The problem is tags, which once ended up as file and directory names.
Git has a "packed-refs" file these days, so the problem *should* be solved, but I haven't checked.

Filesystems and case-insensitivity

Posted Nov 29, 2018 15:31 UTC (Thu) by bfields (subscriber, #19510) [Link]

I haven't looked in a while either, but I think like packed objects it's just an optimization, and refs can also be created in unpacked formats.

File and directory names are a problem too, of course. If a directory you're storing in git includes both foo and FOO, then you'll have a problem when you try to check it out on a case-insensitive filesystem.

I don't think that's really fixable; some people actually do have such content which they need to track in git, others can't deal with it, it's up to the user to decide what they care about.

But git works on case-insensitive filesystems if the stuff you put into it does.

Filesystems and case-insensitivity

Posted Nov 28, 2018 19:09 UTC (Wed) by perennialmind (guest, #45817) [Link] (27 responses)

Another problem area is how to deal with invalid byte sequences. He proposes falling back to the previous behavior, just treating the names as sequences of bytes, when a sequence is invalid for the encoding.

Please no – this is the best opportunity yet to outlaw pernicious byte sequences! Once you decide to accept and present a set of path components as text, why would you want to allow mixing in random binary garbage? Once you've taken the step of blocking a new Makefile when there's a makefile, you clear the way to refusing to accept linebreaks, escape characters, and all the other control characters. Windows already blocks those, so it's a portability win.

The last time I read anything on the topic was an old lwn article⁽¹⁾ on a proposal by David Wheeler⁽²⁾ . Back then it was clear that there would need to be a way to opt-in to such screening. Maybe that happened when I wasn't looking? If not, this looks like the perfect time.

Filesystems and case-insensitivity

Posted Nov 28, 2018 20:20 UTC (Wed) by saffroy (guest, #43999) [Link] (10 responses)

I agree that allowing invalid byte sequences seems dangerous.

However, I wouldn't go as far as forbidding valid characters, that would be a different feature; blocking sequences that are invalid for the selected encoding is sufficient.
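
To illustrate the distinction, here is a rough sketch of an "invalid for the encoding" check, assuming the selected encoding is UTF-8. `is_valid_utf8` is a hypothetical helper; it relies on Python's strict decoder and says nothing about how a kernel would do it:

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if the byte string is well-formed UTF-8."""
    try:
        name.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True

print(is_valid_utf8(b"caf\xc3\xa9"))   # well-formed two-byte sequence
print(is_valid_utf8(b"caf\xe9"))       # lone Latin-1 byte, not UTF-8
print(is_valid_utf8(b"line\nbreak"))   # valid UTF-8, even with a newline
```

Note that the last name passes: a newline is a valid character, so rejecting it would be the separate feature mentioned above.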

Filesystems and case-insensitivity

Posted Nov 28, 2018 21:07 UTC (Wed) by perennialmind (guest, #45817) [Link] (1 responses)

If I understand you correctly, I agree. Adding filename-as-natural-language-text semantics rounds off one of the many sharp edges in shell scripting. Adding bumpers around the rest is a separate task.

If the semantics really are to be changed – if a differentiable set of path elements are to contain text and only text – that's a useful feature from an application developer's perspective. If it's to be a new kind of hard-to-predict weirdness, that's less useful.

I'd prefer it if text-only filenames were limited to printable graphemes only. That might be too high a bar. I would hope that control characters (C0,C1) would be disallowed. I don't consider \x7F (DELETE) or \x09 (TAB) to be valid in a natural-language name.
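
As a sketch of what such a rule might look like, assuming "control characters" means Unicode general category Cc (which covers C0, DELETE, and C1); `has_control_chars` is a hypothetical name:

```python
import unicodedata

def has_control_chars(name: str) -> bool:
    """True if any character is in general category Cc.

    Cc covers C0 (U+0000-U+001F), DELETE (U+007F) and C1 (U+0080-U+009F).
    """
    return any(unicodedata.category(ch) == "Cc" for ch in name)

for candidate in ["notes.txt", "tab\there", "del\x7fete", "na\u00efve"]:
    print(repr(candidate), has_control_chars(candidate))
```

Limiting names to printable graphemes would be a much stricter test than this one.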

Filesystems and case-insensitivity

Posted Dec 6, 2018 10:02 UTC (Thu) by Wol (subscriber, #4433) [Link]

I worked with a system that had filenames of the form <space><backspace>NNNN

Bearing in mind users weren't supposed to go anywhere near them, it was a pretty good way of stopping people scanning the filesystem and messing about with them. I agree for user visible files, it's a good idea, but not all files are meant to be user visible and some of them can't be hidden.

Cheers,
Wol

Filesystems and case-insensitivity

Posted Nov 28, 2018 22:10 UTC (Wed) by quotemstr (subscriber, #45331) [Link] (7 responses)

I've yet to hear any good justification for allowing a formfeed in a filename.

Filesystems and case-insensitivity

Posted Nov 29, 2018 6:39 UTC (Thu) by lkundrak (subscriber, #43452) [Link] (1 responses)

Is "there could already be files with a form feed in their name" a good justification?

Filesystems and case-insensitivity

Posted Nov 29, 2018 21:04 UTC (Thu) by perennialmind (guest, #45817) [Link]

That would be a potential problem for a mount option, but not when setting a per-directory attribute, since the directory must be empty. The same problem applies to the name collision problem though. Imposing new semantics on a filesystem would be problematic. Maybe with a superblock change? Meh. The per-directory attribute just seems cleaner overall.

Filesystems and case-insensitivity

Posted Nov 29, 2018 9:11 UTC (Thu) by dgm (subscriber, #49227) [Link] (4 responses)

> I've yet to hear any good justification for allowing a formfeed in a filename.

You cannot have formfeeds in a file name. You can have bytes with the decimal value 12, but reading it as a formfeed or something else is completely up to you.

Filesystems and case-insensitivity

Posted Nov 29, 2018 10:46 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

What's the difference in practice then?

Seriously, the idiocy with free-form filenames should be fixed.

Filesystems and case-insensitivity

Posted Nov 29, 2018 12:43 UTC (Thu) by hkario (subscriber, #94864) [Link] (2 responses)

because that file could have been created on a system working in CP437 where it would be shown as ♀

Just because a byte string is an invalid sequence in one encoding doesn't mean it's an invalid sequence in all encodings.
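
A small sketch of that point, using a different byte than the one discussed above: 0xE9 on its own is not valid UTF-8 but is a perfectly good "é" in Latin-1. `try_decode` is a hypothetical helper:

```python
def try_decode(raw: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None

name = b"caf\xe9"
print(try_decode(name, "utf-8"))    # None
print(try_decode(name, "latin-1"))  # café
```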

Filesystems and case-insensitivity

Posted Nov 29, 2018 18:28 UTC (Thu) by quotemstr (subscriber, #45331) [Link]

UTF-8 won. We should start giving it its well-earned victory parade. Don't you think that, after we're finished banning non-UTF-8 encodings, and after we ban illegal or bizarre code sequences, and after we start normalizing filenames into consistent form, we'll end up in a better place? If in the process of doing that we can't access files called ♀ from old volumes without special compatibility mount options, so be it.

Filesystems and case-insensitivity

Posted Nov 29, 2018 20:47 UTC (Thu) by perennialmind (guest, #45817) [Link]

That's why I thought it was a brilliant choice to require that directories be empty in order to switch on the "text mode dentries" attribute. You sidestep the "reinterpret" problem in trade for save/copy errors that are easy to surface to end users. I'm not sure how overlayfs union mounts would work though.

Filesystems and case-insensitivity

Posted Nov 28, 2018 21:06 UTC (Wed) by smurf (subscriber, #17840) [Link] (15 responses)

It's easy to encode invalid byte sequences so that they survive a round trip through Unicode / UTF-8 – you mis-appropriate the surrogates. The actual higher-level semantics of that, though, are fraught with corner cases you *really* don't want to deal with.

Basically IMHO there are two sane choices – (a) the current situation: the kernel does not attach any semantics to any bytes other than '/' and '\0' (thus there is no chance for case insensitivity beyond ASCII), or (b) you use clean and preferably pre-normalized UTF-8 on the userspace/kernel boundary, outlaw anything nonconforming, and do everything else in userspace. Anything else is a recipe for long-term disaster.
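
For reference, Python's `surrogateescape` error handler (PEP 383) is one existing implementation of the surrogate trick: each undecodable byte is mapped to a lone surrogate in U+DC80..U+DCFF and mapped back on encoding. A minimal sketch with a made-up file name:

```python
raw = b"report-\xff.txt"   # 0xFF can never appear in well-formed UTF-8

# Undecodable bytes become lone surrogates instead of raising an error.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))

# Encoding with the same handler restores the original bytes exactly.
assert text.encode("utf-8", errors="surrogateescape") == raw

# A strict encoder refuses the result, which is one of the corner cases:
try:
    text.encode("utf-8")
except UnicodeEncodeError as exc:
    print("strict encode failed:", exc.reason)
```

The round trip works, but the resulting string is not valid Unicode text, so anything that normalizes, case-folds, or prints it has to decide what a lone surrogate means.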

Filesystems and case-insensitivity

Posted Nov 28, 2018 22:28 UTC (Wed) by perennialmind (guest, #45817) [Link] (13 responses)

Newline, tab, and bel codepoints are perfectly valid UTF-8 plain text, but I'd prefer to push that out to userspace as well. I don't much care whether curl -O gives me filenames with spaces or %20s, but I do object if I see files with newlines in the names. I don't mind if I'm left with sneaky left-to-right, right-to-left marks or explicitly red hearts. I see the need for parentheses and question marks...

... but not control characters. To me, a natural language filename would comprise user-perceived characters and the one true space character (U+0020). Flexibility beyond that does more harm than good. Leave those footguns to the bytestring paths. 😉

Filesystems and case-insensitivity

Posted Nov 29, 2018 13:41 UTC (Thu) by utoddl (guest, #1232) [Link] (3 responses)

I was with you until you got to spaces. It's only wishing of course, but I wish spaces in file names would go away. Personal peeve.

Filesystems and case-insensitivity

Posted Nov 30, 2018 9:04 UTC (Fri) by jezuch (subscriber, #52988) [Link] (2 responses)

Spaces in file names are wonderful. You can name your files like a human being, not a slave of poorly written shell scripts with broken quoting :) (I have a pet peeve too)

Filesystems and case-insensitivity

Posted Dec 3, 2018 12:30 UTC (Mon) by ale2018 (guest, #128727) [Link] (1 responses)

Ah, poorly written shell scripts, eh? Because you obviously think that being slave of over-complicated command lines is fine? A good percentage of my command lines start with find . -name whatever | xargs... Yes, I know I can write -print0 and -0, I do that when I write shell scripts.

When I find a filename with spaces I just move it away.

For the record, the normalization step and control characters were never taken care of. For example:

    ~$ touch aaabd $(printf 'aaabc\bd')  "$(printf 'aaabc\nd')"
    ~$ ls -lt | head -5
    total 3686968
    -rw-r--r--  1 ale ale                 0 Dec  3 13:21 aaabd
    -rw-r--r--  1 ale ale                 0 Dec  3 13:21 aaabc
    d
    -rw-r--r--  1 ale ale                 0 Dec  3 13:21 aaabd

Control characters were never forbidden. Consider that human beings are sometimes uncertain about the name they're typing and type a backspace (\b) in it. So, why isn't that beautiful too? Perhaps, users should have a clue. In the words of the Ancient Philosophy, rubbish in, rubbish out.

Filesystems and case-insensitivity

Posted Dec 3, 2018 20:24 UTC (Mon) by flussence (guest, #85566) [Link]

ls took care of that a few years ago…

    ~/test $ ls
    'aaabc'$'\b''d'  'aaabc'$'\n''d'   aaabd
    ~/test $ ls --version
    ls (GNU coreutils) 8.30
    Packaged by Gentoo (8.30 (p01))

Filesystems and case-insensitivity

Posted Nov 30, 2018 9:09 UTC (Fri) by jezuch (subscriber, #52988) [Link] (3 responses)

I guess the concept of control characters should have been retired long time ago. I also think that it was a huge mistake to bring them to UTF-8 along with the rest of ASCII. But I'm pretty sure someone will explain to me that they are in fact critical and there are further control characters in the Unicode spec anyway :)

Filesystems and case-insensitivity

Posted Nov 30, 2018 16:25 UTC (Fri) by perennialmind (guest, #45817) [Link] (2 responses)

You mean end-of-string delimiters, end-of-line delimiters, tabs, and the codes needed for controlling a terminal such as escape and erase? Setting aside hurdles to adoption, one can imagine hoisting those into markup. Perhaps there's even a spec for plainer-than-plain-text for when such markup exists (i.e. HTML). If so, it might be perfect for filenames.

ASCII compatibility was the selling point for UTF-8. Beyond the above, even the oddballs are still in use. Take for example "group separator" which stands in for FNC1 in barcodes.

Somebody else will have to defend the C1 block though.

Filesystems and case-insensitivity

Posted Dec 1, 2018 11:24 UTC (Sat) by jezuch (subscriber, #52988) [Link] (1 responses)

I mean all the bytes below 0x20. This is not text, they have no place in a *character* encoding. Apart from that I'm totally fine with ASCII compatibility, even though it's a typically American culturally insensitive invention ;)

Filesystems and case-insensitivity

Posted Dec 6, 2018 10:16 UTC (Thu) by Wol (subscriber, #4433) [Link]

I believe there are two control characters RS1 and RS2? Basically standing for "Repeat String"? Which were used on a system I worked on, and actually were a damn good fix for "how many characters does a tab stand for?". So most lines in my FORTRAN source code would have been physically stored on disk as "<RS1><6>code..."

And if you had a lot of spaces it saved a fair few bytes over tab-encoding, plus being completely unambiguous.

Cheers,
Wol

Filesystems and case-insensitivity

Posted Dec 4, 2018 8:13 UTC (Tue) by pr1268 (guest, #24648) [Link] (4 responses)

one true space character (U+0020)

Um, there's more than one space: ' ' and ' '. One is \u0020 (good ol' ASCII 0x20) and the other is \u00a0.

I was personally burned by the second "space" above appearing in an Excel spreadsheet (to the exclusion of the "one true space character" you mentioned). >:-(

Filesystems and case-insensitivity

Posted Dec 4, 2018 10:24 UTC (Tue) by smurf (subscriber, #17840) [Link] (3 responses)

Don't worry – Unicode has a bunch more spaces, including zero-width ones.

On second thought: do worry.
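
A quick way to see them, as a sketch: list every character in the Basic Multilingual Plane whose general category is Zs (space separator). The exact list depends on the Unicode version bundled with your Python, so no count is claimed here:

```python
import unicodedata

# Every BMP character that Unicode classifies as a space separator (Zs).
space_separators = [
    ch for ch in map(chr, range(0x10000))
    if unicodedata.category(ch) == "Zs"
]

for ch in space_separators:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# ZERO WIDTH SPACE is not even in Zs; it is a format character (Cf),
# so a filter on Zs alone would miss it.
print(unicodedata.category("\u200b"))
```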

Filesystems and case-insensitivity

Posted Dec 4, 2018 13:27 UTC (Tue) by hummassa (subscriber, #307) [Link] (2 responses)

The initial argument still holds: there is no reasonable rationale for those other space characters (including U+00a0) in file names.

Filesystems and case-insensitivity

Posted Dec 12, 2018 23:45 UTC (Wed) by pr1268 (guest, #24648) [Link] (1 responses)

there is no reasonable rationale for those other space characters (including U+00a0) in file names.

Agreed, but try telling that to those fools who auto-generated the spreadsheet with \u00a0 spaces. </angry rant>

Filesystems and case-insensitivity

Posted Dec 13, 2018 10:33 UTC (Thu) by james (subscriber, #1325) [Link]

Would it calm your anger to point out that LibreOffice can search using regexps?

Filesystems and case-insensitivity

Posted Nov 29, 2018 4:29 UTC (Thu) by raven667 (subscriber, #5198) [Link]

This was basically my thinking too, I'm an amateur when it comes to filesystem/vfs design or kernel/user interface, it seems you could limit the scope of what the kernel tries to do to something reliably implementable, such as storing an encoding hint and validating the encoding on read/write, that would be useful for a reference implementation in userspace to handle normalization and locale and all the wooly corner cases that will probably require frequent patching, but I'm not sure that you could outlaw anything at the kernel interface except for strings that are not valid for the hinted encoding, or a very small blacklist of control characters. I don't know what the state of the art is in userspace, but it seems that a lot of these challenges have already been faced by web browsers, database engines and others, and it would make sense to me to re-use as much of that experience, or even implementations, as possible to build a consensus, conventions and a reference implementation for the libc's and others to use.

Filesystems and case-insensitivity

Posted Nov 28, 2018 20:54 UTC (Wed) by saffroy (guest, #43999) [Link]

The question of negative cache entries is interesting, and I confirm it is very important for performance.

I didn't look at the dentry cache in a very long time, though I suppose that looking up a dcache entry by name is essentially: compute a hash of the (name, directory ino) tuple, then look it up in a hash table, by comparing the requested tuple with the cached tuples in the bucket.

Then, I suppose it is sufficient to have a per-filesystem hash function, which can compute a hash over the normalized name. And a per-filesystem comparison function can then compare the requested tuple with the cached tuples.

Probably that would work, right?
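
A toy model of that idea, purely as a sketch: it is not how the kernel's dcache is actually structured, and `fold`, `NameCache`, and the inode numbers are all hypothetical. The fold function stands in for the per-filesystem hash and comparison hooks, and an entry with `inode=None` plays the role of a negative entry:

```python
import unicodedata

def fold(name: str) -> str:
    """Stand-in for a per-filesystem normalization: NFD, case fold, NFD again."""
    nfd = unicodedata.normalize("NFD", name)
    return unicodedata.normalize("NFD", nfd.casefold())

class NameCache:
    def __init__(self):
        self._buckets = {}

    def _key(self, dir_ino: int, name: str) -> int:
        # Hash over the normalized name, so equivalent spellings share a bucket.
        return hash((dir_ino, fold(name)))

    def insert(self, dir_ino: int, name: str, inode):
        # inode=None records a negative entry ("this name does not exist").
        bucket = self._buckets.setdefault(self._key(dir_ino, name), [])
        bucket.append((dir_ino, name, inode))

    def lookup(self, dir_ino: int, name: str):
        # Compare the requested tuple with the cached tuples in the bucket.
        for d, cached_name, inode in self._buckets.get(self._key(dir_ino, name), []):
            if d == dir_ino and fold(cached_name) == fold(name):
                return cached_name, inode
        return None  # cache miss: the filesystem itself must be consulted

cache = NameCache()
cache.insert(2, "Makefile", 100)
cache.insert(2, "missing.txt", None)
print(cache.lookup(2, "makefile"))
print(cache.lookup(2, "MISSING.TXT"))
```

One thing the sketch makes visible: a negative entry created for one spelling has to answer for every equivalent spelling, and has to be invalidated when any of them is created.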

Filesystems and case-insensitivity

Posted Nov 29, 2018 8:30 UTC (Thu) by nim-nim (subscriber, #34454) [Link] (11 responses)

It's pretty sad how encoding problems have to be presented via a case sensitivity bias before US devs even consider them. F-up all non-English languages of the world: NOT A PROBLEM. Mistreating English casing: HUGE PROBLEM.

Anyway, hope this gets fixed. Transition to UTF-8 was awful for *x filesystems, I sure hope there won't be a v2 with wide encoding problems added to the mix whenever UTF-8 gets deprecated in favour of something better.

Filesystems and case-insensitivity

Posted Nov 29, 2018 9:15 UTC (Thu) by dgm (subscriber, #49227) [Link] (8 responses)

Adding casing to the kernel is a sure recipe for intense pain *when* (not if) the next transition happens. And all for solving a non-existing problem. Oh vey.

Filesystems and case-insensitivity

Posted Nov 29, 2018 11:38 UTC (Thu) by eru (subscriber, #2753) [Link] (7 responses)

*when* (not if) the next transition happens

I would hope that is never. UTF-8 can represent all characters now in practical use. The main risk is designing emojis going totally out of hand, and they insist each of them should have a UNICODE code point... oh wait...

Filesystems and case-insensitivity

Posted Nov 29, 2018 12:41 UTC (Thu) by chithanh (guest, #52801) [Link] (6 responses)

> I would hope that is never. UTF-8 can represent all characters now in practical use.

That is not correct. In particular, Unicode (and by extension UTF-8) is deficient regarding some characters in African languages, due to the Unicode consortium's policy regarding precomposed characters vs. combining diacritics. They don't want to introduce new equivalences.

Filesystems and case-insensitivity

Posted May 29, 2019 23:00 UTC (Wed) by Serentty (guest, #132335) [Link] (5 responses)

This is not a deficiency with Unicode. Precomposed characters such as É have only ever been encoded in Unicode as a matter of compatibility with legacy encodings, and wouldn't have been included if not for this. They continue to be used because they save you a few bytes, which you might as well go for even if compression makes it moot in the end. Combining diacritics have always been the preferred method as they are much more flexible, and allow users to compose arbitrary characters without needing to constantly update their software or risk mojibake. Many scripts in Unicode work entirely through combining diacritics and it works just fine; the Indic scripts are good examples. It should be noted that the legacy encodings for these scripts usually worked that way as well. Conformant implementations will treat composed and decomposed characters identically, so the advantage of going down the rabbit hole of trying to provide every precomposed character anyone might ever want isn't really worth it when composition works just as well. If you notice that combining diacritics aren't giving you the nice hand-tweaked glyphs that precomposed characters are, and you end up with the diacritic looking all wrong, take it up with the developer of the text renderer or the font, because that's not how Unicode is supposed to work.
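
The equivalence is easy to check with Python's `unicodedata` module; a short sketch using É as the example:

```python
import unicodedata

composed = "\u00c9"      # É as one precomposed code point
decomposed = "E\u0301"   # E followed by COMBINING ACUTE ACCENT

# As raw code point sequences the two differ...
print(composed == decomposed)

# ...but they are canonically equivalent: either normalization form
# maps both spellings to the same sequence.
for form in ("NFC", "NFD"):
    same = unicodedata.normalize(form, composed) == unicodedata.normalize(form, decomposed)
    print(form, same)

# The "few bytes" saved by the precomposed form, in UTF-8:
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))
```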

Filesystems and case-insensitivity

Posted May 30, 2019 14:18 UTC (Thu) by smurf (subscriber, #17840) [Link] (4 responses)

Hmm. If that rule had actually been followed, we'd still have room on the base plane (i.e. codepoints below 65536).

(How many primitives would you need for Chinese?)

On the other hand, in that case we wouldn't all use UTF-8 by now – simply because that would require twice the storage space for Chinese text, more or less. Nowadays that doesn't really matter, but at the time it was a problem.

Filesystems and case-insensitivity

Posted May 30, 2019 14:54 UTC (Thu) by excors (subscriber, #95769) [Link] (2 responses)

https://en.wikipedia.org/wiki/Template:CJK_ideographs_in_... has a helpful list with the numbers of CJK codepoints, and I assume only the earliest ones were needed for legacy compatibility - the rest were presumably added because they couldn't already be represented. Recently Unicode 10.0 added "CJK Extension F" (7473 codepoints) so it seems they're still not finished. Then there's all the other scripts being added, like Tangut ("a major historic script of China") with another ~7000 codepoints. And about 1700 emojis (https://unicode.org/emoji/charts/emoji-list.html).

Maybe the 64K limit could have lasted for many more years if they had made some different design choices early on, but given the goal of being a universal standard for all text, it seems inevitable the limit would be broken eventually. It's better to have broken it earlier than later.

Filesystems and case-insensitivity

Posted May 31, 2019 15:06 UTC (Fri) by smurf (subscriber, #17840) [Link]

True, that.

Seems that quite a few Chinese people with interesting names (i.e. using archaic characters) suddenly couldn't get an official document any more because, surprise, their name wasn't in the "official" charset …

Filesystems and case-insensitivity

Posted May 31, 2019 18:37 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Technically, most of CJK characters can be decomposed into simpler characters. About 70% of Mandarin characters follow the "radical-phonetic" model and can theoretically be composed.

Filesystems and case-insensitivity

Posted Jun 6, 2019 5:09 UTC (Thu) by Serentty (guest, #132335) [Link]

Encoding Chinese text based on primitives is the Holy Grail of Chinese text encoding, but no one has actually been able to come up with a realistic solution for it, and it's probably just not realistic. Korean text on the other hand is really easy to encode based on primitives, as it's just 22 letters combined in predictable ways.

Filesystems and case-insensitivity

Posted Nov 29, 2018 12:29 UTC (Thu) by smurf (subscriber, #17840) [Link] (1 responses)

> F-up all non-English languages of the world: NOT A PROBLEM.

Before UTF-8, there never was an encoding that could represent "all non-English languages". At most it could store one other language, or ten (Windows and its brain-dead decision to use 16-bit characters), and that is a subset of Unicode/utf-8.

> whenever UTF-8 gets deprecated

It won't be. There's no reason at all to do that, and several billion reasons not to.

Filesystems and case-insensitivity

Posted Nov 29, 2018 12:43 UTC (Thu) by eru (subscriber, #2753) [Link]

(Windows and its brain-dead decision to use 16-bit characters), and that is a subset of Unicode/utf-8.

To be fair, that was the UNICODE spec at the time. Similarly, Java originally used 16-bit characters (and a char type is still 16 bits wide there). Now Java internally encodes strings as UTF-16 in order to support the expansion of UNICODE.

Filesystems and case-insensitivity

Posted Nov 29, 2018 12:16 UTC (Thu) by skissane (subscriber, #38675) [Link] (1 responses)

> POSIX does not have a way for filesystems to communicate the encoding of their file names; if that existed, glibc could handle the differences.

Maybe it should? Or, maybe, at least, Linux kernel should add some API to report the filesystem path name encoding. If Linux does it, maybe it could be added to POSIX by Austin Group.

Or, something I'd like even better – it must always be UTF-8, and filesystem has to translate to/from if anything else. But, that's probably going to cause backwards compatibility issues for some people, whereas just reporting filesystem encoding to user space won't.

Filesystems and case-insensitivity

Posted Dec 6, 2018 12:34 UTC (Thu) by rleigh (guest, #14622) [Link]

Example for ZFS:

    % zfs get all rpool/ROOT/default
    NAME                PROPERTY              VALUE                  SOURCE
    rpool/ROOT/default  type                  filesystem             -
    rpool/ROOT/default  creation              Sun Jun 12 10:46 2016  -
    …
    rpool/ROOT/default  utf8only              on                     -
    rpool/ROOT/default  normalization         formD                  -
    rpool/ROOT/default  casesensitivity       sensitive              -
    …

Case sensitivity is selectable. You can force it to only store valid UTF-8, and you can choose whether the UTF-8 is normalised or not. Or you can allow it to store anything. This gives you full backward compatibility with historical UNIX behaviour should you want it, or you can restrict it case sensitive normalised UTF-8. Having this selectable on a per-filesystem basis gives you a great deal of flexibility, and it defaults to something sensible (the above are the defaults).

Filesystems and case-insensitivity

Posted Nov 29, 2018 12:21 UTC (Thu) by Sesse (subscriber, #53779) [Link] (3 responses)

So, one thing is encoding, but what about collation? If you want correct Unicode case handling, you absolutely need to know which locale you're in. The common example: In English, i and I are the same letter with different case. In Turkish, they are emphatically not (the lowercase of I is ı, the uppercase of i is İ, and ı and i are as different letters as v and w are in English).

The only way I know of to deal with these kinds of issues is to specify a collation when creating the filesystem. Windows (and many other things) does this based on installation language, which causes all kinds of funky issues on large installations where you could have multiple users with different languages.
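
A sketch of why equality depends on the locale. Python's `str.casefold()` is locale-independent, so the Turkish rules have to be applied by hand; `fold_name`, `same_name`, and the `turkic` flag are hypothetical names for illustration:

```python
def fold_name(name: str, turkic: bool = False) -> str:
    """Fold case for comparison; `turkic` applies the Turkish I rules first."""
    if turkic:
        # In Turkish, I lowercases to dotless ı and İ lowercases to i.
        name = name.replace("I", "ı").replace("İ", "i")
    return name.casefold()

def same_name(a: str, b: str, turkic: bool = False) -> bool:
    return fold_name(a, turkic) == fold_name(b, turkic)

# Under default folding FILE.TXT and file.txt are one name...
print(same_name("FILE.TXT", "file.txt"))
# ...under Turkish rules they are two different names.
print(same_name("FILE.TXT", "file.txt", turkic=True))
print(same_name("FILE.TXT", "fıle.txt", turkic=True))
```

Whichever rule set is chosen has to be fixed for the directory or filesystem, otherwise two users could disagree about whether a file exists.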

Filesystems and case-insensitivity

Posted Nov 29, 2018 14:23 UTC (Thu) by willy (subscriber, #9762) [Link] (2 responses)

Collation is handled in userspace. There's no guarantee what order getdents() will return filenames in.

Filesystems and case-insensitivity

Posted Nov 29, 2018 14:26 UTC (Thu) by Sesse (subscriber, #53779) [Link] (1 responses)

There are two parts to collation; ordering and equality. (If you have the former, you also have the latter.) I'm fine with ordering not being handled by the kernel, but equality needs to be. And if you want case-insensitivity, equality is locale-dependent.

Filesystems and case-insensitivity

Posted Nov 30, 2018 17:55 UTC (Fri) by k8to (guest, #15413) [Link]

Indeed, the existence of the file accessed by a given byte sequence will vary depending on locale. This is true for many situations, not just the rather clear Turkish one.

This leads to a problem where a user or process in one locale should get different results from the kernel than another. This traditionally was viewed as a rathole and I've seen many situations where osx behaves in bizarre ways due to this sort of thing.

The proposal here seems to be to push the rules into the filesystem or directory, which effectively means having locale behavior independent of the user / process, which means we will get a fun matrix of file name locale vs user locale. I'm not a fan.

Filesystems and case-insensitivity

Posted Nov 29, 2018 17:28 UTC (Thu) by cesarb (subscriber, #6266) [Link]

> There may be some user-space breakage due to normalization or case folding of file names that will need to be handled as well.

I wonder how many of these will be security vulnerabilities. We already had one in git not too long ago (CVE-2014-9390 git: arbitrary command execution vulnerability on case-insensitive file systems).

Filesystems and case-insensitivity

Posted Nov 30, 2018 17:38 UTC (Fri) by ScottMinster (subscriber, #67541) [Link] (5 responses)

What is the real use case of a case insensitive file system? I understand the interoperability use case of working with case insensitive file systems or Samba clients. And if the answer is really just "other systems do it and they can't change because it would break too much", that makes sense.

But leaving that aside for the moment, what is the gain? I've been working on case sensitive Linux file systems for many years, and never felt the need to have "level1.MAP" really load "level1.map". I've occasionally had to deal with writing software that expects extensions in lower case and been given files with those extensions in upper case, and that is annoying, but it's more due to laziness on the file creator's part of not using the standard extension case.

The only two justifications I can come up with for wanting case insensitivity is to avoid problems with unexpectedly cased files and to avoid user confusion (i.e., "Document1.doc" and "document1.doc" being different files). Those are worthy goals, but it seems like there are so many tricky problems with case insensitivity that it is hardly worth the trouble. It's relatively easy in English, but as many people point out, there are problems in many other languages.

But even in English, it could cause problems. Can you "mv makefile Makefile" in a case insensitive filesystem, or would you get the error "'makefile' and 'Makefile' are the same file"? Also, as another poster pointed out, globbing in various programs would likely have to change to be consistent.

And once you enable it, you can never really disable it without inevitably breaking programs.

So why do systems like Windows and MacOS do it? Did they underestimate the difficulty and are stuck with that decision?

I could see how it could be useful for things like Samba servers, but given all the complicated edge cases, it doesn't seem like it's a good idea for general use. Though once it's working for Samba, some distribution will likely turn it on everywhere, to try to be more user friendly.

Filesystems and case-insensitivity

Posted Nov 30, 2018 18:36 UTC (Fri) by mpr22 (subscriber, #60784) [Link]

Windows has to be case-insensitive because older versions of Windows were case-insensitive because they were compatible with MS-DOS and MS-DOS was case-insensitive.

And MS-DOS was written with the (probably unconscious) assumption that natural-language text was in ASCII, where case-insensitivity is cheap to implement.

Filesystems and case-insensitivity

Posted Dec 7, 2018 11:09 UTC (Fri) by Wol (subscriber, #4433) [Link] (3 responses)

> But leaving that aside for the moment, what is the gain? I've been working on case sensitive Linux file systems for many years, and never felt the need to have "level1.MAP" really load "level1.map".

Because you're a programmer used to thinking in byte strings.

I *still* have problems because users insist on knowing whether email addresses have capital letters or not (they are case-insensitive, for historical reasons, because a lot of the early systems mangled case).

So the short answer is, YOU may not feel the gain, but a lot of other people WILL.

(Along the same lines, I remember being sent a second copy of some newsletter because "some people said they couldn't read the attachment". Ie pretty much all Windows systems, because the sender had somehow lost the extension and those systems didn't recognise the file "newsletter" as a pdf. Of course, my gentoo system didn't give a monkeys :-)

Cheers,
Wol

Filesystems and case-insensitivity

Posted Dec 7, 2018 11:45 UTC (Fri) by gioele (subscriber, #61675) [Link] (2 responses)

> I *still* have problems because users insist on knowing whether email addresses have capital letters or not (they are case-insensitive, for historical reasons, because a lot of the early systems mangled case).

These users are right: the local-part of an email address _is_ case sensitive. Only the domain is case insensitive.

RFC 5321, section 2.4 [1]:

> The local-part of a mailbox MUST BE treated as case sensitive. Therefore, SMTP implementations MUST take care to preserve the case of mailbox local-parts. In particular, for some hosts, the user "smith" is different from the user "Smith". However, exploiting the case sensitivity of mailbox local-parts impedes interoperability and is discouraged. Mailbox domains follow normal DNS rules and are hence not case sensitive.

[1] https://tools.ietf.org/html/rfc5321#section-2.4
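
Read literally, the quoted rule gives a comparison along these lines. This is only a sketch: `same_mailbox` is a hypothetical helper, it ignores quoted local-parts and internationalized domains, and a receiving server remains free to be more lenient about its own mailboxes:

```python
def same_mailbox(a: str, b: str) -> bool:
    """Compare two addresses: local-part exactly, domain case-insensitively."""
    local_a, _, domain_a = a.rpartition("@")
    local_b, _, domain_b = b.rpartition("@")
    return local_a == local_b and domain_a.lower() == domain_b.lower()

print(same_mailbox("smith@Example.COM", "smith@example.com"))
print(same_mailbox("Smith@example.com", "smith@example.com"))
```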

Filesystems and case-insensitivity

Posted Dec 7, 2018 16:31 UTC (Fri) by smurf (subscriber, #17840) [Link] (1 responses)

Its case must be preserved in transit but terminal MTAs are free to treat them as case insensitive. Most do, these days.

Filesystems and case-insensitivity

Posted Dec 8, 2018 23:37 UTC (Sat) by zlynx (guest, #2285) [Link]

Fedora's default Exim configuration did not make it insensitive. I had to add that myself because my server was rejecting a lot of mail, mostly from Gmail users randomly capitalizing things.

> lowercase_local:
> driver = redirect
> data = ${lc:${local_part}}

Filesystems^H directories and case-insensitivity

Posted Dec 2, 2018 8:00 UTC (Sun) by marcH (subscriber, #57642) [Link]

If you thought case-insensitivity was ill-conceived...

As a coincidence I recently got hurt by this NTFS "feature": *per-directory* case-sensitivity
https://github.com/vector-of-bool/vscode-cmake-tools/issu...

In other words: filenames are case sensitive in some directories but not in other directories next or below them on the very same filesystem.

Filesystems and case-insensitivity

Posted Dec 10, 2018 15:04 UTC (Mon) by davez (guest, #104707) [Link] (1 responses)

Anybody wanting to add case-insensitivity to a filesystem should read this blog post:

http://drewthaler.blogspot.com/2007/12/case-against-insen...

Also, consider the fact that Apple's iOS filesystem is case sensitive, *not* case insensitive.

Filesystems and case-insensitivity

Posted Dec 10, 2018 16:07 UTC (Mon) by corbet (editor, #1) [Link]

For the curious, Linus has also sounded off on the idea; he is not impressed either.