Filesystems and case-insensitivity

Posted Nov 28, 2018 14:59 UTC (Wed) by gioele (subscriber, #61675)
Parent article: Filesystems and case-insensitivity

> Beyond that, "case" is really only defined in terms of an encoding

> Supporting case-insensitive file names requires the encoding-awareness changes in order to define what case folding means for a given character.

"Case" is properly defined only in terms of locale, not of encoding. Knowing the encoding (say, UTF-8+NFD vs UTF-16+NFKC) is necessary, but not sufficient. The user locale is needed as well.
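To illustrate why the encoding and normalization form have to be known before any comparison: the same visible name can be stored as different byte sequences. A minimal sketch using Python's standard `unicodedata` module; the sample name is just an illustration.

```python
import unicodedata

# "café" written with a precomposed é (U+00E9)
name = "caf\u00e9"

nfc = unicodedata.normalize("NFC", name)  # é stays one code point
nfd = unicodedata.normalize("NFD", name)  # é becomes "e" + U+0301 (combining acute)

# Same text to a human reader, different bytes on disk
nfc_bytes = nfc.encode("utf-8")
nfd_bytes = nfd.encode("utf-8")

# Normalizing both sides to one form makes them comparable again
same_after_normalizing = unicodedata.normalize("NFC", nfd) == nfc
```

A byte-wise lookup treats these as two different names, and that is before case enters the picture at all.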

In English, "istanBUL" matches "Istanbul" case-insensitively; in Turkish it does not. (In Turkish, the uppercase version of "i" is "İ", and the lowercase version of "I" is "ı".)
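A small sketch of the difference, in Python. `str.casefold()` is the locale-independent Unicode folding; `turkish_lower` is a hypothetical helper written for this example (Python's string methods are not locale-aware), and it only covers the two Turkish-specific rules for "I" and "İ".

```python
DOTTED_CAPITAL_I = "\u0130"  # İ
DOTLESS_SMALL_I = "\u0131"   # ı

def turkish_lower(text: str) -> str:
    """Hypothetical helper: lowercase using the Turkish rules for I and İ."""
    # Apply the two Turkish-specific mappings first, then the default rules.
    text = text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I)
    return text.lower()

def matches_default(a: str, b: str) -> bool:
    # Locale-independent comparison via Unicode case folding
    return a.casefold() == b.casefold()

def matches_turkish(a: str, b: str) -> bool:
    # Comparison under the (assumed) Turkish lowercase rules above
    return turkish_lower(a) == turkish_lower(b)

english_result = matches_default("istanBUL", "Istanbul")   # True
turkish_result = matches_turkish("istanBUL", "Istanbul")   # False
```

Under the Turkish rules, "Istanbul" lowercases to "ıstanbul", so the two names no longer compare equal.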

What the developers could do is a kind of case-insensitive look-up that also clusters together "similar" letters. Defining which characters are similar opens, however, another can of worms (see `confusables.txt` from Unicode or all the discussions around IDNA and its Nameprep algorithm).

Maybe we should come up with another technical name for these locale-independent, imprecise implementations of case insensitivity?

Filesystems and case-insensitivity

Posted Nov 28, 2018 16:03 UTC (Wed) by anselm (subscriber, #2796)

Here's an interesting post by James Bennett (Django core developer) on the topic of “case”: “Truths programmers should know about case”

Filesystems and case-insensitivity

Posted Dec 2, 2018 16:42 UTC (Sun) by epa (subscriber, #39769)

Is there any reason not to treat i, İ, I, and ı the same for case-folding purposes on the file system?

I am not asking whether they are the same in all uses. I know that in Turkish i and ı are different letters. What I'm suggesting is that for making a case-insensitive filesystem lookup -- where you have already waved goodbye to a strict 1-1 mapping between byte sequences and directory entries -- it surely doesn't matter that much to gloss over the distinction and treat all these four characters the same. Similarly I would consider it a feature, not a bug, if accented characters could be preserved in filenames, but ignored when matching. There are pairs of words in German that differ only in accent, but it's very unlikely an accent would be the only difference between two human-written document names.
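As a rough sketch of that kind of loose matching (in user space, in Python, and not a claim about what any filesystem or Samba actually does): derive a lookup key that drops accents, folds case, and maps all four i-like letters to "i", while the stored name keeps its original spelling. `loose_key` is a hypothetical name chosen for this example.

```python
import unicodedata

DOTLESS_SMALL_I = "\u0131"  # ı

def loose_key(name: str) -> str:
    """Hypothetical lookup key: accent-insensitive, case-insensitive, i-variants merged."""
    # Decompose so accents become separate combining marks (İ becomes I + U+0307)
    decomposed = unicodedata.normalize("NFD", name)
    # Drop the combining marks, keeping only base characters
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Locale-independent case folding, then merge dotless ı into i
    return stripped.casefold().replace(DOTLESS_SMALL_I, "i")

# All four i-like letters collapse to the same key
i_variants = {loose_key(ch) for ch in ["i", "\u0130", "I", "\u0131"]}

# Accents are ignored when matching, but the original name is untouched
accent_match = loose_key("R\u00e9sum\u00e9.txt") == loose_key("resume.TXT")
```

The cost of this choice is visible with the German case mentioned above: "schon" and "schön" produce the same key, so they could not coexist in one directory.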

Now, you may with some justice argue that loose matching like this belongs in user space, not the kernel. But in the end it's not my preferences or anyone else's that matter. What matters is to efficiently implement the existing (de facto or de jure) standards. What behaviour is Samba required to support with the Turkish uppercase and lowercase letters? The kernel should provide the semantics that Samba needs so it doesn't have to laboriously scan the whole directory to match a filename.

Filesystems and case-insensitivity

Posted Dec 2, 2018 17:19 UTC (Sun) by gioele (subscriber, #61675)

> Is there any reason not to treat i, İ, I, and ı the same for case-folding purposes on the file system?

Sure, they could be treated the same. But doing so is hard (and computationally expensive).

This is what I meant by:

> What the developers could do is a kind of case-insensitive look-up that also clusters together "similar" letters. Defining which characters are similar opens, however, another can of worms (see `confusables.txt` from Unicode or all the discussions around IDNA and its Nameprep algorithm).