
Case-insensitive filesystem lookups

Posted May 23, 2018 18:09 UTC (Wed) by saffroy (guest, #43999)
In reply to: Case-insensitive filesystem lookups by EdwardConnolly
Parent article: Case-insensitive filesystem lookups

Different environments have different expectations regarding case sensitivity for names of files and directories on a filesystem. I wasn't aware of the situation with Android mentioned in the article, but I do know about the issue with Linux-based SMB servers.

The most commonly used network filesystem in the Windows and Mac worlds is now SMB (Apple deprecated its own AFP file service a few years ago), and these environments (applications, tools) expect file names to be case insensitive, that is: in any given directory, "foo" and "FOO" and "Foo" must resolve to the same file (if they resolve to any file at all). Another expectation is that the filesystem be case preserving: if the application creates file "Foo", it is expected to show up as "Foo" (and not "foo" or "FOO") in a directory listing.

But of course, in Unix tradition, a Linux filesystem is case sensitive.

To meet these expectations, a Linux-based SMB server (Samba) running on a case-sensitive Linux filesystem needs to do extra work, which can badly impact performance. For example, when creating "Foo", Samba needs to check if *any* variant of "[fF][oO][oO]" already exists; typically, this is done by scanning the entire directory, which becomes horribly expensive when that directory is large (think thousands of entries). And some SMB clients (such as the macOS Finder) make things worse with their specific quirks (AppleDouble files come to mind).
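To make the cost concrete, here is a minimal Python sketch of that kind of lookup. It is not Samba's code; the function name `find_case_insensitive` is made up for illustration, and it uses Python's built-in case folding rather than whatever rules a real server applies.

```python
import os

def find_case_insensitive(directory, name):
    """Return the on-disk names in `directory` that match `name` ignoring case."""
    wanted = name.casefold()
    with os.scandir(directory) as entries:
        # Every entry has to be examined, so the cost grows linearly
        # with the number of entries in the directory.
        return [e.name for e in entries if e.name.casefold() == wanted]
```

Before creating "Foo", a server in this position would have to run such a scan and treat any hit as an existing file, which is why a directory with many entries hurts so much.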

If the underlying filesystem were case insensitive (and case preserving), Samba would have much less work to do in many cases. And actually, some filesystem vendors (e.g. IBM for GPFS) provide Samba modules that reduce the impact when possible: the filesystem can be made case-insensitive (entirely, or per directory), and/or its Samba module can provide an optimized case-insensitive lookup method that bypasses the VFS.



Case-insensitive filesystem lookups

Posted May 23, 2018 21:15 UTC (Wed) by Sesse (subscriber, #53779) [Link] (11 responses)

Even more difficult, the notion of case depends on the locale.

E.g.: In English, i and I are the same letter with different case. In Turkish, they are entirely different letters, and thus should not compare equal.

Case-insensitive filesystem lookups

Posted May 23, 2018 21:27 UTC (Wed) by saffroy (guest, #43999) [Link] (1 responses)

Indeed, locale matters, and this means the definition of case also depends on the character encoding.

Unicode case folding does simplify things, but AFAIK the Turkish "i" still requires some special care (see this page for a detailed explanation).
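A small Python sketch of why the Turkish letters need special care. It relies only on `str.casefold()`, which applies Unicode's locale-independent full case folding; the helper name `same_under_default_folding` is made up for illustration.

```python
def same_under_default_folding(a, b):
    """Compare two strings using Unicode's locale-independent case folding."""
    return a.casefold() == b.casefold()

# U+0131 is dotless "ı", U+0130 is dotted capital "İ".
print(same_under_default_folding("I", "i"))        # True: the English pairing
print(same_under_default_folding("I", "\u0131"))   # False: the Turkish pairing I/ı is not applied
print(same_under_default_folding("\u0130", "i"))   # False: İ folds to "i" plus a combining dot
```

So default folding gives the answer an English speaker expects, and the pairings a Turkish speaker expects would need locale-specific (Turkic) folding rules on top.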

Case-insensitive filesystem lookups

Posted May 25, 2018 18:58 UTC (Fri) by wahern (subscriber, #37304) [Link]

There's really only a single choice for doing case-insensitive names: rely on Unicode case-folding rules. It's the only choice because it's the only one, AFAIK, with an explicit guarantee regarding forward stability:

For each string S containing characters only from a given Unicode version, toCasefold(toNFKC(S)) under that version is identical to toCasefold(toNFKC(S)) under any later version of Unicode.

Backward compatibility (mounting a volume created on a system supporting Unicode X on a system that only supports Unicode X-1) is problematic, but I don't think that's entirely avoidable.
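As a rough illustration of the toCasefold(toNFKC(S)) expression in the quoted guarantee, the same key can be computed in Python with the standard `unicodedata` module. The function name `name_key` is made up, and the result depends on the Unicode version bundled with the interpreter (`unicodedata.unidata_version`), which is exactly the versioning concern raised above.

```python
import unicodedata

def name_key(name):
    """Comparison key: NFKC-normalize the name, then apply full case folding."""
    return unicodedata.normalize("NFKC", name).casefold()

# Names whose keys are equal would be treated as the same file.
print(name_key("Foo") == name_key("FOO"))          # True
print(name_key("Stra\u00dfe"))                     # strasse  (ß folds to "ss")
print(name_key("\ufb01le") == name_key("FILE"))    # True: the "fi" ligature is expanded by NFKC
```

A filesystem using such a key would presumably still store the name as typed and use the key only for lookups, to stay case preserving.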

Case-insensitive filesystem lookups

Posted May 24, 2018 4:32 UTC (Thu) by eru (subscriber, #2753) [Link]

This natural language issue is a very good argument for case-sensitive file systems, or rather, file systems that simply accept (almost) arbitrary strings as names, and do not try to find equivalences between them other than comparing the bytes. I hope adding case-insensitivity support to Linux does not result in people actually starting to use it for anything other than handling legacy compatibility.

Case-insensitive filesystem lookups

Posted May 24, 2018 6:36 UTC (Thu) by epa (subscriber, #39769) [Link] (7 responses)

This is true, but for the purposes of filenames, does it really matter? Technically "ss" and "ß" (sharp S) are different in German, but they can be interchanged some of the time. Surely a filesystem that treated I-with-dot and I-without-dot the same (while preserving what was input) would be good enough in practice. A spell checker has to be stricter.

Case-insensitive filesystem lookups

Posted May 24, 2018 7:23 UTC (Thu) by Sesse (subscriber, #53779) [Link] (6 responses)

It's good enough in practice for you, but it would not necessarily be for someone else.

Would you be content with a filesystem that said v and w are the same (but preserved which one was used), if someone came to you and proposed that? I mean, in practice it might be good enough…

Case-insensitive filesystem lookups

Posted May 24, 2018 9:32 UTC (Thu) by dgm (subscriber, #49227) [Link]

We can also make "s" and "z" the same for the benefit of all our British friends out there...

Case-insensitive filesystem lookups

Posted May 24, 2018 10:34 UTC (Thu) by epa (subscriber, #39769) [Link] (4 responses)

What I mean is, the more you get into these distinctions, the further away you move from what makes case insensitivity useful to start with. I appreciate the convenience of having a file called Sandia.txt on disk and being able to load it by the name sandia.txt, so I can save the effort of pressing the shift key or remembering exactly what the capitalization was. I would appreciate less getting a file not found error because it was called Sandía.txt and I forgot to include the accent on the letter i. But then, in Turkish the distinction between i and ı is probably a lot stronger than the difference of an accent in Spanish.
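A sketch of the kind of "forgiving" match described here, assuming Python's standard `unicodedata` module: fold case, decompose, and drop combining marks. The name `loose_key` is made up, and this is one possible policy, not something any filesystem is known to implement.

```python
import unicodedata

def loose_key(name):
    """Fold case, then drop combining marks (accents, diereses and the like)."""
    decomposed = unicodedata.normalize("NFD", name.casefold())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(loose_key("Sand\u00eda.txt") == loose_key("sandia.txt"))  # True
print(loose_key("\u0131") == loose_key("i"))                    # False: dotless ı has no decomposition
```

The second line shows the limit of the approach: the Turkish dotless ı still does not match i, and in languages where the marked letter is a distinct letter, a key like this would merge names that their speakers consider different.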

All in all it's a knottier problem than it appears (https://bugzilla.mozilla.org/show_bug.cgi?id=202251 has been going on for 15 years) and I sympathize with the view that these things should be handled in the user interface, not the filesystem. If you have to put locale code in the filesystem itself you've surely taken a wrong turning.

Case-insensitive filesystem lookups

Posted May 24, 2018 17:45 UTC (Thu) by excors (subscriber, #95769) [Link] (1 responses)

> I sympathize with the view that these things should be handled in the user interface, not the filesystem. If you have to put locale code in the filesystem itself you've surely taken a wrong turning.

In many cases, I think filenames really are a UI concept that is being used directly as a core part of the filesystem (the disk format plus the associated APIs and protocols like SMB), which feels like a serious layering violation. When a user saves a document, they give it a human-readable name so they can find it later in a list of all their saved documents. They don't care if it's stored with that name as its filename, or if it's stored as "cff5f247-64bd-4066-ab2f-66ff8aed2322.doc" and the name is in some metadata, or if it's stored in a special database and not as a separate file at all - the UI could be the same for all of those. But since we choose to implement it with human-readable filenames, the UI is complicated by filesystem restrictions (why can't the user put "/" in a document name?), and the filesystems(/APIs/protocols/etc) are complicated by UI issues (Unicode, case sensitivity, locale dependence, etc). It seems particularly bad given that Unicode changes over time, and locales differ between users, while filesystems are persistent and shared - there's a fundamental mismatch there.

Surely there must be a better way to design the system, if legacy compatibility didn't matter, where the implementation details of storing and referencing files are more cleanly separated from the UI concept of naming files? (Though of course legacy compatibility does matter more than almost anything else, so this is hypothetical and probably pointless.)

(There are other cases where filenames aren't UI, they're well-known identifiers like "/etc/passwd" or "c:\autoexec.bat" - the name is needed as a portable way for programs to refer to a particular file. But they have very different requirements to user-chosen document names, e.g. ASCII is probably fine, and it's not obvious that the same solution should be used for them.)

Case-insensitive filesystem lookups

Posted May 25, 2018 19:03 UTC (Fri) by drag (guest, #31333) [Link]

> Surely there must be a better way to design the system, if legacy compatibility didn't matter, where the implementation details of storing and referencing files are more cleanly separated from the UI concept of naming files?

Since Unix file names can be (almost) arbitrary strings, just use a hash of the file to store it in the file system. Then you manage names on the application layer by providing a handy dandy API for everybody to use.
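A minimal sketch of that idea in Python, under some loud assumptions: "hash of the file" is taken to mean a SHA-256 digest of the contents, the name index is an in-memory dict (a real one would have to be persisted), and the class name `NamedStore` and its methods are invented for illustration.

```python
import hashlib
import os

class NamedStore:
    """Blobs are stored under their SHA-256 digest; display names live in an index."""

    def __init__(self, root):
        self.root = root
        self.index = {}  # folded display name -> (name as typed, digest)
        os.makedirs(root, exist_ok=True)

    def save(self, display_name, data):
        digest = hashlib.sha256(data).hexdigest()
        with open(os.path.join(self.root, digest), "wb") as f:
            f.write(data)
        # The on-disk name is plain hex; the matching policy is decided here,
        # in the application layer, and could be swapped per user or locale.
        self.index[display_name.casefold()] = (display_name, digest)
        return digest

    def load(self, display_name):
        _, digest = self.index[display_name.casefold()]
        with open(os.path.join(self.root, digest), "rb") as f:
            return f.read()
```

With this split the filesystem only ever compares bytes, and a display name can even contain "/" because it is never used as a path.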

Because just imagine that instead of one locale you have to make insensitivity work for ALL locales. A lot of Linux file systems house data that is globally sourced using languages and names from dozens, if not hundreds, of different languages.

Good luck making that work on a file-system layer.

I mean: what are you going to do?

The only remote chance of making it work in a case-insensitive manner is to have the locale of each file embedded right there in the file system's metadata, so it can be managed in the way it was intended. And then what are you going to do when an English user from North America edits a file somebody made in Greece? Change the locale? Make the insensitivity work differently, or force the English user to understand the character set used by the person from Greece? How are you going to deal with file names that don't conflict in the original locale, but do after somebody edits them?

So the choice is really:

1. Have a case sensitive file system that always works under all circumstances that is simple, robust, and fast.

2. Have a case insensitive system with massive amounts of extra code and logic that will never actually have a chance of working.

YES; having a case-sensitive file system is a bad UI. But it's impossible to make it actually work otherwise.

Therefore: If you are looking for a very good user interface exposing a Unix-style file system to the user is not a good solution. You have to do something else.

Case-insensitive filesystem lookups

Posted May 26, 2018 15:48 UTC (Sat) by eru (subscriber, #2753) [Link] (1 responses)

> But then, in Turkish the distinction between i and ı is probably a lot stronger than the difference of an accent in Spanish.

I don't know how it is in Turkish, but in my native Finnish you cannot be careless with dieresis on top of "a" or "o". Dropping it can change a word into a different word. For example, "sää" and "saa" are both valid Finnish words with entirely different meanings. Of course, humans usually can figure out words with omitted dots from context, the same way one can mentally correct other kinds of mis-spellings.

By the way, someone jokingly mentioned making "v" and "w" equivalent. Actually the normal alphabetical ordering rules of Finnish specify precisely that. However, Linux "ls", "bash" and so on under the Finnish locale do not obey this particular rule. Probably a good thing, I won't be filing a bug report...

Case-insensitive filesystem lookups

Posted May 27, 2018 9:39 UTC (Sun) by epa (subscriber, #39769) [Link]

Even in English the difference of upper and lower case can change one word to another, from polish to Polish.

