
Wheeler: Fixing Unix/Linux/POSIX Filenames


Posted Mar 25, 2009 22:36 UTC (Wed) by nix (subscriber, #2304)
In reply to: Wheeler: Fixing Unix/Linux/POSIX Filenames by foom
Parent article: Wheeler: Fixing Unix/Linux/POSIX Filenames

You do know that case conversion rules are necessarily locale-dependent,
right? Case-insensitivity in filesystems is thus an astoundingly awful
idea.



Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 2:25 UTC (Thu) by njs (subscriber, #40338) [Link] (7 responses)

Too true. OS X solves this by defining a bespoke case normalization rule, the "HFS-Plus case-insensitive string comparison algorithm". They have their own big table and everything, you can download it at the first URL I gave. Awesome, huh?

I'd be happy if we could just make a rule that filenames are valid UTF-8. Unicode normalization (composing characters and all that) is probably a good idea, but reasonable people could disagree. I'm just as happy without case normalization (though the arguments for it aren't entirely without merit, even if it can't be done perfectly). And *any* of these would be better than what we have now...
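To make the "valid UTF-8" rule and the normalization question concrete, here is a minimal Python sketch. The helper names `is_valid_utf8_name` and `nfc_name` are made up for illustration; nothing here is an existing kernel or libc interface, and NFC is just one possible choice of normalization form.

```python
import unicodedata


def is_valid_utf8_name(raw: bytes) -> bool:
    """Return True if the raw filename bytes decode as strict UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def nfc_name(raw: bytes) -> bytes:
    """Hypothetical normalizer: re-encode a UTF-8 filename in NFC form."""
    return unicodedata.normalize("NFC", raw.decode("utf-8")).encode("utf-8")


# The same visible name, "cafe" with an acute accent, stored three ways.
latin1_name = "caf\u00e9".encode("latin-1")   # legacy 8-bit encoding
utf8_name = "caf\u00e9".encode("utf-8")       # precomposed e-acute
decomposed = "cafe\u0301".encode("utf-8")     # e + combining acute accent

print(is_valid_utf8_name(latin1_name))        # rejected by the rule
print(is_valid_utf8_name(utf8_name))          # accepted
print(decomposed == utf8_name)                # distinct byte strings today
print(nfc_name(decomposed) == utf8_name)      # identical after NFC
```

The last two lines are the part reasonable people could disagree about: without normalization, two names that render identically remain different files.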

(The "so what did you call this file?" API is also useful if your system ever deals with case-insensitive or unicode-normalizing filesystems. Which Linux does, whether it becomes common for the root filesystem or not.)

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 13:42 UTC (Thu) by clugstj (subscriber, #4020) [Link] (6 responses)

"And *any* of these would be better than what we have now"

Why? How is the current condition so bad that we should run headlong into any of these "solutions" without knowing what the eventual outcome will be?

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 17:46 UTC (Thu) by quotemstr (subscriber, #45331) [Link]

We're talking about the possible outcomes. You're telling us we shouldn't discuss the possible problems and solutions because we don't know the problems yet? That's bunk.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 18:59 UTC (Thu) by njs (subscriber, #40338) [Link] (4 responses)

...I started a thread by pointing to technical documentation on how this has worked for the last 8 years in the world's most widely-deployed Unix, and your response is that this is crazy to even consider because we have no way to know what will happen? C'mon... engage the actual arguments. I don't think it's obvious what the technically best solution is, but just because you haven't thought about the relevant details doesn't mean they aren't knowable.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 29, 2009 21:27 UTC (Sun) by clugstj (subscriber, #4020) [Link] (1 responses)

I'm sorry, but when you said that any of these propositions is better than the current situation, I HAD to disagree. In what way is the current situation so bad that any proposal is better than the current situation?

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 30, 2009 0:07 UTC (Mon) by njs (subscriber, #40338) [Link]

You cannot, in general, convert a filename to text. That's the fundamental problem that any of the proposals would solve.
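A short Python sketch of what "cannot convert a filename to text" looks like in practice. The byte string below is an invented example name, not one taken from the thread; the `surrogateescape` handler is the mechanism Python itself uses to carry undecodable bytes around.

```python
# A filename as the kernel sees it: arbitrary bytes, here Latin-1 "café.txt".
raw = b"caf\xe9.txt"

# Strict decoding fails, so there is no text to show the user.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lossy decoding gives something displayable, but the original bytes are gone.
shown = raw.decode("utf-8", errors="replace")
lossy_roundtrip = shown.encode("utf-8")

# surrogateescape round-trips the bytes, but the result is not real text:
# it contains a lone surrogate that cannot be encoded as strict UTF-8.
escaped = raw.decode("utf-8", errors="surrogateescape")
exact_roundtrip = escaped.encode("utf-8", errors="surrogateescape")
try:
    escaped.encode("utf-8")
    escaped_is_text = True
except UnicodeEncodeError:
    escaped_is_text = False

print(strict_ok, lossy_roundtrip == raw, exact_roundtrip == raw, escaped_is_text)
```

Either you lose the ability to name the file again, or you carry around a string that is not valid text. A rule that filenames must be valid UTF-8 would remove that choice.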

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 29, 2009 21:30 UTC (Sun) by clugstj (subscriber, #4020) [Link] (1 responses)

OS X is trivial to handle. It only has to continue to work in a compatible way with the previous Mac OS - which wasn't UNIX. So using it as an example of how to "fix" these problems is not a good idea if you care about supporting 40+ years of UNIX programs - which is why this is difficult to change.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 29, 2009 22:07 UTC (Sun) by foom (subscriber, #14868) [Link]

Eh...but OSX *does* run 40+ years of UNIX programs. It's pretty clear that the change to require
UTF-8 (and even the change to be case insensitive!) didn't bother most programs.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 14:23 UTC (Thu) by mjthayer (guest, #39183) [Link] (2 responses)

The case I always hear made for the difficulty of case-insensitivity is French lower-case accents and the Turkish i. Are these really such an issue? If we are already treating a as being identical to A, can't we just treat É as being identical to E, and i as being identical to ı? In French it would not really cause problems (although the pedants would complain :) ), though I don't know how many words Turkish has that only differ by a dot over the i...
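For reference, a small Python sketch of how the Turkish i behaves under the locale-independent Unicode default mappings that `str.lower()` and `str.upper()` apply. The `lower_tr` helper is a hypothetical illustration of Turkish-style lowercasing, not a standard library function.

```python
def lower_tr(s: str) -> str:
    """Hypothetical Turkish-style lowercasing: I -> dotless i, dotted I -> i."""
    return s.replace("I", "\u0131").replace("\u0130", "i").lower()


word = "ISIK"

default_lower = word.lower()      # default rules: I -> i
turkish_lower = lower_tr(word)    # Turkish rules: I -> dotless i (U+0131)

# Under the default rules both i and dotless i uppercase to the same letter,
# so a case-insensitive comparison built on upper() already merges them.
merged = "i".upper() == "\u0131".upper()

print(default_lower, turkish_lower, merged)
```

So a filesystem that folds case with the default tables already treats i and ı as the same letter in one direction, while a Turkish user would expect the two lowercase results above to be different names.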

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 28, 2009 1:00 UTC (Sat) by nix (subscriber, #2304) [Link] (1 responses)

German ß is problematic too. Whether 'SS' turns into ß or not on
downcasing is *context-dependent* and to a certain extent a matter of
controversy and thus taste (this wasn't always true, but successive waves
of largely-failed spelling reforms have introduced a nice steaming heap of
uncertainty into this part of the written language).
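A quick Python illustration of the asymmetry, using the Unicode default case mappings as implemented by `str.upper()`, `str.lower()` and `str.casefold()`. The example words are chosen for illustration only.

```python
# Upcasing expands sharp s to two letters...
upper = "Ma\u00dfe".upper()        # "Maße" -> "MASSE"

# ...and downcasing cannot know whether SS came from ß or from ss.
lower_again = upper.lower()        # "masse", never "maße"

# Case folding sidesteps the question by mapping ß to ss for comparison,
# which makes "Maße" and "Masse" indistinguishable.
same_when_folded = "Ma\u00dfe".casefold() == "Masse".casefold()

print(upper, lower_again, same_when_folded)
```

Case folding is only a comparison key; it does not answer the context-dependent question of which spelling to produce when downcasing.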

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Apr 2, 2009 15:54 UTC (Thu) by forthy (guest, #1525) [Link]

It is actually not that bad. As collating sequence, ß=ss (i.e. Mass and Maß sort to the same bin). Except for Austrian telephone books, where ß follows ss, but comes before st (though St. follows Sankt ;-).

However, there's a huge mess in the CJK part of UCS: short and long forms of the same character (sometimes even a special variant for the Japanese character). This should never have happened; the different forms of the same character should be encoded in fonts, not in UCS. So far, not even Mac OS X normalizes these characters, but it is obvious that a mainland China file called "中国" and a Taiwan file called "中國" not only mean the same, but they also refer to the same word, and can be interchanged at will (see for example the Chinese Wikipedia entry: the lemma is the short form, the headline is the long form). And it is not easy to access long and short forms with usual input methods (mainland China: Pinyin, Canton: Cantonese Pinyin (gives traditional characters, but you need to know Cantonese), etc.).
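A small Python check of the point that normalization does not merge these forms: none of the four standard Unicode normalization forms maps the traditional character to the simplified one. The variable names are invented for the example.

```python
import unicodedata

simplified = "\u4e2d\u56fd"     # 中国
traditional = "\u4e2d\u570b"    # 中國

# Collect the normalization forms under which the two names become equal.
unified_by = [
    form
    for form in ("NFC", "NFD", "NFKC", "NFKD")
    if unicodedata.normalize(form, simplified)
    == unicodedata.normalize(form, traditional)
]

print(unified_by)  # empty: the two names stay distinct
```

Folding the two together would need a separate simplified/traditional mapping table, which is outside what Unicode normalization defines.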

