Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 21:29 UTC (Wed) by foom (subscriber, #14868)
Posted Mar 25, 2009 22:36 UTC (Wed) by nix (subscriber, #2304)
Posted Mar 26, 2009 2:25 UTC (Thu) by njs (guest, #40338)
I'd be happy if we could just make a rule that filenames are valid UTF-8. Unicode normalization (composing characters and all that) is probably a good idea, but reasonable people could disagree. I'm just as happy without case normalization (though the arguments for it aren't entirely without merit, even if it can't be done perfectly). And *any* of these would be better than what we have now...
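To make the two rules discussed above concrete, here is a minimal Python sketch of what "filenames are valid UTF-8" and "Unicode normalization" would each check. The helper names `is_valid_utf8` and `normalize_name` are made up for illustration, and choosing NFC as the default form is an assumption; nothing here reflects what any kernel or filesystem actually enforces.

```python
import unicodedata


def is_valid_utf8(raw_name: bytes) -> bool:
    """Return True if the raw filename bytes decode cleanly as UTF-8."""
    try:
        raw_name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def normalize_name(name: str, form: str = "NFC") -> str:
    """Return the name in the given Unicode normalization form (NFC assumed)."""
    return unicodedata.normalize(form, name)


# The same visible name, spelled two ways:
composed = "caf\u00e9"      # e-acute as a single code point
decomposed = "cafe\u0301"   # plain "e" followed by a combining acute accent

# Without normalization these are two distinct filenames.
names_differ = composed != decomposed
# After normalization they collapse into one.
names_match = normalize_name(decomposed) == normalize_name(composed)
```

A Latin-1 encoded name such as `b"caf\xe9"` fails the first check, while the two spellings of "café" pass it and only become equal after the second step. That is the sense in which the validity rule and the normalization rule are separate decisions.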
(The "so what did you call this file?" API is also useful if your system ever deals with case-insensitive or unicode-normalizing filesystems. Which Linux does, whether it becomes common for the root filesystem or not.)
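No such API is spelled out in this thread, so the following is only a user-space approximation of the idea, in Python: given the names a directory actually contains, find the one the filesystem would have matched for a requested name. The functions `fold_key` and `stored_name` are hypothetical, and the folding rule (NFC plus `str.casefold`) is an assumption; real case-insensitive or normalizing filesystems each apply their own rules.

```python
import unicodedata
from typing import Iterable, Optional


def fold_key(name: str) -> str:
    """Comparison key: normalize, casefold, normalize again (assumed rule)."""
    folded = unicodedata.normalize("NFC", name).casefold()
    return unicodedata.normalize("NFC", folded)


def stored_name(entries: Iterable[str], requested: str) -> Optional[str]:
    """Return the entry as actually stored that matches the requested name."""
    wanted = fold_key(requested)
    for entry in entries:
        if fold_key(entry) == wanted:
            return entry
    return None
```

With `os.listdir(".")` as `entries`, a request for `readme.txt` would come back as `README.txt` if that is how the file was created. A real interface would have to ask the filesystem itself, since only it knows which folding it applies.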
Posted Mar 26, 2009 13:42 UTC (Thu) by clugstj (subscriber, #4020)
Why? How is the current condition so bad that we should run headlong into any of these "solutions" without knowing what the eventual outcome will be?
Posted Mar 26, 2009 17:46 UTC (Thu) by quotemstr (subscriber, #45331)
Posted Mar 26, 2009 18:59 UTC (Thu) by njs (guest, #40338)
Posted Mar 29, 2009 21:27 UTC (Sun) by clugstj (subscriber, #4020)
Posted Mar 30, 2009 0:07 UTC (Mon) by njs (guest, #40338)
Posted Mar 29, 2009 21:30 UTC (Sun) by clugstj (subscriber, #4020)
Posted Mar 29, 2009 22:07 UTC (Sun) by foom (subscriber, #14868)
Posted Mar 26, 2009 14:23 UTC (Thu) by michaeljt (subscriber, #39183)
Posted Mar 28, 2009 1:00 UTC (Sat) by nix (subscriber, #2304)
Posted Apr 2, 2009 15:54 UTC (Thu) by forthy (guest, #1525)
It is actually not that bad. As a collating sequence, ß=ss (i.e. Mass
and Maß sort to the same bin). Except for Austrian telephone books, where
ß follows ss, but comes before st (though St. follows Sankt ;-).
However, there's a huge mess in the CJK part of UCS: short and long
forms of the same character (sometimes even a special variant for the
Japanese character). This should never have happened; the different forms
of the same character should be encoded in fonts, not in UCS. So far, not
even Mac OS X normalizes these characters, but it is obvious that a
mainland China file called "中国" and a Taiwan file called "中國" not only
mean the same, but they also refer to the same word, and can be
interchanged at will (see for example the Chinese wikipedia entry: the
lemma is the short form, the headline is the long form). And it is not
easy to access long and short forms with usual input methods (mainland
China: Pinyin, Canton: Cantonese Pinyin (gives traditional characters,
but you need to know Cantonese), etc.).
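A small Python sketch illustrates both halves of this comment. It uses Unicode case folding rather than a real collation algorithm, which is a different mechanism that happens to agree on ß=ss. The helper name `fold` is made up for illustration, and NFKC is chosen here only because it is the most aggressive standard normalization form.

```python
import unicodedata


def fold(name: str) -> str:
    """Compatibility-normalize, then casefold (illustrative rule only)."""
    return unicodedata.normalize("NFKC", name).casefold()


# Case folding maps sharp s to "ss", so these two names land in the same bin.
german_same = fold("Maß") == fold("Mass")

# The simplified and traditional spellings stay distinct: 国 (U+56FD) and
# 國 (U+570B) are separate code points with no decomposition mapping.
simplified = "\u4e2d\u56fd"    # 中国
traditional = "\u4e2d\u570b"   # 中國
cjk_same = fold(simplified) == fold(traditional)
```

Here `german_same` is true and `cjk_same` is false: none of the standard normalization forms unify simplified and traditional characters, so treating the two names as equal would need a separate mapping table.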
Posted Mar 26, 2009 13:40 UTC (Thu) by clugstj (subscriber, #4020)
Posted Mar 26, 2009 19:52 UTC (Thu) by leoc (subscriber, #39773)
Posted Mar 25, 2009 21:39 UTC (Wed) by njs (guest, #40338)
I don't *like* either alternative much, but I doubt you're going to get everyone to switch back to ASCII, either. The problem isn't going away.
So... we can whine about how unfair it is that character systems are complicated and ignore the problem, or we can hold our noses and pick a least-bad option. The latter is probably more productive (though inertia suggests the former is most likely).
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds