> There have been a number of complaints on this thread about filesystems
> that are encoding-aware and the problems that causes. But actually the
> filesystem could carry encoding hints without being encoding-aware itself.
> For example, it could tell user space that a file name is UTF-8 but still
> just treat the name as a binary blob. The hint would just tell
> applications how best to display the name.
Well, you could use an extended attribute to represent the encoding of the filename. However, it would be a huge amount of work to change all the applications to check this attribute and act appropriately.
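To make that concrete, here is a rough sketch in Python of what every application would have to do before showing a name. The attribute name "user.filename_encoding" and the display_name helper are made up for illustration; as far as I know no such convention exists. os.getxattr is the real Linux-only call.

```python
import os

# Hypothetical attribute name: an assumption for this sketch, not an existing convention.
ENCODING_XATTR = "user.filename_encoding"

def display_name(path: bytes, default: str = "utf-8") -> str:
    """Decode the last component of `path` for display, using the hint if one is set."""
    try:
        encoding = os.getxattr(path, ENCODING_XATTR).decode("ascii")
    except (OSError, AttributeError, UnicodeDecodeError):
        # No hint, no such file, xattrs unsupported, or os.getxattr missing (non-Linux).
        encoding = default
    name = os.path.basename(path)
    try:
        return name.decode(encoding)
    except (LookupError, ValueError):
        # Unknown codec, or a hint that does not match the bytes: show replacement characters.
        return name.decode("utf-8", errors="replace")
```

Note that the kernel never looks at the hint here. The name stays a binary blob, and all of the work lands on whatever program prints it.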
I'm pretty far from being an expert in internationalization, but my understanding is that non-Unicode character encodings are considered deprecated. Based on comments made elsewhere in this thread, MacOS and Windows have already decreed that all filenames should be Unicode. So is it really worth rewriting all software that displays filenames in order to better support this legacy stuff? Especially when no other platforms support it at all? As Linus constantly points out, Linux-specific filesystem interfaces don't get used that much, even when they offer great benefits.
I think I agree with Spudd86's solution: there should be some kind of mount option that puts a ruleset in place for filenames. Probably nearly every Linux distribution would disallow filenames that were not UTF-8. A few people running special-purpose systems might mount their rootfs with more restrictive rulesets. Most system administrators already have an unwritten policy about filenames: they don't create filenames with embedded control characters, crazy stuff like leading dashes, or embedded newlines. Letting system administrators turn their implicit policy into an explicit one would close a lot of security holes.
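A ruleset like that is easy to write down. Here is a minimal userspace sketch in Python of the three rules mentioned above (valid UTF-8, no control characters, no leading dash). The function name and the exact rule set are my own assumptions; a real mount option would enforce this in the kernel at file creation time and would presumably be configurable.

```python
def violations(name: bytes) -> list[str]:
    """Return the rules that a single filename component breaks (empty list if none)."""
    problems = []
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("not valid UTF-8")
    # Bytes below 0x20 and 0x7F (DEL) are ASCII control characters; newline is one of them.
    if any(b < 0x20 or b == 0x7F for b in name):
        problems.append("contains a control character")
    if name.startswith(b"-"):
        problems.append("starts with a dash")
    return problems
```

For example, violations(b"report.txt") is empty, while violations(b"-rf") reports the leading dash.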
I wonder if it would be feasible to use the "escaping" option talked about on Wheeler's page. Basically, under this option, the kernel continues to treat filenames as binary blobs on the disk. But when presenting them to userspace, it escapes certain characters in a predictable way. I'm not sure whether this is really feasible, but it seems like the best choice if it is.
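To show what I mean by "predictable", here is a small Python sketch of one possible escaping. The backslash-hex scheme is something I made up for illustration, not necessarily what Wheeler's page proposes, and in the real thing this would happen in the kernel rather than in userspace.

```python
def escape_name(raw: bytes) -> str:
    """Present a raw on-disk name as clean text, escaping troublesome bytes as \\xNN."""
    # surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # A byte that was not valid UTF-8: recover its value and escape it.
            out.append(f"\\x{cp - 0xDC00:02x}")
        elif cp < 0x20 or cp == 0x7F:
            # ASCII control character, including newline and tab.
            out.append(f"\\x{cp:02x}")
        elif ch == "\\":
            # Double the escape character so the mapping stays unambiguous.
            out.append("\\\\")
        else:
            out.append(ch)
    return "".join(out)
```

The on-disk bytes are never changed. Only the view handed to userspace is, so a name with an embedded newline would show up as a single line containing \x0a.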