> I think that many people would appreciate having at least a hint as to the character encoding in use.
My guess is that Unix systems just take the easiest approach here - treat the filename as a binary blob, and let userspace do the rest :) I have got to admit that in practice there's less hassle with FSes which are Unicode-aware (think Microsoft), unless you actually start trying to figure out just what it is that you are allowed to use there for filenames. Then you'd basically just stick to base64 or percent-encoding, which would be the right thing to do in any case.
Posted Nov 24, 2010 19:15 UTC (Wed) by michaeljt (subscriber, #39183)
> I have got to admit that in practice there's less hassle with FSes which are Unicode-aware (think Microsoft), unless you actually start trying to figure out just what it is that you are allowed to use there for filenames.
There have been a number of complaints on this thread about filesystems that are encoding-aware and the problems that causes. But actually the filesystem could carry encoding hints without being encoding-aware itself. For example, it could tell user space that a file name is UTF-8 but still just treat the name as a binary blob. The hint would just tell applications how best to display the name.
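To make the idea concrete, here is a minimal sketch of what an application might do with such a hint. Where the hint comes from is left open: the `encoding_hint` parameter and the `display_name` helper are hypothetical, not an existing interface. The name itself stays an opaque byte string throughout; the hint only affects how it is shown.

```python
def display_name(raw, encoding_hint=None):
    """Return a printable form of a raw filename.

    `raw` is the filename as an opaque byte string. `encoding_hint` is a
    hypothetical, purely advisory codec name (e.g. "utf-8") supplied by
    whatever mechanism carries the hint.
    """
    if encoding_hint is not None:
        try:
            return raw.decode(encoding_hint)
        except (UnicodeDecodeError, LookupError):
            # The hint was wrong, or names a codec we do not know.
            pass
    # No usable hint: keep ASCII as-is and escape every other byte.
    return raw.decode("ascii", errors="backslashreplace")


print(display_name(b"caf\xc3\xa9", "utf-8"))
print(display_name(b"caf\xe9", "latin-1"))
print(display_name(b"caf\xe9", "utf-8"))  # bad hint, falls back to escaping
```

A wrong or missing hint never makes the file unreachable here, because the bytes are never rewritten; only the display degrades.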
Control characters in file names
Posted Nov 27, 2010 8:11 UTC (Sat) by cmccabe (guest, #60281)
> There have been a number of complaints on this thread about filesystems
> that are encoding-aware and the problems that causes. But actually the
> filesystem could carry encoding hints without being encoding-aware itself.
> For example, it could tell user space that a file name is UTF-8 but still
> just treat the name as a binary blob. The hint would just tell
> applications how best to display the name.
Well, you could use an extended attribute to represent the encoding of the filename. However, it would be a huge amount of work to change all the applications to check this attribute and act appropriately.
I'm pretty far from being an expert in internationalization, but my understanding is that non-Unicode character encodings are considered deprecated. Based on comments made elsewhere in this thread, MacOS and Windows have already decreed that all filenames should be Unicode. So is it really worth rewriting all software that displays filenames in order to better support this legacy stuff? Especially when no other platforms support it at all? As Linus constantly points out, Linux-specific filesystem interfaces don't get used that much, even when they offer great benefits.
I think I agree with Spudd86's solution: there should be some kind of mount option that puts a ruleset in place for filenames. Probably nearly every Linux distribution would disallow filenames that were not UTF-8. A few people running special-purpose systems might mount their rootfs with more restrictive rulesets. Most system administrators already have an unwritten policy about filenames: they don't create filenames with embedded control characters, crazy stuff like leading dashes, or embedded newlines. Letting system administrators turn their implicit policy into an explicit one would close a lot of security holes.
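As a rough illustration of what such a ruleset might check, here is a userspace sketch of the three rules mentioned above. The `violates_policy` function is hypothetical; no such mount option exists that I know of, and a real one would live in the kernel rather than in Python.

```python
def violates_policy(name):
    """Return the reasons a raw filename breaks a (hypothetical) ruleset.

    An empty list means the name is acceptable. `name` is a byte string,
    since that is what the filesystem actually stores.
    """
    problems = []
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("not valid UTF-8")
    # ASCII control characters, which includes newline (0x0A) and DEL.
    # C1 controls encoded in UTF-8 are not checked in this sketch.
    if any(b < 0x20 or b == 0x7F for b in name):
        problems.append("contains a control character")
    if name.startswith(b"-"):
        problems.append("starts with a dash")
    return problems


for candidate in (b"report.txt", b"-rf", b"a\nb", b"caf\xe9"):
    print(candidate, violates_policy(candidate))
```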
I wonder if it would be feasible to use the "escaping" option talked about on Wheeler's page. Basically, under this option, the kernel continues to treat filenames as binary blobs on the disk. But when presenting them to userspace, it escapes certain characters in a predictable way. I'm not sure whether this is really feasible, but it seems like the best choice if it is.
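To show what "escaping in a predictable way" could mean, here is a small reversible sketch. It is not the scheme from Wheeler's page, just a percent-style illustration I picked for the example, and both function names are made up.

```python
def escape_name(raw):
    """Escape control bytes (and '%' itself) as %XX so the result is printable."""
    out = bytearray()
    for b in raw:
        if b < 0x20 or b == 0x7F or b == ord("%"):
            out += b"%%%02X" % b  # e.g. newline becomes b"%0A"
        else:
            out.append(b)
    return bytes(out)


def unescape_name(escaped):
    """Invert escape_name(), recovering the original on-disk bytes."""
    out = bytearray()
    i = 0
    while i < len(escaped):
        if escaped[i] == ord("%"):
            out.append(int(escaped[i + 1:i + 3].decode("ascii"), 16))
            i += 3
        else:
            out.append(escaped[i])
            i += 1
    return bytes(out)


on_disk = b"bad\nname 100%"
shown = escape_name(on_disk)
print(shown)
print(unescape_name(shown) == on_disk)
```

Escaping `%` as well is what keeps the mapping unambiguous: every name userspace sees corresponds to exactly one name on disk.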
Control characters in file names
Posted Nov 30, 2010 1:39 UTC (Tue) by jamesh (guest, #1159)
As well as being a lot of work, using extended attributes introduces ambiguity. Some extra problems with that suggestion are:
You could have two files in a directory with the same sequence of Unicode code points but different byte representations due to being encoded differently.
Applications might encounter paths like /latin1-part/utf8-part/sjis-part and need to check the encoding of each path component in order to display it to the user. Perhaps more difficult would be resolving a Unicode path to something like this.
Extended attributes are associated with the file rather than the file name. What do you do if a file has two hard links with differently encoded file names?
Picking one encoding/normalisation is the only sane option, and it would be nice if the kernel would help enforce such a choice.
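A quick sketch of the first problem, using an example name of my own choosing: the same visible name can end up as several distinct byte strings, each of which a blob-based filesystem treats as a different file.

```python
import unicodedata

text = "caf\u00e9"  # "café" with a precomposed e-acute

# Same code points, different encodings.
latin1_name = text.encode("latin-1")
utf8_nfc = unicodedata.normalize("NFC", text).encode("utf-8")
# Same encoding, different normalisation (e followed by a combining accent).
utf8_nfd = unicodedata.normalize("NFD", text).encode("utf-8")

distinct = {latin1_name, utf8_nfc, utf8_nfd}
for name in sorted(distinct):
    print(name)
print(len(distinct), "different on-disk names for one visible name")
```

The first two differ only by encoding, which is the case described above; the third shows that even agreeing on UTF-8 is not enough unless a normalisation form is also fixed.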
Control characters in file names
Posted Dec 2, 2010 18:22 UTC (Thu) by Wol (guest, #4433)
One problem with that ... (administrators enforcing policy, that is)
I've worked on a system where a file was composed of sub-files (Pr1mos). This was emulated on *nix by using a directory with "special" names inside: all the subfiles were named "<space><backspace><number>", because nobody is supposed to touch these subfiles directly.
So if you enforce a policy like that, you could bust a bunch of apps ...