
Method for (mostly) kernel-independent Unicode filenames?


Posted Feb 19, 2004 16:28 UTC (Thu) by Max.Hyre (guest, #1054)
In reply to: The kernel and character set encodings by danscox
Parent article: The kernel and character set encodings

[Strawman proposal---please point me toward discussions where it's all been hashed out, shot down, &c. Or, just flame direct.]

How about changing filename semantics (and, of course, every filesystem known to Linux) so that a filename becomes a three-element struct: a fixed-length tag specifying the name's character-set encoding, a fixed-length count of the bytes in the name, and a variable-length byte string holding the name itself:

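    typedef unsigned char byte;  /* C has no built-in byte type */

    struct filename {
        enum encoding enc;  /* character-set encoding of the name (enum defined elsewhere) */
        int cb;             /* count of bytes in the name */
        byte *rgb;          /* the name itself: cb bytes, no terminating NUL needed */
    };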

Now the kernel doesn't give a fig what the encoding is or what it might mean: it's all bytes, with no chance (hah!) of filename buffer overflows and their attendant dangers to root. The libraries just use the struct for calls to fopen(), remove(), rename(), & friends, with the caller allowed to specify that

  • an exact match (on all elements of the struct) is needed for equality comparisons,
  • a bytewise match on the name bytes (the rgb arrays), regardless of the encoding, is sufficient, or
  • a comparator function of its own (supplied by the caller) be run on pairs of the structs.

The kernel code is encoding-agnostic, and the rest of the work (emphatically including sorting) is in userland.
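The three comparison modes in the list above might look roughly like this in C. This is only a sketch of the strawman, not anything that exists in Linux: the enum values, the `filename_cmp` typedef, and all three function names are made up for illustration, and `cb` is a `size_t` here rather than the `int` in the proposal.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical encoding tags; the proposal does not enumerate any. */
enum encoding { ENC_UTF8, ENC_LATIN1, ENC_UTF16LE };

struct filename {
    enum encoding enc;        /* encoding tag, opaque to the kernel */
    size_t cb;                /* count of bytes in rgb */
    const unsigned char *rgb; /* name bytes, not NUL-terminated */
};

/* Caller-supplied comparator: returns nonzero when the names match. */
typedef int (*filename_cmp)(const struct filename *, const struct filename *);

/* Second mode: same bytes, encoding tag ignored. */
int filename_eq_bytes(const struct filename *a, const struct filename *b)
{
    return a->cb == b->cb &&
           (a->cb == 0 || memcmp(a->rgb, b->rgb, a->cb) == 0);
}

/* First mode: every element of the struct must agree. */
int filename_eq_exact(const struct filename *a, const struct filename *b)
{
    return a->enc == b->enc && filename_eq_bytes(a, b);
}

/* Third mode: defer to the caller's comparator; with none supplied,
 * fall back to the exact match (an assumption of this sketch). */
int filename_eq(const struct filename *a, const struct filename *b,
                filename_cmp cmp)
{
    return cmp ? cmp(a, b) : filename_eq_exact(a, b);
}
```

With the same three bytes tagged once as `ENC_UTF8` and once as `ENC_LATIN1`, `filename_eq_bytes()` reports a match and `filename_eq_exact()` does not. None of the three functions ever interprets the bytes, which is the encoding-agnostic property the proposal is after.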


A few problems

Posted Feb 20, 2004 17:35 UTC (Fri) by Ross (guest, #4065)

1) What filesystems support per-file character set selection? Which ones
can handle embedded NUL characters? What about maximum filename length
considerations -- you are no longer measuring in characters because the
number of bytes they use depends on the encoding.

2) There are a whole lot of system calls receiving or returning filenames
(the libc routines like fopen() are a different layer): open(),
getdirentries(), readlink(), stat(), lstat(), rename(), unlink(), link(),
mknod(), chown(), chmod(), utime(), mount() etc. (not to mention Unix
domain sockets). These would all have to change. But POSIX defines them
as taking certain parameters and having certain return types. So you
either have to drop Unix compatibility or you have to add duplicate
versions of each one, much like Microsoft did when converting to UCS-2.

3) What about Unix applications and old Linux apps? They won't even
compile if you change the prototypes. If you don't, and instead make the old
system calls default to UTF-8 or something, you still have to make them work
with filenames in other encodings.

4) It won't fix the policy problem without involving the kernel anyway.
What about case insensitivity, canonicalizing characters, path delimiters
etc.? You removed the need for the terminating NUL but what about the "/"
character? What about character sets with no slash, or with multiple
slashes? The kernel will need to know what these are and that will depend
on the character set.
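The byte-level hazards in points 1 and 4 can be made concrete with a small C sketch. Everything named here is hypothetical: the sample name "į/a" (U+012F, a slash, then "a") and the three helper functions exist only for illustration. UTF-16LE is used as the awkward case because it stores U+012F as the bytes 2F 01 and pads every ASCII character with a zero byte.

```c
#include <stddef.h>
#include <string.h>

/* The name "į/a" in two encodings. */
static const unsigned char name_utf8[]    = { 0xC4, 0xAF, '/', 'a' };
static const unsigned char name_utf16le[] = { 0x2F, 0x01, 0x2F, 0x00, 0x61, 0x00 };

/* Length a NUL-terminated API would see: bytes before the first zero byte. */
size_t c_string_length(const unsigned char *rgb, size_t cb)
{
    const unsigned char *nul = memchr(rgb, 0, cb);
    return nul ? (size_t)(nul - rgb) : cb;
}

/* Encoding-blind scan: offset of the first 0x2F byte, or cb if none. */
size_t first_slash_bytewise(const unsigned char *rgb, size_t cb)
{
    size_t i;
    for (i = 0; i < cb; i++)
        if (rgb[i] == 0x2F)
            return i;
    return cb;
}

/* UTF-16LE-aware scan: offset of the first U+002F code unit, or cb if none. */
size_t first_slash_utf16le(const unsigned char *rgb, size_t cb)
{
    size_t i;
    for (i = 0; i + 1 < cb; i += 2)
        if (rgb[i] == 0x2F && rgb[i + 1] == 0x00)
            return i;
    return cb;
}
```

The same three characters take 4 bytes in UTF-8 and 6 in UTF-16LE, so a length limit counted in bytes means different things per encoding. `c_string_length(name_utf16le, 6)` stops at offset 3, so any interface still relying on a terminating NUL would truncate the name. `first_slash_bytewise(name_utf16le, 6)` returns 0, landing inside "į", while `first_slash_utf16le()` returns 2, the real separator. A kernel that splits paths on "/" therefore cannot stay entirely ignorant of the encoding tag.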
