LWN: Comments on "The kernel and character set encodings"

Re-mount Through Caseless VFS?

massimiliano — Mon, 23 Feb 2004 07:40:14 +0000

I am definitely not a kernel developer, but this sounds like
the perfect solution: very general, and perfectly decoupled
from the code of existing filesystems... and moreover, you pay
the performance penalty only if you use the feature.

As an added benefit, it could be implemented entirely in user
space using FUSE, and only if/when it works very well (and the
added performance is needed) as a kernel module.

With such a layer, it would also be possible to handle all those
nasty Unicode normalizations...

Just my two cents, anyway.

Re-mount Through Caseless VFS?

miallen — Mon, 23 Feb 2004 00:37:01 +0000

Why not just create a "casefs" VFS that just uses the existing ops for the target mounted fs but overloads lookup() to do the caseless pathwalk (and maybe save the last N paths with hashes in a separate cache)? Now you would just (re)mount an existing fs through this casefs VFS. It wouldn't be optimal but it would still be a lot faster for Samba, WINE, or whoever and it wouldn't barf all over any other kernel code. It's probably not a lot of code either.

Mike

The kernel and character set encodings

fiberbit — Sat, 21 Feb 2004 21:14:37 +0000

The problem lies in checking whether or not a file with a given name (case insensitive) exists. Say you do 'fp = fopen("filename", "a")', and "filename" doesn't exist yet; then in the case-insensitive case, you have to check whether "Filename" or "fIlename" or any other variant *does* exist.
You'd either have to try all possible combinations, or (in practice) scan the whole directory to see if any name matches (and use the first). This is not only very time consuming, but also racy in a multi-process environment.
It could be solved by using case-insensitive hash functions in the dentry cache, but that would negatively impact normal filesystems, and is unacceptable to most, including the top penguin.

The kernel and character set encodings

Cato — Sat, 21 Feb 2004 07:49:26 +0000

This problem needs to be addressed somewhere, though not necessarily in the kernel (perhaps in glibc or the GUI layer): two users create identical looking filenames using Vietnamese accented characters (letter + 2 accents in different order, 3 Unicode characters altogether). Then, there are two identical-looking filenames and you don't know how to type the 'right' one. Even if there is only one file involved, without Unicode normalisation you wouldn't be able to use bash filename completion, since you might type the accents in a different order to that used in the filename, though there would be no visual clue as to your mistake.

Given these issues, which affect command line tools as much as GUIs, it may be sensible to put NFC normalisation in glibc or the kernel, despite the complexity. Files created from another system on a Linux NFS filesystem would of course bypass glibc, so the alternatives are batch renormalisation (always an option, convmv may do this) or putting NFC in the kernel.

It's not good enough to say 'case-insensitivity should not be in the kernel' - you need to address these use cases and say how and where you would solve them.

The kernel and character set encodings

spitzak — Fri, 20 Feb 2004 22:37:34 +0000

Could somebody explain why the case-insensitivity is so important, even
for Samba? It seems to me there cannot be too many Windows programs that
take a filename provided to it by the system and change the case before
using it. My tests show that when you double-click files in Explorer or
from the file chooser or any other way I found to select the files,
Windows gave my program the filename with the exact same case as it was
reported in the file listing.

Yes users can type in the wrong case into a shell, but aren't
command-line interfaces supposed to be "unfriendly"? Why does anybody
care if user-unfriendly interfaces work for stupid users or not?

The kernel and character set encodings

spitzak — Fri, 20 Feb 2004 22:32:13 +0000

I agree that a length is needed, not just for encoding NUL, but to allow
a slash-separated name to be quickly converted to this form, without a
need to malloc and copy a block of memory for each piece.

One possibly less-ugly scheme is to use Plan9's "walk" style. You have
"file descriptors" that represent a filename, unopened as yet. These are
created by copying an existing one (a small set, such as one for "/", are
provided when the program starts up, like stdin/out). There is then a call
something like
walk(fd,char* name,int length) which moves fd to the subdirectory in
name[0..length-1]. When you finally are at the desired file you call
open(fd,mode). Existing open() calls would be turned into a bunch of walk
calls followed by a new open.

With this, no arrays are passed to the kernel, and it does not have to
store these arrays.
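
As a rough illustration of why a length helps, the sketch below splits a slash-separated path into (pointer, length) pieces that point into the original buffer, so nothing is allocated or copied. The names walk_path, walk_fn, walk_stats and count_component are invented for this example; the callback merely stands in for the walk(fd, name, length) call described above, which is not an existing Linux system call.

```c
#include <stddef.h>

/* Hypothetical callback: receives one path component as (pointer, length). */
typedef int (*walk_fn)(const char *name, size_t length, void *ctx);

/*
 * Hand each non-empty component of `path` to `walk` as a pointer into the
 * original string plus a length. Returns the first non-zero value the
 * callback returns, or 0 once every component has been visited.
 */
int walk_path(const char *path, walk_fn walk, void *ctx)
{
    const char *p = path;

    while (*p != '\0') {
        const char *start;
        int rc;

        while (*p == '/')               /* skip one or more separators */
            p++;
        start = p;
        while (*p != '\0' && *p != '/')
            p++;
        if (p == start)                 /* only trailing slashes were left */
            break;
        rc = walk(start, (size_t)(p - start), ctx);
        if (rc != 0)
            return rc;
    }
    return 0;
}

/* Example callback: counts components and remembers the last length seen. */
struct walk_stats {
    size_t count;
    size_t last_length;
};

int count_component(const char *name, size_t length, void *ctx)
{
    struct walk_stats *s = ctx;

    (void)name;     /* a real walk would look this name up in the directory */
    s->count++;
    s->last_length = length;
    return 0;
}
```

Calling walk_path("/usr//local/bin/", count_component, &stats) visits "usr", "local" and "bin" without touching the heap.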

Unicode bugs

spitzak — Fri, 20 Feb 2004 22:23:06 +0000

Avoiding those bugs is one of the primary reasons why UTF-8 is a good
idea.

"../" in a UTF-8 filename means the *BYTES* for '.', '.', and '/' appear
next to each other. It is entirely irrelevant if the UTF-8 string is
legal or if it contains a byte sequence that some broken software by
Microsoft will turn into a slash.

I don't know how many times this has to be stated. But if your program is
looking at a UTF-8 string and is doing anything other than drawing the
characters on the screen, YOU DO NOT NEED TO DECODE IT! Just look at the
bytes!
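
To make the "just look at the bytes" point concrete, here is a small sketch of a check for ".." path components that never decodes anything. The function name has_dotdot_component is made up for the example. It works on UTF-8 because every byte of a multi-byte sequence is 0x80 or above, so the bytes 0x2E ('.') and 0x2F ('/') can only ever mean themselves.

```c
#include <stdbool.h>

/*
 * Returns true if the NUL-terminated byte string `path` contains a ".."
 * path component. Only the raw bytes '.' and '/' are examined; the string
 * is never decoded as UTF-8.
 */
bool has_dotdot_component(const char *path)
{
    const char *p = path;

    while (*p != '\0') {
        const char *start = p;

        while (*p != '\0' && *p != '/')     /* find the end of this component */
            p++;
        if (p - start == 2 && start[0] == '.' && start[1] == '.')
            return true;
        if (*p == '/')
            p++;
    }
    return false;
}
```

An overlong sequence such as the bytes C0 AE C0 AE is not reported here, and that is the intended behaviour: a byte-oriented filesystem will not treat it as ".." either, so only a layer that decodes and rewrites the name can turn it into a problem.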

The kernel and character set encodings

spitzak — Fri, 20 Feb 2004 22:19:48 +0000

There is no problem with UTF-8 filenames. The bytes should be stored
unchanged, and unchanged bytes should be used to look up the file. It
does not matter if those bytes are a legal UTF-8 string or not, to say
nothing of what normalization form they are.

Unfortunately there are hordes of people out there who think dumb ideas
like case-insensitivity should be applied at low levels to stuff that
really is binary data. This kind of thinking is what causes complexity,
and complexity causes bugs and security holes.

Any program that takes a string it thinks is UTF-8 and does
*ANYTHING* other than pass the exact bytes unchanged to another
interface that wants UTF-8 is by definition broken. This simple rule will
completely eliminate all ambiguity about UTF-8.

The kernel and character set encodings

Ross — Fri, 20 Feb 2004 17:37:36 +0000

But you are using C strings to denote the elements which means they are
still NUL terminated. To fix it you need a second array for the path
component lengths. I think you are unlikely to convince any of the kernel
guys this isn't too ugly to live.

A few problems

Ross — Fri, 20 Feb 2004 17:35:21 +0000

1) What filesystems support per-file character set selection? Which ones
can handle embedded NUL characters? What about maximum filename length
considerations -- you are no longer measuring in characters because the
number of bytes they use depends on the encoding.

2) There are a whole lot of system calls receiving or returning filenames
(the libc routines like fopen() are a different layer): open(),
getdirentries(), readlink(), stat(), lstat(), rename(), unlink(), link(),
mknod(), chown(), chmod(), utime(), mount() etc. (not to mention Unix
domain sockets). These would all have to change. But POSIX defines them
as taking certain parameters and having certain return types. So you
either have to drop Unix compatibility or you have to add duplicate
versions of each one much like Microsoft did when converting to UCS-2.

3) What about Unix applications and old Linux apps? They won't even
compile if you change the prototypes. If you don't and make the old
system calls default to UTF8 or something you still have to make them work
with filenames in other encodings.

4) It won't fix the policy problem without involving the kernel anyway.
What about case insensitivity, canonicalizing characters, path delimiters
etc.? You removed the need for the terminating NUL but what about the "/"
character? What about character sets with no slash, or with multiple
slashes? The kernel will need to know what these are and that will depend
on the character set.

The kernel and character set encodings

flewellyn — Fri, 20 Feb 2004 02:24:17 +0000

Why does the kernel even use "/" and NUL? Seriously, pathnames should be internally coded as structures, not strings. The only parsing of pathname strings should occur in the C library, including syscall wrappers. The kernel should not have any notion, internally, of pathname separators. It's just silly.

Instead, I propose something like this: stick each element of the pathname into an array element, innermost first (that is, the "root" directory would be LAST element of the array), and use a special token to indicate ROOT. You could have the array live in a struct, with the other struct element being the length of the array, if you like. Something like this:

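    struct pathname {
        int length;          /* number of entries in elements[] */
        char *elements[];    /* elements[0] is the file's name, the last entry is ROOT */
    };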

This way you could get at the file's name with a simple elements[0], and walk the directory tree from root to the file like this:

    for (i = length - 1; i >= 0; i--) { /* blah blah blah whatever with elements[i] */ }

No worrying about parsing out the "/" separators.

How would a case-insensitive magic_open() call work?

chad.netzer — Thu, 19 Feb 2004 19:42:02 +0000

You get an arbitrary file. Tridge suggests that (so far) he hasn't gotten complaints about this kind of behavior (which already exists in Samba), and there are few good alternatives. One possible alternative, to try to keep track of which files are created by Posix systems, and which are created by Windows systems, and preferentially decide between the two, seems like too much work if no one really cares.

The whole case insensitivity issue of Windows is (apparently) a mess, and there appears to be no perfect policy about what to do when interoperating, other than try to do the thing which makes most practical sense.

Method for (mostly) kernel-independent Unicode filenames?

Max.Hyre — Thu, 19 Feb 2004 16:28:00 +0000

[Strawman proposal---please point me toward discussions where it's all been hashed out, shot down, &c. Or, just flame direct.]

How about changing filename semantics (and, of course, every filesystem known to Linux): make filenames a three-element struct: a fixed-length specification of the name's character-set encoding, a fixed-length count of the bytes in the name, and a variable-length string holding said name:

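    struct filename {
        enum encoding enc;   /* the name's character-set encoding */
        int cb;              /* count of bytes in the name */
        byte *rgb;           /* the name itself, cb bytes long */
    };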

Now, the kernel doesn't give a fig what the encoding is, or what it might mean---it's all bytes, with no chance (hah!) of filename buffer overflows and their attendant dangers to root. The libraries just use the struct for calls to fopen(), remove(), rename(), & friends, with the caller allowed to specify that

1) an exact match (on all elements of the struct) is needed for equality comparisons,
2) a bytewise match on the byte *s, regardless of the encoding, is sufficient, or
3) its own comparator function (supplied) be run on pairs of the structs.

The kernel code is encoding-agnostic, and the rest of the work (emphatically including sorting) is in userland.
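
A minimal sketch of how the comparison choices in this strawman could be expressed. Everything beyond the three struct fields is an assumption made for illustration: the enum values, the function names, and the use of size_t and unsigned char in place of the proposal's int and byte.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical encoding tags; the proposal does not enumerate any. */
enum encoding { ENC_UNKNOWN, ENC_UTF8, ENC_ISO_8859_1, ENC_KOI8_R };

struct filename {
    enum encoding enc;          /* the name's character-set encoding */
    size_t cb;                  /* count of bytes in the name */
    const unsigned char *rgb;   /* the name itself, cb bytes long */
};

/* Signature for a caller-supplied comparator. */
typedef bool (*filename_cmp)(const struct filename *a, const struct filename *b);

/* Bytewise match: same length and same bytes, encoding tag ignored. */
bool filename_eq_bytes(const struct filename *a, const struct filename *b)
{
    return a->cb == b->cb &&
           (a->cb == 0 || memcmp(a->rgb, b->rgb, a->cb) == 0);
}

/* Exact match: every element of the struct must agree. */
bool filename_eq_exact(const struct filename *a, const struct filename *b)
{
    return a->enc == b->enc && filename_eq_bytes(a, b);
}

/* Run the caller's comparator if one is given, else require an exact match. */
bool filename_eq(const struct filename *a, const struct filename *b,
                 filename_cmp cmp)
{
    return cmp != NULL ? cmp(a, b) : filename_eq_exact(a, b);
}
```

Because the length travels with the bytes, memcmp() compares the whole name, including any embedded NUL, which a strcmp() on C strings could not do.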

Unicode cannot be secure---B. Schneier

Max.Hyre — Thu, 19 Feb 2004 15:28:42 +0000

Well, that should get their attention. :-) The exact wording was ``Unicode is just too complex to ever be secure.''

In his July 2000 Crypto-gram article on Unicode, Schneier points up the failures we've had dealing with ASCII control characters, escape sequences, different semantics at different levels of the application (think writing a bash command to grep for a particular grep regular expression), and concludes that with Unicode it's not merely hard, it's effectively impossible.

I don't know enough about Unicode to argue the details, but it certainly made me sit up and take notice.

Unicode bugs

Cato — Thu, 19 Feb 2004 13:19:17 +0000

Any new functionality can mean security holes, and this applies whether Unicode is implemented in libraries or the kernel. It's important to address Unicode's potential for such holes (overlong UTF-8 encodings etc), but mostly this is just good practice - e.g. you 'filter in' the characters you know are legal, rather than trying to 'filter out' characters that are illegal (it's very easy to miss just one).

I'm not sure Unicode needs to live in the kernel as long as there is good library support, but it's better for library or kernel maintainers to solve these problems once rather than have different buggy implementations in every application.

The specific IIS issues were related to Microsoft's non-standard %uNNNN encoding of 16-bit UCS-2 (Unicode) characters, so I don't think this is a reason to abandon Unicode.

The kernel and character set encodings

danscox — Thu, 19 Feb 2004 13:16:24 +0000

It seems to me like this would be a perfect place for either FUSE, or a "settable" policy mechanism within the kernel. Even that can get hairy, of course, for many and varied reasons, but it would leave policy in userland, where it should be. This could possibly start up a whole set of 'cottage industries'; modules to support this or that file naming convention. I'm thinking of Firefox and its extensions, for example.

Danny

The kernel and character set encodings

Cato — Thu, 19 Feb 2004 13:12:18 +0000

These encodings are fine where the users agree on a single character set (e.g. KOI8-R in Russia) or where there is some external data (e.g. the directory name or file name including 'koi8-r') describing the character set of the file. I am very aware that there may be conversion problems, which is why Unicode is important, but not everyone is going to move to Unicode straight away - there are still gaps in the user level tools available, though they are improving.

What might be useful is to document the legacy non-Unicode character sets that are incompatible with ASCII and in particular *nix filesystems - so far, I believe that HZ-*, ISO-2022-* and Big5 are all incompatible, but it would be good to see a definitive list. Then at least Linux users would know which character sets to avoid for filenames.

The issue of invalid UTF-8 strings is no different to any other mis-encoded characters - it would be good if glibc or perhaps the kernel checked UTF-8 for overlong characters, as this is a well known security hole and it's not hard to do this.
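
A sketch of what such a check could look like, to show that it is indeed short. The function name utf8_is_wellformed is made up; the code follows the usual UTF-8 rules (shortest form only, at most four bytes, no surrogates, nothing above U+10FFFF) and is not modelled on any existing glibc or kernel interface.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Returns true if buf[0..len-1] is well-formed UTF-8: no overlong forms,
 * no truncated sequences, no surrogates, nothing above U+10FFFF.
 */
bool utf8_is_wellformed(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = buf[i];
        size_t need;            /* continuation bytes expected */
        unsigned long cp;       /* code point being assembled */
        unsigned long min;      /* smallest code point valid at this length */

        if (b < 0x80) {         /* plain ASCII */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            need = 1; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            need = 2; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            need = 3; cp = b & 0x07; min = 0x10000;
        } else {
            return false;       /* stray continuation or invalid lead byte */
        }

        if (len - i <= need)
            return false;       /* sequence runs off the end of the buffer */

        for (size_t k = 1; k <= need; k++) {
            unsigned char c = buf[i + k];

            if ((c & 0xC0) != 0x80)
                return false;   /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min)
            return false;       /* overlong: a shorter encoding exists */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;       /* UTF-16 surrogate */
        if (cp > 0x10FFFF)
            return false;       /* beyond the Unicode range */

        i += need + 1;
    }
    return true;
}
```

The classic overlong slash, the two bytes C0 AF, decodes to 0x2F, which is below the two-byte minimum of 0x80, so it is rejected.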

The kernel and character set encodings

mwh — Thu, 19 Feb 2004 12:14:14 +0000

> Unicode makes life more complicated for everyone

  If Unicode is a horde of zombies with flaming dung sticks, 
  the hideous intricacies of JIS, Chinese Big-5, Chinese 
  Traditional, KOI-8, et cetera are at least an army of ogres 
  with salt and flensing knives.        -- Eric S. Raymond, python-dev

Unicode isn't that hard to deal with, although I'd admit to not having any intuition for what the right answer is in this situation.

The kernel and character set encodings

ibukanov — Thu, 19 Feb 2004 11:18:03 +0000

> These strings result in exactly the same visual appearance on screen, yet they can't be compared with a byte comparison.

You do not even need Unicode normalization for that. In most fonts the following two lines would have exactly the same visual presentation (you have to view the page with UTF-8 encoding, as LWN does not allow entering РОТ in HTML comments due to bugs in recognition of &code; escapes):
POT
РОТ
yet the first uses pure ASCII and the second uses only Cyrillic characters and means mouth in Russian.

IMHO such examples support the notion that the kernel should not impose any policy on file name encoding, as in practice there is always more than one way to encode the same visual presentation, and UTF-8 with Unicode does not help here.

Unicode bugs

simonl — Thu, 19 Feb 2004 11:06:26 +0000

Unicode in kernel means security holes.

Back in 2001 afair Unicode bugs were found in MS IIS. There are many ways to encode a ../../ path in Unicode, and IIS did not know all of them. However the kernel did, and thus circumvented any path sanitizing IIS did.

Linux should not repeat these mistakes.

We would have to fix every little script that deals with user-defined file names, which is impossible. Input validation is hard enough already.

The kernel and character set encodings

one2team — Thu, 19 Feb 2004 09:54:57 +0000

« You say that the only practical choices for character encodings are ISO-8859-1 and UTF-8. In fact, there is a vast range of encodings that will work (basically any encoding that doesn't use NUL and '/' for some other purpose than ASCII semantics). For a start there is ISO-8859-*, KOI8-* (for Cyrillic), EUC-JP, Shift-JIS (both popular in Japan), and so on. »

These encodings are mostly useless in a true multi-user system. Why? Because they are all incompatible. So there is no way for a user that uses encoding A to read stuff (including filenames) made by another user using encoding B. And this is true even for close stuff (KOI8-U and KOI8-R for example). Not to speak of the poor users that may want to quote another language (French + Russian, Welsh + Greek etc).

The only thing all those encodings are compatible with is English, which restricts the second language to English and English only.

One could argue userspace would just have to use Greek encoding for Greek filenames, Russian for Russian ones and so on. But the crux of the problem is that userspace has no way to request or guess what encoding was used to write a filename, since the kernel neither enforces any particular encoding nor provides encoding info to userspace.

One additional problem is that some byte strings can result in invalid UTF-8 and cause applications to barf if they try to decode them.

The kernel and character set encodings

Cato — Thu, 19 Feb 2004 09:24:40 +0000

You say that the only practical choices for character encodings are ISO-8859-1 and UTF-8. In fact, there is a vast range of encodings that will work (basically any encoding that doesn't use NUL and '/' for some other purpose than ASCII semantics). For a start there is ISO-8859-*, KOI8-* (for Cyrillic), EUC-JP, Shift-JIS (both popular in Japan), and so on.

Getting the character encoding right is difficult, and with UTF-8 there is an additional complication, Unicode normalisation - the issue here is that in certain languages, you might have a symbol on the page being encoded as 3 Unicode characters: the letter with accent 1 then accent 2 in one string, and the letter with accent 2 then accent 1 in another string. These strings result in exactly the same visual appearance on screen, yet they can't be compared with a byte comparison. Unicode normalisation defines a specific order for all such 'combining character' strings, but unfortunately there is more than one normalisation form: Linux and the W3C use NFC, while Darwin and MacOS X use NFD, even on UFS filesystems.

Unicode makes life more complicated for everyone and it's likely some of this needs to be in the kernel, or at least glibc, for uniformity. For more links on Unicode, from a Perl/Wiki oriented perspective, see the plan for TWiki support of UTF-8 and this Unicode normalisation page.
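
The accent-ordering problem can be shown with a few bytes. The sketch below spells the same letter two ways, 'a' followed by the combining circumflex (U+0302) and the combining dot below (U+0323) in either order; the choice of letter and the names circumflex_first, dot_first and same_bytes are just for illustration. Both strings are canonically equivalent and render the same, yet a plain byte comparison says they differ.

```c
#include <stdbool.h>
#include <string.h>

/* 'a' + COMBINING CIRCUMFLEX ACCENT (U+0302) + COMBINING DOT BELOW (U+0323) */
static const char circumflex_first[] = "a\xCC\x82\xCC\xA3";

/* 'a' + COMBINING DOT BELOW (U+0323) + COMBINING CIRCUMFLEX ACCENT (U+0302) */
static const char dot_first[] = "a\xCC\xA3\xCC\x82";

/* Byte-for-byte comparison with no normalisation applied. */
bool same_bytes(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Here same_bytes(circumflex_first, dot_first) is false, so used as filenames these would be two distinct files with the same appearance unless something normalises them first.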

How would a case-insensitive magic_open() call work?

brouhaha — Thu, 19 Feb 2004 09:11:08 +0000

Suppose I have two files named "Foobar" and "foobaR" in a particular directory. The user (possibly Samba) calls magic_open("foobar", ...). What can be expected to happen?

I think this proposed magic_open() call is almost as bad an idea as providing an option to allow normal open()s (or the filesystem code) to be case insensitive. The few applications that really need this sort of behavior should implement it in user space by reading the directory, and they can worry about how to handle ambiguous cases there.
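
One way an application could handle the ambiguous case itself is sketched below, assuming it has already read the directory's names into an array. The function name caseless_matches is invented for the example, and strcasecmp() only folds case as far as the C library does (no Unicode).

```c
#include <stddef.h>
#include <strings.h>

/*
 * Counts the entries in names[0..n-1] that equal `want` ignoring case and
 * stores the first such entry in *first (NULL if there is none).
 * A result greater than 1 means the request is ambiguous.
 */
size_t caseless_matches(const char *const names[], size_t n,
                        const char *want, const char **first)
{
    size_t hits = 0;

    *first = NULL;
    for (size_t i = 0; i < n; i++) {
        if (strcasecmp(names[i], want) == 0) {
            if (hits == 0)
                *first = names[i];
            hits++;
        }
    }
    return hits;
}
```

With "Foobar" and "foobaR" both present, a request for "foobar" returns 2, and the application can then decide whether to pick one, prefer an exact-case match, or report an error.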