LWN: Comments on "The kernel and character set encodings" https://lwn.net/Articles/71472/ This is a special feed containing comments posted to the individual LWN article titled "The kernel and character set encodings". Fri, 05 Sep 2025 15:09:45 +0000

Re-mount Through Caseless VFS? https://lwn.net/Articles/72434/ massimiliano <br>I am definitely not a kernel developer, but this sounds like<br>the perfect solution: very general, and perfectly decoupled<br>from the code of existing filesystems... and moreover, you pay<br>the performance penalty only if you use the feature.<p>As an added benefit, it could be implemented entirely in user<br>space using FUSE, and only if/when it works very well (and the<br>added performance is needed) as a kernel module.<p>With such a layer, it would also be possible to handle all those<br>nasty Unicode normalizations...<p>Just my two cents, anyway. Mon, 23 Feb 2004 07:40:14 +0000

Re-mount Through Caseless VFS? https://lwn.net/Articles/72425/ miallen Why not just create a &quot;casefs&quot; VFS that just uses the existing ops for the target mounted fs but overloads lookup() to do the caseless pathwalk (and maybe save the last N paths with hashes in a separate cache)? Now you would just (re)mount an existing fs through this casefs VFS. It wouldn't be optimal but it would still be a lot faster for Samba, WINE, or whoever and it wouldn't barf all over any other kernel code. It's probably not a lot of code either.<p>Mike Mon, 23 Feb 2004 00:37:01 +0000

The kernel and character set encodings https://lwn.net/Articles/72371/ fiberbit The problem lies in checking whether or not a file with a given name (case insensitive) exists. 
Say you do an 'fp = fopen(&quot;filename&quot;, &quot;a&quot;)', and &quot;filename&quot; doesn't exist yet, then in the case-insensitive case, you have to check whether &quot;Filename&quot; or &quot;fIlename&quot; or any other variant *does* exist.<br>You'd either have to try all possible combinations, or (in practice) scan the whole directory to see if any name matches (and use the first). This is not only very time consuming, but also racy in a multi-process environment.<br>It could be solved by using case-insensitive hash functions in the dentry cache, but that would negatively impact normal filesystems, and is unacceptable to most, including the top penguin. Sat, 21 Feb 2004 21:14:37 +0000

The kernel and character set encodings https://lwn.net/Articles/72324/ Cato This problem needs to be addressed somewhere, though not necessarily in the kernel (perhaps in glibc or the GUI layer): two users create identical-looking filenames using Vietnamese accented characters (letter + 2 accents in different order, 3 Unicode characters altogether). Then, there are two identical-looking filenames and you don't know how to type the 'right' one. Even if there is only one file involved, without Unicode normalisation you wouldn't be able to use bash filename completion, since you might type the accents in a different order to that used in the filename, though there would be no visual clue as to your mistake.<p>Given these issues, which affect command line tools as much as GUIs, it may be sensible to put NFC normalisation in glibc or the kernel, despite the complexity. 
Files created from another system on a Linux NFS filesystem would of course bypass glibc, so the alternatives are batch renormalisation (always an option, convmv may do this) or putting NFC in the kernel.<p>It's not good enough to say 'case-insensitivity should not be in the kernel' - you need to address these use cases and say how and where you would solve them.<p> Sat, 21 Feb 2004 07:49:26 +0000

The kernel and character set encodings https://lwn.net/Articles/72285/ spitzak Could somebody explain why the case-insensitivity is so important, even <br>for Samba? It seems to me there cannot be too many Windows programs that <br>take a filename provided to it by the system and change the case before <br>using it. My tests show that when you double-click files in Explorer or <br>from the file chooser or any other way I found to select the files, <br>Windows gave my program the filename with the exact same case as it was <br>reported in the file listing.<p>Yes users can type in the wrong case into a shell, but aren't <br>command-line interfaces supposed to be &quot;unfriendly&quot;? Why does anybody <br>care if user-unfriendly interfaces work for stupid users or not?<p> Fri, 20 Feb 2004 22:37:34 +0000

The kernel and character set encodings https://lwn.net/Articles/72284/ spitzak I agree that a length is needed, not just for encoding NUL, but to allow <br>a slash-separated name to be quickly converted to this form, without a <br>need to malloc and copy a block of memory for each piece.<p>One possibly less-ugly scheme is to use Plan9's &quot;walk&quot; style. You have <br>&quot;file descriptors&quot; that represent a filename, unopened as yet. These are <br>created by copying an <br>existing one (a small set, such as one for &quot;/&quot;, are provided when the <br>program starts up, like stdin/out). 
There is then a call something like <br>walk(fd,char* name,int length) which moves fd to the subdirectory in <br>name[0..length-1]. When you finally are at the desired file you call <br>open(fd,mode). Existing open() calls would be turned into a bunch of walk <br>calls followed by a new open.<p>With this, no arrays are passed to the kernel, and it does not have to <br>store these arrays.<p> Fri, 20 Feb 2004 22:32:13 +0000

Unicode bugs https://lwn.net/Articles/72283/ spitzak Avoiding those bugs is one of the primary reasons why UTF-8 is a good <br>idea.<p>&quot;../&quot; in a UTF-8 filename means the *BYTES* for '.', '.', and '/' appear <br>next to each other. It is entirely irrelevant if the UTF-8 string is <br>legal or if it contains a byte sequence that some broken software by <br>Microsoft will turn into a slash.<p>I don't know how many times this has to be stated. But if your program is <br>looking at a UTF-8 string and is doing anything other than drawing the <br>characters on the screen, YOU DO NOT NEED TO DECODE IT! Just look at the <br>bytes!<p> Fri, 20 Feb 2004 22:23:06 +0000

The kernel and character set encodings https://lwn.net/Articles/72281/ spitzak There is no problem with UTF-8 filenames. The bytes should be stored <br>unchanged, and unchanged bytes should be used to look up the file. It <br>does not matter if those bytes are a legal UTF-8 string or not, to say <br>nothing of what normalization form they are.<p>Unfortunately there are hordes of people out there who think dumb ideas <br>like case-insensitivity should be applied at low levels to stuff that <br>really is binary data. This kind of thinking is what causes complexity, <br>and complexity causes bugs and security holes.<p>Any program that takes a string it thinks is UTF-8 and does <br><i>ANYTHING</i> other than pass the exact bytes unchanged to another <br>interface that wants UTF-8 is by definition broken. 
This simple rule will <br>completely eliminate all ambiguity about UTF-8. Fri, 20 Feb 2004 22:19:48 +0000

The kernel and character set encodings https://lwn.net/Articles/72219/ Ross But you are using C strings to denote the elements which means they are<br>still NUL terminated. To fix it you need a second array for the path<br>component lengths. I think you are unlikely to convince any of the kernel<br>guys this isn't too ugly to live. Fri, 20 Feb 2004 17:37:36 +0000

A few problems https://lwn.net/Articles/72217/ Ross 1) What filesystems support per-file character set selection? Which ones<br> can handle embedded NUL characters? What about maximum filename length<br> considerations -- you are no longer measuring in characters because the<br> number of bytes they use depends on the encoding.<p>2) There are a whole lot of system calls receiving or returning filenames<br> (the libc routines like fopen() are a different layer): open(),<br> getdirentries(), readlink(), stat(), lstat(), rename(), unlink(), link(),<br> mknod(), chown(), chmod(), utime(), mount() etc. (not to mention Unix<br> domain sockets). These would all have to change. But POSIX defines them<br> as taking certain parameters and having certain return types. So you<br> either have to drop Unix compatibility or you have to add duplicate<br> versions of each one much like Microsoft did when converting to UCS2.<p>3) What about Unix applications and old Linux apps? They won't even<br> compile if you change the prototypes. If you don't and make the old<br> system calls default to UTF8 or something you still have to make them work<br> with filenames in other encodings.<p>4) It won't fix the policy problem without involving the kernel anyway.<br> What about case insensitivity, canonicalizing characters, path delimiters<br> etc.? You removed the need for the terminating NUL but what about the &quot;/&quot;<br> character? 
What about character sets with no slash, or with multiple<br> slashes? The kernel will need to know what these are and that will depend<br> on the character set.<br> Fri, 20 Feb 2004 17:35:21 +0000

The kernel and character set encodings https://lwn.net/Articles/72119/ flewellyn Why does the kernel even use &quot;/&quot; and NUL? Seriously, pathnames should be internally coded as structures, not strings. The only parsing of pathname strings should occur in the C library, including syscall wrappers. The kernel should not have any notion, internally, of pathname separators. It's just silly.<p>Instead, I propose something like this: stick each element of the pathname into an array element, innermost first (that is, the &quot;root&quot; directory would be the LAST element of the array), and use a special token to indicate ROOT. You could have the array live in a struct, with the other struct element being the length of the array, if you like. Something like this:<p>struct pathname {<br> int length;<br> char* elements[];<br>};<p>This way you could get at the file's name with a simple elements[0], and walk the directory tree from root to the file like this:<p>for (i = length - 1; i &gt;= 0; i--) {blah blah blah whatever};<p>No worrying about parsing out the &quot;/&quot; separators. Fri, 20 Feb 2004 02:24:17 +0000

How would a case-insensitive magic_open() call work? https://lwn.net/Articles/72039/ chad.netzer You get an arbitrary file. Tridge suggests that (so far) he hasn't gotten complaints about this kind of behavior (which already exists in Samba), and there are few good alternatives. 
One possible alternative, to try to keep track of which files are created by POSIX systems, and which are created by Windows systems, and preferentially decide between the two, seems like too much work if no one really cares.<p>The whole case insensitivity issue of Windows is (apparently) a mess, and there appears to be no perfect policy about what to do when interoperating, other than try to do the thing which makes most practical sense.<br> Thu, 19 Feb 2004 19:42:02 +0000

Method for (mostly) kernel-independent Unicode filenames? https://lwn.net/Articles/72000/ Max.Hyre <p><i>[Strawman proposal---please point me toward discussions where it's all been hashed out, shot down, &amp;c. Or, just flame direct.]</i> <p>How about changing filename semantics (and, of course, every filesystem known to Linux): make filenames a three-element struct: a fixed-length specification of the name's character-set encoding, a fixed-length count of the <i>bytes</i> in the name, and a variable-length string holding said name:<br> <pre>
struct filename {
    enum encoding enc;
    int           cb;
    byte         *rgb;
};
</pre> <p>Now, the kernel doesn't give a fig what the encoding is, or what it might mean---it's all bytes, with no chance (hah!) of filename buffer overflows and their attendant dangers to root. The libraries just use the struct for calls to <code>fopen()</code>, <code>remove()</code>, <code>rename()</code>, &amp; friends, with the caller allowed to specify that <ul> <li>an exact match (on all elements of the struct) is needed for equality comparisons, <li>a bytewise match on the <code>byte *</code>s, regardless of the encoding, is sufficient, or <li>its own comparator function (supplied) be run on pairs of the structs. </ul> <p>The kernel code is encoding-agnostic, and the rest of the work (emphatically including sorting) is in userland. Thu, 19 Feb 2004 16:28:00 +0000

Unicode cannot be secure---B. 
Schneier https://lwn.net/Articles/71999/ Max.Hyre <p>Well, that should get their attention. :-) The exact wording was ``Unicode is just too complex to ever be secure.'' <p>In his July 2000 Crypto-gram <a href="http://www.schneier.com/crypto-gram-0007.html#9">article on Unicode</a>, Schneier points up the failures we've had dealing with ASCII control characters, escape sequences, different semantics at different levels of the application (think writing a <code>bash</code> command to grep for a particular <code>grep</code> regular expression), and concludes that with Unicode it's not merely hard, it's effectively impossible. <p>I don't know enough about Unicode to argue the details, but it certainly made me sit up and take notice. Thu, 19 Feb 2004 15:28:42 +0000

Unicode bugs https://lwn.net/Articles/71981/ Cato Any new functionality can mean security holes, and this applies whether Unicode is implemented in libraries or the kernel. It's important to address Unicode's potential for such holes (overlong UTF-8 encodings etc), but mostly this is just good practice - e.g. you 'filter in' the characters you know are legal, rather than trying to 'filter out' characters that are illegal (it's very easy to miss just one).<p>I'm not sure Unicode needs to live in the kernel as long as there is good library support, but it's better for library or kernel maintainers to solve these problems once rather than have different buggy implementations in every application.<p>The specific IIS issues were related to Microsoft's non-standard %uNNNN encoding of 16-bit UCS-2 (Unicode) characters, so I don't think this is a reason to abandon Unicode. Thu, 19 Feb 2004 13:19:17 +0000

The kernel and character set encodings https://lwn.net/Articles/71979/ danscox It seems to me like this would be a perfect place for either FUSE, or a &quot;settable&quot; policy mechanism within the kernel. 
Even that can get hairy, of course, for many and varied reasons, but it would leave policy in userland, where it should be. This could possibly start up a whole set of 'cottage industries'; modules to support this or that file naming convention. I'm thinking of Firefox and its extensions, for example.<p>Danny Thu, 19 Feb 2004 13:16:24 +0000

The kernel and character set encodings https://lwn.net/Articles/71978/ Cato These encodings are fine where the users agree on a single character set (e.g. KOI8-R in Russia) or where there is some external data (e.g. the directory name or file name including 'koi8-r') describing the character set of the file. I am very aware that there may be conversion problems, which is why Unicode is important, but not everyone is going to move to Unicode straight away - there are still gaps in the user level tools available, though they are improving.<p>What might be useful is to document the legacy non-Unicode character sets that are incompatible with ASCII and in particular *nix filesystems - so far, I believe that HZ-*, ISO-2022-* and Big5 are all incompatible, but it would be good to see a definitive list. Then at least Linux users would know which character sets to avoid for filenames.<p>The issue of invalid UTF-8 strings is no different to any other mis-encoded characters - it would be good if glibc or perhaps the kernel checked UTF-8 for overlong characters, as this is a well known security hole and it's not hard to do this.<br> Thu, 19 Feb 2004 13:12:18 +0000

The kernel and character set encodings https://lwn.net/Articles/71930/ mwh &gt; Unicode makes life more complicated for everyone <pre> If Unicode is a horde of zombies with flaming dung sticks, the hideous intricacies of JIS, Chinese Big-5, Chinese Traditional, KOI-8, et cetera are at least an army of ogres with salt and flensing knives. -- Eric S. 
Raymond, python-dev </pre> Unicode isn't <i>that</i> hard to deal with, although I'd admit to not having any intuition for what the right answer is in this situation. Thu, 19 Feb 2004 12:14:14 +0000

The kernel and character set encodings https://lwn.net/Articles/71920/ ibukanov <p><i> &gt; These strings result in exactly the same visual appearance on screen, yet they can't be compared with a byte comparison. </i></p> <p> You do not even need Unicode normalization for that. In most fonts the following two lines would have exactly the same visual presentation (you have to view the page with UTF-8 encoding as LWN does not allow entering &amp;#1056;&amp;#1054;&amp;#1058; in HTML comments due to bugs in recognition of &amp;code; escapes): <br> POT <br> РОТ <br> yet the first uses pure ASCII and the second uses only Cyrillic characters and means <i>mouth</i> in Russian. </p> <p> IMHO such examples support the notion that the kernel should not impose any policy on file name encoding, as in practice there is always more than one way to encode the same visual presentation, and UTF-8 with Unicode does not help here. Thu, 19 Feb 2004 11:18:03 +0000

Unicode bugs https://lwn.net/Articles/71923/ simonl Unicode in kernel means security holes.<p>Back in 2001, AFAIR, Unicode bugs were found in MS IIS. There are many ways to encode a ../../ path in Unicode, and IIS did not know all of them. However the kernel did, and thus circumvented any path sanitizing IIS did.<p>Linux should not repeat these mistakes.<p>We would have to fix every little script that deals with user-defined file names; that is impossible. Input validation is hard enough already.<br> Thu, 19 Feb 2004 11:06:26 +0000

The kernel and character set encodings https://lwn.net/Articles/71916/ one2team « You say that the only practical choices for character encodings are ISO-8859-1 and UTF-8. 
In fact, there is a vast range of encodings that will work (basically any encoding that doesn't use NUL and '/' for some other purpose than ASCII semantics). For a start there is ISO-8859-*, KOI8-* (for Cyrillic), EUC-JP, Shift-JIS (both popular in Japan), and so on. »<p>These encodings are mostly useless in a true multi-user system. Why? Because they are all incompatible. So there is no way for a user that uses encoding A to read stuff (including filenames) made by another user using encoding B. And this is true even for closely related encodings (KOI8-U and KOI8-R for example). Not to speak of the poor users that may want to quote another language (French + Russian, Welsh + Greek etc).<p>The only thing all those encodings are compatible with is English, which restricts the second language to English and English only.<p>One could argue userspace would just have to use a Greek encoding for Greek filenames, a Russian one for Russian ones and so on. But the crux of the problem is that userspace has no way to request or guess what encoding was used to write a filename, since the kernel neither enforces any particular encoding nor provides encoding info to userspace.<p>One additional problem is that some byte strings are invalid UTF-8 and cause applications to barf if they try to decode them. Thu, 19 Feb 2004 09:54:57 +0000

The kernel and character set encodings https://lwn.net/Articles/71913/ Cato <p>You say that the only practical choices for character encodings are ISO-8859-1 and UTF-8. In fact, there is a vast range of encodings that will work (basically any encoding that doesn't use NUL and '/' for some other purpose than ASCII semantics). For a start there is ISO-8859-*, KOI8-* (for Cyrillic), EUC-JP, Shift-JIS (both popular in Japan), and so on. 
</p><p> Getting the character encoding right is difficult, and with UTF-8 there is an additional complication, Unicode normalisation - the issue here is that in certain languages, you might have a symbol on the page being encoded as 3 Unicode characters: the letter with accent 1 then accent 2 in one string, and the letter with accent 2 then accent 1 in another string. These strings result in exactly the same visual appearance on screen, yet they can't be compared with a byte comparison. Unicode normalisation defines a specific order for all such 'combining character' strings, but unfortunately there is more than one normalisation form: Linux and the W3C use NFC, while <a href="http://twiki.org/cgi-bin/view/Codev/MacOSXFilesystemEncodingWithI18N">Darwin and MacOS X use NFD</a>, even on UFS filesystems. </p><p> Unicode makes life more complicated for everyone and it's likely some of this needs to be in the kernel, or at least glibc, for uniformity. For more links on Unicode, from a Perl/Wiki oriented perspective, see the <a href="http://twiki.org/cgi-bin/view/Codev/ProposedUTF8SupportForI18N">plan for TWiki support of UTF-8</a> and this <a href="http://twiki.org/cgi-bin/view/Codev/UnicodeNormalisation">Unicode normalisation page</a>. Thu, 19 Feb 2004 09:24:40 +0000

How would a case-insensitive magic_open() call work? https://lwn.net/Articles/71912/ brouhaha Suppose I have two files named "Foobar" and "foobaR" in a particular directory. The user (possibly Samba) calls magic_open("foobar", ...). What can be expected to happen? <p> I think this proposed magic_open() call is almost as bad an idea as providing an option to allow normal open()s (or the filesystem code) to be case insensitive. The few applications that really need this sort of behavior should implement it in user space by reading the directory, and they can worry about how to handle ambiguous cases there. Thu, 19 Feb 2004 09:11:08 +0000
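For what it's worth, the user-space approach described in the last comment (read the directory, then decide what to do about ambiguous matches) might look roughly like the sketch below. This is only an illustration, not anything Samba or the kernel actually does: the names `caseless_result`, `caseless_consider` and `caseless_lookup` are made up here, the tie-break policy (an exact-case match wins, otherwise the first match seen) is just one possible choice, and `strcasecmp()` only folds case according to the current locale, so it does nothing for multi-byte UTF-8 names or Unicode normalisation.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>   /* strcasecmp() (POSIX) */

/* Hypothetical result type: how many directory entries matched the wanted
 * name when case is ignored, and which one was picked. */
struct caseless_result {
    int  matches;       /* number of caseless matches seen so far */
    char chosen[256];   /* picked name; empty string if there was no match */
};

/* Fold one candidate name into the result. An exact (case-sensitive) match
 * always wins; otherwise the first caseless match seen is kept. Names longer
 * than the buffer are truncated by snprintf(). */
void caseless_consider(struct caseless_result *r,
                       const char *wanted, const char *candidate)
{
    if (strcasecmp(wanted, candidate) != 0)
        return;
    r->matches++;
    if (r->chosen[0] == '\0' || strcmp(wanted, candidate) == 0)
        snprintf(r->chosen, sizeof r->chosen, "%s", candidate);
}

/* Scan dirpath once and collect the caseless matches for wanted.
 * Returns 0 on success, -1 if the directory cannot be opened. */
int caseless_lookup(const char *dirpath, const char *wanted,
                    struct caseless_result *r)
{
    DIR *dir = opendir(dirpath);
    struct dirent *de;

    r->matches = 0;
    r->chosen[0] = '\0';
    if (dir == NULL)
        return -1;
    while ((de = readdir(dir)) != NULL)
        caseless_consider(r, wanted, de->d_name);
    closedir(dir);
    return 0;
}
```

A caller would check `matches`: 0 means no such file, 1 is unambiguous, and anything larger is the "Foobar" versus "foobaR" situation, where the application has to apply its own policy. As noted earlier in the thread, this is also racy: the directory can change between the scan and the subsequent open() on `chosen`.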