|
|
Subscribe / Log in / New account

Wheeler: Fixing Unix/Linux/POSIX Filenames

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 28, 2009 16:41 UTC (Sat) by zooko (guest, #2589)
In reply to: Wheeler: Fixing Unix/Linux/POSIX Filenames by tialaramex
Parent article: Wheeler: Fixing Unix/Linux/POSIX Filenames

"The filenames you get from NT will be sequences of 16-bit code units. They might be Unicode. The filenames you get from Linux will be sequences of 8-bit code units. They might be Unicode (in this case UTF-8) too."

I don't think this is true. The bytes in the filenames in NT are defined to be UTF-16 encodings of characters. The bytes in the filenames in Mac are defined to be UTF-8 encodings. The bytes in the filenames in Linux are not defined to be any particular encoding. It isn't just a definitional issue -- the result is that reading a filename from the Windows or Mac filesystem can't cause you to lose information -- the filename you get in your application is guaranteed to be the same as the filename that is stored in the filesystem. On the other hand, when you read a filename from the filesystem in Linux, then you need to decide how to attempt to decode it, and there is no way to guarantee that your attempt won't corrupt the filename.

Please correct me if I'm wrong, because I'm intending to make the Tahoe p2p disk-sharing app depend on this guarantee from Windows (and from Mac), and to make it painfully work around the lack of this guarantee in Linux.


to post comments

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 28, 2009 17:45 UTC (Sat) by tialaramex (subscriber, #21167) [Link]

To be quite clear about what I'm saying and what I'm not saying:

* I am saying that you can't guarantee that the filenames Windows gives you are all legal UTF-16 Unicode strings. Windows makes no such promise. Non-Win32 programs (including Win32 programs which also use native low-level APIs) may create files which don't obey the convention, and filenames on disk or from a network filesystem are not checked to see if they are valid UTF-16.

* I am NOT saying that there are people running Windows whose filenames are all in SJIS or ISO-8859-8 or even Windows codepage 1252. That would be silly because those encodings (and indeed practically all legacy encodings) are 8-bit and all filenames in Windows are 16-bit. When a Windows filename "means something" at all, the meaning will be encoded as UTF-16, or perhaps if you're really unlucky, UCS-2.

So if your problem is "People keep running my program with crazy locale settings and legacy encodings of filenames" well you have my sympathy, and yes you will need to handle this for Linux (even if only by writing a FAQ entry telling them to switch to UTF-8) and might get away without on Windows.

But if the problem is "My program blindly assumes filenames are legal Unicode strings" then you're in a bad way, stop doing that because it's a bug at least on Linux and Windows, and IMO most likely on Mac OS X too (though their documentation claims otherwise).

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 28, 2009 18:54 UTC (Sat) by foom (subscriber, #14868) [Link]

> The bytes in the filenames in NT are defined to be UTF-16 encodings of characters.

That's not actually true. The windows APIs take arrays of 16-bit "things". Those are supposed to be
UTF-16, but none of the APIs will check that. So, you can easily create invalid surrogate pair
sequences. Now, it's a *lot* easier to ignore this issue on windows than on linux, because:

a) The set of invalid sequences in UTF-16 is a lot smaller than in UTF-8.
b) Nobody creates those by accident. It won't happen just because you set your LOCALE wrong.
c) the windows Unicode APIs are all 16-bit unicode, so they never try decoding the surrogate pair
sequences anyways
d) Even UTF-16->UTF-32 decoders often decode a lone surrogate pair in UTF-16 into a lone
surrogate pair value in UTF-32 (even though it's theoretically not supposed to do that).


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds