Rethinking fsinfo()

Posted Aug 24, 2020 4:51 UTC (Mon) by mathstuf (subscriber, #69389)
In reply to: Rethinking fsinfo() by excors
Parent article: Rethinking fsinfo()

> In that case it's probably safer to treat them as binary data and encode with base64

In what endianness do you treat the incoming 16bit data? Big? Little? Native? Native is easy, but it means you need to know what the host system is before archiving the raw data. Little is easy, but then can be confusing in the raw data viewers (which could render backwards). BOM is ok? But you could also start a filename with a BOM and…blah.

Rethinking fsinfo()

Posted Aug 24, 2020 12:30 UTC (Mon) by excors (subscriber, #95769) [Link] (4 responses)

Since this is about Windows, and Windows is always little-endian (except on Xbox 360, as far as I can tell), it seems obvious to use little-endian here. Since it's not Unicode (it's just an array of 16-bit values) there's no reason to even think about BOMs. You'd simply take the LPCWSTR path (i.e. const wchar_t*, where sizeof(wchar_t)==2) which is used by the Win32 APIs, then cast to uint8_t* and base64-encode as normal. That seems easy.

Most code that processes the path should treat it as an opaque blob or decode it to wchar_t*, and wouldn't need to care about Unicode or surrogates etc.

When you want to display the path to a user, you'd need to do a lossy UTF-16LE decode to get a real Unicode string to pass into your UI system. (Lossy because the path might contain unpaired surrogates which you can't decode safely). (If you're using the Win32 UI APIs, that decoding will probably happen implicitly inside the API implementation; otherwise you might need to do it in the application). The important thing is to avoid trying to decode into a real Unicode string in any context where the lossiness will cause worse than a cosmetic glitch. (So you shouldn't try to store Windows paths directly in JSON, because interoperable JSON requires real Unicode strings, hence the base64 encoding.)

(Linux is the same except 8-bit instead of 16-bit, and probably UTF-8 (or the user's current locale, as NYKevin mentioned, though of course they might have files created under a different locale and there's no way to be sure what they were meant to be) instead of almost always UTF-16LE, and you can encode arbitrary 8-bit strings as JSON strings much more easily than encoding arbitrary 16-bit strings (where you need base64 etc). On both platforms it's a mistake to think that a path is simply an encoded Unicode string, and that you can decode/encode at the edge and do all your internal processing with Unicode.)

Rethinking fsinfo()

Posted Aug 25, 2020 19:02 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (3 responses)

> Most code that processes the path should treat it as an opaque blob or decode it to wchar_t*, and wouldn't need to care about Unicode or surrogates etc.

I agree that just stuffing paths into binary storage is the best solution. However, usually paths need displayed or the storage you're using has a human caring about it at some point in its lifetime. Especially if you're using a container format like JSON. It's nice and all, but a way to store arbitrary binary data without having to figure out how to encode it so that it is Unicode safe would have been much appreciated. (No, BSON don't fix this; they just change the window dressing from `{:"",}` into type-and-length-prefixed fields or type-and-NUL-terminated sequences). CBOR has binary data, but then library support is more widely lacking.

FWIW, I've spent a lot of time thinking about how to stuff paths into JSON: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p...

Rethinking fsinfo()

Posted Aug 26, 2020 19:43 UTC (Wed) by unilynx (guest, #114305) [Link] (2 responses)

I hope for a future where someone introduces a 'sane-names' filesystem mount option, which will forbid the use of invalid UTF8, filenames starting with a dash or space, containing dollar signs, and all the other funny things that make processing filenames hard or dangerous. Spaces in filenames we probably have to live with.

Distributions might slowly make that option the default for new systems, sysadmins can opt-in faster themselves, unless they really have to deal with those few applications (which will hopefully disappear or become obsolete fast) that really, really want to create weird filenames.

Rethinking fsinfo()

Posted Aug 27, 2020 4:48 UTC (Thu) by neilbrown (subscriber, #359) [Link]

Requiring valid UTF-8 is probably sensible for a new filesystem.
Excluding end-of-line characters is probably justifiable too. (or any control char ... I don't think we need TAB or DEL).
Anything else is parochial.
When I'm choosing a name to save my document from my GUI, why should I care about your inability to write safe shell scripts, or even have any understanding that "the shell" exists.
It is bad enough that I cannot put a '/' in my file names, why would you prevent me using '$'??

Rethinking fsinfo()

Posted Aug 29, 2020 11:29 UTC (Sat) by flussence (guest, #85566) [Link]

We're slowly getting there, the kernel has Unicode normalisation for filesystems at long last. I think we could live without ASCII control chars next, though I don't agree that we should forbid filenames from containing strings that an average person at a regular keyboard could type. Computers are meant to serve people, not the other way around.