Rethinking fsinfo()
Posted Aug 24, 2020 4:51 UTC (Mon) by mathstuf (subscriber, #69389) In reply to: Rethinking fsinfo() by excors
Parent article: Rethinking fsinfo()
In what endianness do you treat the incoming 16-bit data? Big? Little? Native? Native is easy, but it means you need to know what the host system was before archiving the raw data. Little is easy, but it can be confusing in raw data viewers (which could render it backwards). Is a BOM OK? But you could also start a filename with a BOM and…blah.
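To make the ambiguity concrete, here is a minimal Python sketch (the byte strings are made up for illustration, not taken from any real filesystem). It shows the same two bytes read under both byte orders, and what happens to a leading U+FEFF depending on whether the decoder treats it as a BOM or as part of the name:

```python
# One UTF-16 code unit for "A", stored little-endian.
raw = b"A\x00"

as_le = raw.decode("utf-16-le")   # read as little-endian
as_be = raw.decode("utf-16-be")   # the same bytes read as big-endian

# U+FEFF (the BOM) serialised little-endian, followed by "A".
with_bom = "\ufeff".encode("utf-16-le") + raw

# The generic "utf-16" codec consumes a leading BOM to pick the byte order...
sniffed = with_bom.decode("utf-16")
# ...while an explicit byte order keeps U+FEFF as an ordinary leading character.
kept = with_bom.decode("utf-16-le")

print(repr(as_le), repr(as_be), repr(sniffed), repr(kept))
```

A name that genuinely starts with U+FEFF is therefore indistinguishable from a BOM-prefixed name unless the byte order is fixed out of band.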
Posted Aug 24, 2020 12:30 UTC (Mon)
by excors (subscriber, #95769)
[Link] (4 responses)
Most code that processes the path should treat it as an opaque blob or decode it to wchar_t*, and wouldn't need to care about Unicode or surrogates etc.
When you want to display the path to a user, you'd need to do a lossy UTF-16LE decode to get a real Unicode string to pass into your UI system. (It is lossy because the path might contain unpaired surrogates, which you can't decode safely.) If you're using the Win32 UI APIs, that decoding will probably happen implicitly inside the API implementation; otherwise you might need to do it in the application. The important thing is to avoid decoding into a real Unicode string in any context where the lossiness will cause anything worse than a cosmetic glitch. (So you shouldn't try to store Windows paths directly in JSON, because interoperable JSON requires real Unicode strings, hence the base64 encoding.)
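A rough Python sketch of that split, assuming the path arrives as raw UTF-16LE bytes. The helper names and the `path_b64` key are hypothetical, chosen only for this example: the lossy decode is used for display only, while the stored form is base64 so that it round-trips exactly.

```python
import base64
import json


def display_name(raw: bytes) -> str:
    """Lossy decode for display only: undecodable units become U+FFFD."""
    return raw.decode("utf-16-le", errors="replace")


def to_json(raw: bytes) -> str:
    """Store the exact bytes; 'path_b64' is a made-up key for this sketch."""
    return json.dumps({"path_b64": base64.b64encode(raw).decode("ascii")})


def from_json(doc: str) -> bytes:
    """Recover the exact bytes that were stored by to_json()."""
    return base64.b64decode(json.loads(doc)["path_b64"])


# "a", then an unpaired high surrogate (0xD800), then "b".
raw = b"a\x00" + b"\x00\xd8" + b"b\x00"

shown = display_name(raw)            # safe to hand to a UI, not to store
restored = from_json(to_json(raw))   # identical to the original bytes
```

The point of keeping the two functions separate is that nothing derived from `display_name()` ever flows back into a file operation.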
(Linux is the same, except that paths are 8-bit instead of 16-bit and are probably UTF-8 instead of almost always UTF-16LE. They might instead be in the user's current locale, as NYKevin mentioned, though of course the user might have files created under a different locale, and there's no way to be sure what they were meant to be. You can also encode arbitrary 8-bit strings as JSON strings much more easily than arbitrary 16-bit strings, where you need base64 etc. On both platforms it's a mistake to think that a path is simply an encoded Unicode string, and that you can decode/encode at the edge and do all your internal processing with Unicode.)
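One possible way to do the 8-bit case, sketched in Python. This is an assumption about what "much more easily" could look like, not necessarily the scheme intended above: each byte is mapped to the code point with the same value (which is what the `latin-1` codec does), so any byte string becomes a valid JSON string and back. The function names are hypothetical.

```python
import json


def path_to_json(raw: bytes) -> str:
    # latin-1 maps every byte 0x00..0xFF to the code point of the same value,
    # so this never fails, whatever encoding the name was "meant" to be in.
    return json.dumps(raw.decode("latin-1"))


def path_from_json(doc: str) -> bytes:
    # Reverse the mapping to get the original bytes back.
    return json.loads(doc).encode("latin-1")


# A name written under a Latin-1 locale: not valid UTF-8.
raw = b"caf\xe9.txt"

encoded = path_to_json(raw)
restored = path_from_json(encoded)
```

The JSON string is only a transport form here; it is not a faithful rendering of the name for anything that was actually UTF-8, so it should not be shown to users as-is.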
Posted Aug 25, 2020 19:02 UTC (Tue)
by mathstuf (subscriber, #69389)
[Link] (3 responses)
I agree that just stuffing paths into binary storage is the best solution. However, paths usually need to be displayed, or the storage you're using has a human caring about it at some point in its lifetime, especially if you're using a container format like JSON. It's nice and all, but a way to store arbitrary binary data without having to figure out how to encode it so that it is Unicode-safe would have been much appreciated. (No, BSON doesn't fix this; it just changes the window dressing from `{:"",}` into type-and-length-prefixed fields or type-and-NUL-terminated sequences.) CBOR has binary data, but library support for it is more widely lacking.
FWIW, I've spent a lot of time thinking about how to stuff paths into JSON: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p...
Posted Aug 26, 2020 19:43 UTC (Wed)
by unilynx (guest, #114305)
[Link] (2 responses)
Distributions might slowly make that option the default for new systems, sysadmins can opt-in faster themselves, unless they really have to deal with those few applications (which will hopefully disappear or become obsolete fast) that really, really want to create weird filenames.
Posted Aug 27, 2020 4:48 UTC (Thu)
by neilbrown (subscriber, #359)
[Link]
Posted Aug 29, 2020 11:29 UTC (Sat)
by flussence (guest, #85566)
[Link]
Excluding end-of-line characters is probably justifiable too (or any control char... I don't think we need TAB or DEL).
Anything else is parochial.
When I'm choosing a name to save my document from my GUI, why should I care about your inability to write safe shell scripts, or even have any understanding that "the shell" exists?
It is bad enough that I cannot put a '/' in my file names, why would you prevent me using '$'??
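As a sketch of how narrow such a rule would be, here is a hypothetical checker in Python. The function name and the exact byte ranges are assumptions for illustration: it rejects only '/', and the ASCII control bytes (0x00-0x1F, which covers NUL, TAB and the end-of-line characters, plus DEL at 0x7F), and lets everything else through, shell metacharacters included.

```python
def is_acceptable_name(name: bytes) -> bool:
    """Hypothetical policy: no '/', no ASCII control bytes, nothing else banned."""
    if not name or b"/" in name:
        return False
    # 0x00-0x1F covers NUL, TAB, LF and CR; 0x7F is DEL.
    return not any(b < 0x20 or b == 0x7F for b in name)


samples = [b"price in $.txt", b"caf\xc3\xa9", b"a\nb", b"a\tb", b"a/b"]
verdicts = {s: is_acceptable_name(s) for s in samples}
```

Bytes at 0x80 and above are deliberately left alone, so the check makes no assumption about the encoding of the name.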