Rethinking fsinfo() [LWN.net]

Rethinking fsinfo()

Posted Aug 22, 2020 16:36 UTC (Sat) by zyga (subscriber, #81533) [Link]

Well, is that a reason for all the ad-hoc formats? We could pass paths as byte arrays. In reality, most software will have issues with non-UTF8-friendly things anyway, because they may want to display it. It's nice that open(2) does not complain but it's pretty rubbish if no application can ever display that thing without "here are some bytes".

Rethinking fsinfo()

Posted Aug 22, 2020 18:50 UTC (Sat) by NYKevin (subscriber, #129325) [Link]

Realistically, if you want to pass raw bytes through a text-formatted thing, you should be using Base64 or something similar. This of course means that paths become unreadable to humans without decoding, but you could have a flag indicating whether a path has been escaped, and then only escape things that aren't valid UTF-8. Alternatively, you could encode the "bad bytes" with \\x00 through \\xFF, which is valid in a JSON string (the backslash is escaped, so it's "just" a backslash followed by three letters), but could be confused with a real filename (so you would need to invent further escaping for that case, as described in https://xkcd.com/1638/).

Or, if you think letting people create files with ridiculous names was a bad idea to begin with, you could simply declare non-UTF-8 paths unsupported and spit out invalid JSON if the user tries to create one. Much userspace software already does something like that anyway (see for example Python 3's surrogateescape hack). But then a lot of parsers will work just fine the vast majority of the time, and break on an obscure condition that the average engineer may not even realize is possible. So that's probably not ideal...

Rethinking fsinfo()

Posted Aug 23, 2020 15:22 UTC (Sun) by abo (subscriber, #77288) [Link] (9 responses)

Surrogate escapes can be used to encode arbitrary bytes in JSON:

(python)

>>> b = bytes(range(256))
>>> u = b.decode("UTF-8", errors="surrogateescape")
>>> import json
>>> j = json.dumps(u)
>>> uin = json.loads(j)
>>> bin = uin.encode("UTF-8", errors="surrogateescape")
>>> [n for n in bin]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255]
>>> bin == b
True

The JSON looks like this:

"\u0000\u0001\u0002\u0003\u0004\u0005\u0006\u0007\b\t\n\u000b\f\r\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\u007f\udc80\udc81\udc82\udc83\udc84\udc85\udc86\udc87\udc88\udc89\udc8a\udc8b\udc8c\udc8d\udc8e\udc8f\udc90\udc91\udc92\udc93\udc94\udc95\udc96\udc97\udc98\udc99\udc9a\udc9b\udc9c\udc9d\udc9e\udc9f\udca0\udca1\udca2\udca3\udca4\udca5\udca6\udca7\udca8\udca9\udcaa\udcab\udcac\udcad\udcae\udcaf\udcb0\udcb1\udcb2\udcb3\udcb4\udcb5\udcb6\udcb7\udcb8\udcb9\udcba\udcbb\udcbc\udcbd\udcbe\udcbf\udcc0\udcc1\udcc2\udcc3\udcc4\udcc5\udcc6\udcc7\udcc8\udcc9\udcca\udccb\udccc\udccd\udcce\udccf\udcd0\udcd1\udcd2\udcd3\udcd4\udcd5\udcd6\udcd7\udcd8\udcd9\udcda\udcdb\udcdc\udcdd\udcde\udcdf\udce0\udce1\udce2\udce3\udce4\udce5\udce6\udce7\udce8\udce9\udcea\udceb\udcec\udced\udcee\udcef\udcf0\udcf1\udcf2\udcf3\udcf4\udcf5\udcf6\udcf7\udcf8\udcf9\udcfa\udcfb\udcfc\udcfd\udcfe\udcff"

Rethinking fsinfo()

Posted Aug 23, 2020 18:05 UTC (Sun) by excors (subscriber, #95769) [Link] (8 responses)

That goes against the interoperability recommendations of the JSON RFC, which says (https://tools.ietf.org/html/rfc8259#section-8.2):

> the ABNF in this specification allows member names and string values to contain bit sequences that cannot encode Unicode characters; for example, "\uDEAD" (a single unpaired UTF-16 surrogate). Instances of this have been observed, for example, when a library truncates a UTF-16 string without checking whether the truncation split a surrogate pair. The behavior of software that receives JSON texts containing such values is unpredictable; for example, implementations might return different values for the length of a string value or even suffer fatal runtime exceptions.

so it seems a bad idea to rely on unpaired surrogates (like surrogateescape) if you're choosing JSON specifically for its interoperability.

Surely the simplest way to encode Linux's 8-bit paths in JSON is to map the bytes 0x00..0xFF onto U+0000..U+00FF and then proceed as normal. When decoding, treat any element >=U+0100 as a syntax error. That should be interoperable between all JSON implementations, and very easy to handle in both Unicode-aware and -unaware applications.

If the application wants to display the path to a user, do a potentially-lossy UTF-8 decode in the UI layer, which is about the best you can ever do with Linux paths regardless of how they're encoded for transport. For all non-display-related processing of paths (which I think is more common and more important than displaying paths), keep them in the simple lossless U+0000..U+00FF representation.

(Windows' 16-bit paths are more complicated, if you want to handle them pedantically correctly: they can contain unpaired surrogates so you can't simply interpret them as JSON-compatible Unicode strings. In that case it's probably safer to treat them as binary data and encode with base64.)

Rethinking fsinfo()

Posted Aug 23, 2020 19:08 UTC (Sun) by NYKevin (subscriber, #129325) [Link] (1 responses)

> Surely the simplest way to encode Linux's 8-bit paths in JSON is to map the bytes 0x00..0xFF onto U+0000..U+00FF and then proceed as normal. When decoding, treat any element >=U+0100 as a syntax error. That should be interoperable between all JSON implementations, and very easy to handle in both Unicode-aware and -unaware applications.

So, basically, pretend we have LC_ALL="[whatever].ISO-8859-1" at both ends, and then require userspace to clean up the mess if LC_ALL is actually set to a different value (which, on modern systems, is typically the case). The problem, of course, is that if you ever try to decode that JSON with a naive implementation, you will get mojibake since they will skip the "clean up the mess" step. So you still need non-naive implementations, which makes me wonder, why bother with JSON in the first place?

> If the application wants to display the path to a user, do a potentially-lossy UTF-8 decode in the UI layer, which is about the best you can ever do with Linux paths regardless of how they're encoded for transport.

Strictly, you should be consulting the locale information rather than just assuming UTF-8. UTF-8 is the most common encoding, but its use in pathnames is not required by any standard that I'm aware of. Now, you can't use something too weird such as UTF-16 (null bytes not allowed), but legacy 8-bit encodings are very much legal and valid on some older systems.

Rethinking fsinfo()

Posted Aug 23, 2020 22:44 UTC (Sun) by shemminger (subscriber, #5739) [Link]

Text interfaces to userspace are brittle (easily broken) and suck. If you look at some of the interface in /proc/net there are columns filled with zeros because some field existed in 2.2 and can never change.

Message based interfaces like netlink are more slightly more difficult to program but offer opportunity for expansion.

Rethinking fsinfo()

Posted Aug 24, 2020 4:51 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (5 responses)

> In that case it's probably safer to treat them as binary data and encode with base64

In what endianness do you treat the incoming 16bit data? Big? Little? Native? Native is easy, but it means you need to know what the host system is before archiving the raw data. Little is easy, but then can be confusing in the raw data viewers (which could render backwards). BOM is ok? But you could also start a filename with a BOM and…blah.

Rethinking fsinfo()

Posted Aug 24, 2020 12:30 UTC (Mon) by excors (subscriber, #95769) [Link] (4 responses)

Since this is about Windows, and Windows is always little-endian (except on Xbox 360, as far as I can tell), it seems obvious to use little-endian here. Since it's not Unicode (it's just an array of 16-bit values) there's no reason to even think about BOMs. You'd simply take the LPCWSTR path (i.e. const wchar_t*, where sizeof(wchar_t)==2) which is used by the Win32 APIs, then cast to uint8_t* and base64-encode as normal. That seems easy.

Most code that processes the path should treat it as an opaque blob or decode it to wchar_t*, and wouldn't need to care about Unicode or surrogates etc.

When you want to display the path to a user, you'd need to do a lossy UTF-16LE decode to get a real Unicode string to pass into your UI system. (Lossy because the path might contain unpaired surrogates which you can't decode safely). (If you're using the Win32 UI APIs, that decoding will probably happen implicitly inside the API implementation; otherwise you might need to do it in the application). The important thing is to avoid trying to decode into a real Unicode string in any context where the lossiness will cause worse than a cosmetic glitch. (So you shouldn't try to store Windows paths directly in JSON, because interoperable JSON requires real Unicode strings, hence the base64 encoding.)

(Linux is the same except 8-bit instead of 16-bit, and probably UTF-8 (or the user's current locale, as NYKevin mentioned, though of course they might have files created under a different locale and there's no way to be sure what they were meant to be) instead of almost always UTF-16LE, and you can encode arbitrary 8-bit strings as JSON strings much more easily than encoding arbitrary 16-bit strings (where you need base64 etc). On both platforms it's a mistake to think that a path is simply an encoded Unicode string, and that you can decode/encode at the edge and do all your internal processing with Unicode.)

Rethinking fsinfo()

Posted Aug 25, 2020 19:02 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (3 responses)

> Most code that processes the path should treat it as an opaque blob or decode it to wchar_t*, and wouldn't need to care about Unicode or surrogates etc.

I agree that just stuffing paths into binary storage is the best solution. However, usually paths need displayed or the storage you're using has a human caring about it at some point in its lifetime. Especially if you're using a container format like JSON. It's nice and all, but a way to store arbitrary binary data without having to figure out how to encode it so that it is Unicode safe would have been much appreciated. (No, BSON don't fix this; they just change the window dressing from `{:"",}` into type-and-length-prefixed fields or type-and-NUL-terminated sequences). CBOR has binary data, but then library support is more widely lacking.

FWIW, I've spent a lot of time thinking about how to stuff paths into JSON: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p...

Rethinking fsinfo()

Posted Aug 26, 2020 19:43 UTC (Wed) by unilynx (guest, #114305) [Link] (2 responses)

I hope for a future where someone introduces a 'sane-names' filesystem mount option, which will forbid the use of invalid UTF8, filenames starting with a dash or space, containing dollar signs, and all the other funny things that make processing filenames hard or dangerous. Spaces in filenames we probably have to live with.

Distributions might slowly make that option the default for new systems, sysadmins can opt-in faster themselves, unless they really have to deal with those few applications (which will hopefully disappear or become obsolete fast) that really, really want to create weird filenames.

Rethinking fsinfo()

Posted Aug 27, 2020 4:48 UTC (Thu) by neilbrown (subscriber, #359) [Link]

Requiring valid UTF-8 is probably sensible for a new filesystem.
Excluding end-of-line characters is probably justifiable too. (or any control char ... I don't think we need TAB or DEL).
Anything else is parochial.
When I'm choosing a name to save my document from my GUI, why should I care about your inability to write safe shell scripts, or even have any understanding that "the shell" exists.
It is bad enough that I cannot put a '/' in my file names, why would you prevent me using '$'??

Rethinking fsinfo()

Posted Aug 29, 2020 11:29 UTC (Sat) by flussence (guest, #85566) [Link]

We're slowly getting there, the kernel has Unicode normalisation for filesystems at long last. I think we could live without ASCII control chars next, though I don't agree that we should forbid filenames from containing strings that an average person at a regular keyboard could type. Computers are meant to serve people, not the other way around.