Unicode 15 released

Posted Sep 15, 2022 23:10 UTC (Thu) by NYKevin (subscriber, #129325)
In reply to: Unicode 15 released by devslashilly
Parent article: Unicode 15 released

That's nice, but String is still UTF-16 according to https://docs.oracle.com/javase/7/docs/api/java/lang/Strin..., and that's arguably a bigger problem than what charset the OS-level APIs use.

Meanwhile, Windows is never going to give up UTF-16, unless they have a full backcompat break. The problem, you see, looks roughly like this:

0. In the bad old days of everyone using a different ISO-8859 variant (except for East Asia, where they had a wildly different set of encodings because CJK scripts are huge), Windows came in different editions, and the code page was baked into the OS. All of their APIs would transparently use the OS-level code page, and if you wanted to support other code pages, too bad, it was impossible.
1. At some point, they decided that was too ugly, and modularized things to such an extent that you could install multiple locales on the same computer and switch between them. But the old APIs were still around, so the OS-level code page became the application code page, set by default to "whatever the active locale specifies," and the API grew a few functions for changing the code page if desired. Eventually, they also added manifest support so you could just do that at the packaging stage instead of having to write actual code for it.
2. The Unicode Consortium comes along and tells everyone "We're doing this great new encoding, it'll have all the languages and fit into 16 bits." Microsoft decides to go all-in on this, and deploys a brand-new set of APIs which are identical to the old APIs, except they use wchar instead of char, and only accept UTF-16 (native endianness, usually LE). The old char APIs are informally deprecated but continue to exist for backcompat reasons. Also, they introduce a whole bunch of preprocessor macros so that you can just code against the two APIs as if they were one API, not think about charsets at all, and then decide which API to use with a single global #define at build time.
3. Everyone figures out that 16 bits is not enough, and surrogate pairs are born. Microsoft bolts on surrogate pair checking to their existing UTF-16 APIs and calls it a day. (Technically, before surrogate pairs existed, it was called "UCS-2" rather than UTF-16, but I have avoided using that name to prevent confusion. It's exactly the same encoding, other than the existence of surrogate pairs.)
4. Everyone figures out that UTF-8 is far superior to UTF-16. After much hemming and hawing, Microsoft adds a code page for UTF-8, as if it's just another legacy encoding, but eventually they update their documentation to vaguely suggest that maybe using the char API with the UTF-8 code page is better in some circumstances. Also, they add an opt-in setting that makes the UTF-8 code page the system-level default, but that only affects the char API, because the wchar API never used code pages in the first place (it's hard-coded to UTF-16 and always has been).
5. So now, the wchar functions must continue to exist, and must continue to use UTF-16, or else lots of applications will stop working. Microsoft *could* tell everyone to recompile against the char functions with a UTF-8 code page, and then drop wchar support, breaking everyone who didn't recompile, but they are not Apple and can't get away with doing something like that. Also, the wording of their documentation strongly suggests that much of the NT codebase uses UTF-16 internally and that it's the "native" encoding of modern Windows, so changing the APIs would be putting lipstick on a pig anyway.
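
To make step 3 concrete, here is a minimal Python sketch of how a code point above U+FFFF gets split into a surrogate pair. The helper name to_utf16_units is made up for illustration, and the result is cross-checked against Python's built-in utf-16-le codec rather than against any Windows API.

```python
import struct

def to_utf16_units(cp):
    """Return the list of 16-bit UTF-16 code units for one Unicode scalar value."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x10000:
        # Fits in 16 bits: this is all that UCS-2 could ever represent.
        return [cp]
    v = cp - 0x10000                  # 20 bits left to distribute
    high = 0xD800 + (v >> 10)         # top 10 bits go in the high surrogate
    low = 0xDC00 + (v & 0x3FF)        # bottom 10 bits go in the low surrogate
    return [high, low]

emoji = 0x1F600
units = to_utf16_units(emoji)

# Cross-check against the built-in codec (little-endian, no BOM).
builtin_units = list(struct.unpack("<2H", chr(emoji).encode("utf-16-le")))
```

Anything written against the 16-bit-only model sees that emoji as two separate "characters", which is why the surrogate checks had to be bolted on afterwards.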

Unicode 15 released

Posted Sep 16, 2022 0:34 UTC (Fri) by Cyberax (✭ supporter ✭, #52523)

> Meanwhile, Windows is never going to give up UTF-16, unless they have a full backcompat break.

Windows allows you to use the "-A" functions with UTF-8 now. It just internally translates the strings into WTF-16.

It's possible to flip this around to using UTF-8 internally and translating the WTF-16 into UTF-8 on the system library border. This will have impact on kernel-level drivers, but even there a compat layer can provide a smooth transition.

This is also made easier because of microkernel-ish design of Windows, where most ioctls/syscalls are actually done via a sort of message passing. So translation can be done in a central location that is fairly straightforward to maintain.

Unicode 15 released

Posted Sep 16, 2022 18:16 UTC (Fri) by NYKevin (subscriber, #129325)

Yes, they can replace the internal guts of NT, if they really want to, but they can't get rid of the -W functions without forcing everyone to recompile. Which means that UTF-16 (I'm not aware of a WTF-16 encoding, although I know that WTF-8 is a thing) has to continue existing.

Furthermore, when the -W functions were introduced, the whole point of them was to cover all of Unicode without having to think about code pages ever again. Therefore, it was reasonable for applications at the time to assume that (for example) you could take the string you get from FindFirstFileW/FindNextFileW and pass it directly to CreateFileW, and that everything will round-trip correctly no matter what that string looks like.* Any future version of Windows has to preserve that invariant, which means that future versions of Windows cannot allow invalid-in-UTF-16 characters in filenames (or they can, but then they have to do some weird hack like the old "short file names" tilde nonsense).

* Obviously, there are TOCTTOU concerns here, but I'm assuming that you're not implementing a security boundary and do not actually care to support the use case where some random other process unexpectedly stomps on your %APPDATA% subdirectory.

Unicode 15 released

Posted Sep 16, 2022 20:11 UTC (Fri) by Cyberax (✭ supporter ✭, #52523)

> FindFirstFileW/FindNextFileW and pass it directly to CreateFileW
I remember reading that Windows is starting to enforce at least some sanity in CreateFileW and doesn't allow some of the more malformed names. And that's discounting gotchas like "aux.txt".

Unicode 15 released

Posted Sep 16, 2022 20:17 UTC (Fri) by NYKevin (subscriber, #129325)

That's beside the point. If you got it from FindFirstFileW in the first place, then it must have been a valid name to begin with, and so you can pass it to CreateFileW.

(Technically, you can create files with invalid names by prefixing the absolute path with \\?\, and then this may fail to work. But that's because \\?\ is the explicit "I know what I'm doing is not valid in the NT universe, just let me do it anyway" magic word, and when you use it, lots of stuff breaks. Barring that sort of chicanery, this always works.)

Unicode 15 released

Posted Sep 16, 2022 20:33 UTC (Fri) by Cyberax (✭ supporter ✭, #52523)

> If you got it from FindFirstFileW in the first place, then it must have been a valid name to begin with, and so you can pass it to CreateFileW.

That's not true even now. You can create an NTFS filesystem on Linux with incorrect file names, for example. Or it might be a network file system with limits that are different from what you'd expect. Barring that, reparse points might cause some files to be "unfindable".

So in practice the symmetry between FindFirstFile and CreateFile is already kinda wobbly.

Unicode 15 released

Posted Sep 16, 2022 21:35 UTC (Fri) by NYKevin (subscriber, #129325)

Yeah, but those are all really unlikely in practice, so nobody's going to bother checking for them anyway. OTOH, "The user's name contains a character that is not in the current code page, and so everything under C:\Users\<name> is inaccessible and/or requires the use of a tilde name hack that will look ugly in your UI" is a much bigger problem... but if you're all-in on the -W functions, you probably assume(d) that you were safe from that.
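
A rough illustration of that code page failure mode, using Python's codecs as a stand-in for the real Windows conversions (the user name is a made-up example, and the actual -A functions may substitute characters differently than Python's "replace" handler does):

```python
name = "Зоя"  # hypothetical user name written in Cyrillic

def encodable(text, codec):
    """Report whether every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

fits_cp1252 = encodable(name, "cp1252")     # Western European code page
fits_cp1251 = encodable(name, "cp1251")     # Cyrillic code page
fits_utf16 = encodable(name, "utf-16-le")   # what the -W functions take

# Forcing the name through the wrong code page destroys it.
lossy = name.encode("cp1252", errors="replace")
```

The name survives only if the active code page happens to be the right one, whereas the UTF-16 path does not depend on the code page at all.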

Unicode 15 released

Posted Nov 8, 2022 2:36 UTC (Tue) by vtjnash (subscriber, #141755)

They could use WTF-8 encoding instead. It is a superset of UTF-8 that also supports round-trip from malformed UTF-16. (The reverse is not fully true, since it can yield different results if two such WTF-8 strings are concatenated and end up yielding a well-formed UTF-16 string after conversion)
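
A small Python sketch of that concatenation caveat. It is not a conforming WTF-8 implementation: units_to_wtf8 is a hypothetical helper that pairs up surrogates by hand and leans on Python's "surrogatepass" error handler to emit the three-byte form for an unpaired surrogate.

```python
def units_to_wtf8(units):
    """Encode 16-bit code units (possibly ill-formed UTF-16) as WTF-8-style bytes."""
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        if 0xD800 <= u <= 0xDBFF and nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
            # Well-formed pair: combine into one supplementary code point.
            cp = 0x10000 + ((u - 0xD800) << 10) + (nxt - 0xDC00)
            i += 2
        else:
            # BMP character or unpaired surrogate: encode the unit on its own.
            cp = u
            i += 1
        out += chr(cp).encode("utf-8", "surrogatepass")
    return bytes(out)

high = [0xD83D]   # unpaired high surrogate
low = [0xDE00]    # unpaired low surrogate

joined_after = units_to_wtf8(high) + units_to_wtf8(low)   # convert, then concatenate
joined_before = units_to_wtf8(high + low)                 # concatenate, then convert
```

Concatenating the two byte strings gives two three-byte surrogate sequences, while concatenating the code units first gives the ordinary four-byte encoding of U+1F600, so naive byte concatenation is not equivalent to concatenation on the UTF-16 side.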

Unicode 15 released

Posted Sep 18, 2022 6:44 UTC (Sun) by jond (subscriber, #37669)

> That's nice, but String is still UTF-16 according to https://docs.oracle.com/javase/7/docs/api/java/lang/Strin...,

I don’t know whether this is fixed now or not, but that page won’t tell you, because it’s for Java SE 7, and OP was talking about the recently released Java SE 18.

Unicode 15 released

Posted Sep 18, 2022 8:50 UTC (Sun) by ABCD (subscriber, #53650)

The Java 18 docs for that class at https://docs.oracle.com/en/java/javase/18/docs/api/java.b... seem to indicate that this hasn't changed: it's still UTF-16.

Unicode 15 released

Posted Sep 18, 2022 8:55 UTC (Sun) by dtlin (subscriber, #36537)

char being a 16-bit value is hard-baked into the JVM, and thus anything that uses a char[] is inherently operating on UTF-16.

Java 9 did add the -XX:+CompactStrings option (JEP 254, on by default), which changed the internal representation of String from char[] to byte[], along with a bit determining whether that representation is Latin-1 or UTF-16, with the former taking up half the space. But there was no change to the user-visible API; it is only an implementation detail.
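
The idea can be modelled in a few lines of Python. This is only a sketch of the storage decision, not the JVM's actual code; compact_encode and the "LATIN1"/"UTF16" labels are names invented for the example.

```python
def compact_encode(s):
    """Return (coder, payload): one byte per char if everything fits in Latin-1,
    otherwise two bytes per UTF-16 code unit."""
    if all(ord(ch) <= 0xFF for ch in s):
        return "LATIN1", s.encode("latin-1")
    # "surrogatepass" keeps unpaired surrogates, which a Java char[] can also hold.
    return "UTF16", s.encode("utf-16-le", "surrogatepass")

ascii_coder, ascii_bytes = compact_encode("hello")
latin_coder, latin_bytes = compact_encode("héllo")
wide_coder, wide_bytes = compact_encode("h€llo")   # U+20AC is outside Latin-1
```

A single character outside Latin-1 pushes the whole string back to two bytes per code unit, which is why the saving mostly shows up on ASCII-heavy data.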

(Java 9 did add String#codePoints() returning an IntStream of code points, but it's unrelated and you could have implemented that yourself with codePointAt()+offsetByCodePoints() anyway, it's just more convenient.)
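
For anyone unsure what the code unit / code point distinction looks like in practice, here is a tiny Python model of it. Python strings are already sequences of code points, so the UTF-16 view is reconstructed through the utf-16-le codec; none of this is Java API.

```python
import struct

text = "a\U0001F600"   # one BMP letter followed by one emoji

# UTF-16 view: what a Java char[] would hold.
raw = text.encode("utf-16-le")
code_units = struct.unpack("<%dH" % (len(raw) // 2), raw)

# Code point view: what an iteration over code points would yield.
code_points = [ord(ch) for ch in text]
```

The same two-character text is three code units long, which is the gap that the code point methods exist to paper over.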