
Rustaceans at the border

Posted Apr 19, 2022 6:34 UTC (Tue) by flussence (guest, #85566)
In reply to: Rustaceans at the border by khim
Parent article: Rustaceans at the border

I feel like Unicode could easily fit all the current semantics into 16 bits of codepoints with lots of room to spare, if it were redone today as RISC (consistent use of combining sequences) instead of CISC (e.g. the entire precomposed CJK glyphs area, all the latin-with-extra-squiggles codeplanes).

Still wouldn't be anywhere near simple to use, but that's how written communication is.



Rustaceans at the border

Posted Apr 19, 2022 11:06 UTC (Tue) by ssokolow (guest, #94568) [Link] (4 responses)

Unfortunately, it's not as simple as it sounds.

Wikipedia's commentary on Han Ideographs (Chinese, Japanese Kanji, Korean Hanja, etc.) being only offered precomposed is "However, attempts to do this for character encoding have stumbled over the fact that Chinese characters do not decompose as simply or as regularly as Hangul does."

...and I remember reading that the precomposed Latin stuff is necessary to guarantee that text strings used as opaque lookup keys (e.g. filesystem paths) wouldn't get altered when round-tripping between a legacy encoding and Unicode, regardless of the circumstances.
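A minimal sketch of that round-trip concern, assuming Latin-1 as the legacy encoding and a made-up key (`legacy_key` and the other names are illustrative, not from any real program): the precomposed code point survives the trip back to bytes, while the decomposed form has no single-byte equivalent.

```python
import unicodedata

# Bytes as a legacy Latin-1 program might have stored a name ("café").
legacy_key = b"caf\xe9"

# Decoding maps byte 0xE9 to the precomposed code point U+00E9 ...
text = legacy_key.decode("latin-1")

# ... so encoding back yields exactly the original bytes.
round_tripped = text.encode("latin-1")

# The decomposed form (e + U+0301) cannot be encoded directly, because
# the combining accent has no Latin-1 byte of its own.
decomposed = unicodedata.normalize("NFD", text)
try:
    decomposed.encode("latin-1")
    decomposed_encodes = True
except UnicodeEncodeError:
    decomposed_encodes = False
```

A converter could of course recompose before encoding; the point of the sketch is only that a one-to-one mapping needs no such step.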

Rustaceans at the border

Posted Apr 20, 2022 1:16 UTC (Wed) by flussence (guest, #85566) [Link] (1 responses)

I can accept that first point, but regarding latin I believe macOS munges every filename through NFD, so it's already a lost cause.

Rustaceans at the border

Posted Apr 20, 2022 3:42 UTC (Wed) by ssokolow (guest, #94568) [Link]

That's fine. Programs are supposed to assume that the filesystem may change under them. That's the cost of a shared resource which doesn't use Microsoft Visual SourceSafe-style locking overkill.

What's important is that, if the OS APIs give you a string identifier, your internal string processing can round-trip what you were given without altering it.
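As a rough illustration, assuming a filesystem that hands back NFD names (the file name and the `same_name` helper are hypothetical): the program keeps whatever string it was given and normalizes only temporary copies when it needs to compare.

```python
import unicodedata

# Precomposed form, as a user might type it.
name_typed = "r\u00e9sum\u00e9.txt"

# Decomposed form, as a normalizing filesystem might report it.
name_from_fs = unicodedata.normalize("NFD", name_typed)


def same_name(a: str, b: str) -> bool:
    """Compare by canonical equivalence without modifying either input."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The two strings differ code point by code point, yet compare as equivalent.
differ_raw = name_typed != name_from_fs
equivalent = same_name(name_typed, name_from_fs)
```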

For comparison, I imagine that using the Windows version of Python's os.path.normcase for purposes other than in-process equality comparisons would cause i18n issues. It uses Python's internal .lower() method, and thus the Unicode case-conversion tables baked into that version of Python, while NTFS lookups use a case-folding table baked into the NTFS partition at the time it was formatted, which ensures that Unicode updates can't introduce case-equivalence collisions for already-existing paths.
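A small sketch of why the choice of table matters. It does not reproduce NTFS's table; it only shows two case-mapping rules available in Python disagreeing on whether two names are equal (the sample strings are illustrative).

```python
import unicodedata

# Version of the Unicode data compiled into this interpreter; a different
# Python release may have been built from a different version.
tables_version = unicodedata.unidata_version

# Simple lowercasing keeps the sharp s, so the names stay distinct.
lower_a, lower_b = "Straße".lower(), "STRASSE".lower()

# Full case folding expands the sharp s to "ss", so the names collide.
fold_a, fold_b = "Straße".casefold(), "STRASSE".casefold()
```

Two pieces of software using different rules, or different versions of the same rule, can therefore disagree about which paths are "the same".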

Rustaceans at the border

Posted Apr 22, 2022 10:24 UTC (Fri) by smurf (subscriber, #17840) [Link] (1 responses)

Assuming that there even is a legacy encoding that has composing codepoints *and* the corresponding composed characters.

And even if there is, you could mark the offenders, e.g. by placing a combining grapheme joiner U+034F between them.
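To illustrate the marking idea with Python's `unicodedata` (the sample strings are illustrative): the joiner has combining class 0, so NFC composition does not merge the base letter with the accent that follows it.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER

plain = "e\u0301"              # e + combining acute accent
marked = "e" + CGJ + "\u0301"  # same pair with the joiner in between

# Without the marker, NFC composes the pair into U+00E9.
composed_plain = unicodedata.normalize("NFC", plain)

# With the marker, the sequence comes back unchanged.
composed_marked = unicodedata.normalize("NFC", marked)
```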

IMHO the real reason is that, at the time, font rendering engines were not clever enough to show alternate glyphs for composed characters where naïve superposition of their constituent parts simply doesn't work. (As in, all accented/umlauted/whatever'd capital letters.)

That, or the precedence of Latin-1 with its mountain of composed characters proved too strong and nobody even thought about solving the problem some other way until it was too late.

That, or the problem was deemed unfixable because instead of expanding Han-encoded texts by 50% (three-byte UTF-8 instead of two-byte words) you'd blow them up by >250% (two bytes for radical A, two for radical B, at least one for either marking the end of a glyph or a joiner; more if there's a radical C involved) which would not have been acceptable at the time. After all, at the time Weird Al chastised Microsoft that "in case you haven't noticed, four-gig drives don't grow on trees".
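The arithmetic can be checked quickly, assuming EUC-JP as the two-byte legacy encoding and taking the per-character cost of the decomposed scheme straight from the estimate above (that scheme is hypothetical; no such encoding is being measured here).

```python
text = "漢字"  # two Han characters

legacy_size = len(text.encode("euc_jp"))  # two bytes per character
utf8_size = len(text.encode("utf-8"))     # three bytes per character

# Relative growth when moving from the legacy encoding to UTF-8.
growth_utf8 = (utf8_size - legacy_size) / legacy_size

# Hypothetical decomposed scheme: two bytes for each of two radicals
# plus at least one terminator or joiner byte, per character.
decomposed_per_char = 2 + 2 + 1
ratio_decomposed = decomposed_per_char / 2  # relative to two-byte legacy
```

Here `growth_utf8` comes out at 0.5 (the 50% figure) and `ratio_decomposed` at 2.5, a lower bound that rises further for characters with a third component.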

Rustaceans at the border

Posted Apr 22, 2022 13:48 UTC (Fri) by khim (subscriber, #9252) [Link]

> That, or the precedence of Latin-1 with its mountain of composed characters proved too strong and nobody even thought about solving the problem some other way until it was too late.

It's not even about the “precedence of Latin-1”. It's about the simple practical need to keep parts of your data in Unicode and parts in some other encoding with constant conversions between these.

It took years (about 10 to 20 years, in fact) before people, finally, stopped using legacy encodings.

If Unicode had been impossible (or very hard and inefficient) to use in that fashion then it would never have taken off.

Size considerations were also quite real: Japan persisted for years with ISO-2022-JP, both because the roundtrip to Unicode is not perfect and because Unicode made documents 50% larger.

The only big issue with Unicode was the initial assumption that 16 bits would be enough, after all: that prompted the useless and very costly trip to UCS-2, then UTF-16 and then, finally, to UTF-8.

UCS-2 made sense but UTF-16 has all the problems of UTF-8 without giving you any benefits.
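A quick sketch of the variable-width point (the sample characters and the `utf16_units` helper are illustrative): once a code point falls outside the 16-bit range, UTF-16 needs two code units for it, so code cannot assume one unit per character any more than it can with UTF-8.

```python
bmp_char = "A"
astral_char = "\U0001F600"  # a code point above U+FFFF


def utf16_units(s: str) -> int:
    """Number of 16-bit code units needed to encode s in UTF-16."""
    return len(s.encode("utf-16-le")) // 2


units_bmp = utf16_units(bmp_char)        # a single code unit
units_astral = utf16_units(astral_char)  # a surrogate pair

# UTF-8 uses four bytes for the same character, the same size as the pair.
utf8_bytes_astral = len(astral_char.encode("utf-8"))
```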

If people had realized earlier that UCS-2 wouldn't work then all that hoopla with two kinds of functions in Java, endless bugs with UTF-16 in browsers and other such things could have been avoided.

But oh, well, we can't change the past, can only adopt UTF-8 for the future.

Rustaceans at the border

Posted Apr 19, 2022 19:40 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

There are more than 100000 unique Han characters. They can usually be decomposed into simpler characters (radical + phonetic), but even with that simplification it's going to get uncomfortably close to 2^16 code points.
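For a sense of scale, the following sketch counts the unified Han ideographs known to the running Python interpreter and compares that with the 16-bit limit. The result depends on the Unicode version the interpreter was built with, and it covers only characters already encoded, so treat it as a lower bound rather than a census.

```python
import sys
import unicodedata

BMP_CAPACITY = 2 ** 16  # number of distinct 16-bit values

# Count code points whose character name marks them as unified ideographs.
han_count = sum(
    1
    for cp in range(sys.maxunicode + 1)
    if unicodedata.name(chr(cp), "").startswith("CJK UNIFIED IDEOGRAPH-")
)

# True when the encoded ideographs alone exceed what 16 bits can address.
exceeds_16_bits = han_count > BMP_CAPACITY
```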


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds