
Rustaceans at the border

Posted Apr 19, 2022 11:06 UTC (Tue) by ssokolow (guest, #94568)
In reply to: Rustaceans at the border by flussence
Parent article: Rustaceans at the border

Unfortunately, it's not as simple as it sounds.

Wikipedia's explanation for why Han ideographs (Chinese, Japanese Kanji, Korean Hanja, etc.) are only offered precomposed is "However, attempts to do this for character encoding have stumbled over the fact that Chinese characters do not decompose as simply or as regularly as Hangul does."

...and I remember reading that the precomposed Latin characters are necessary to guarantee that text strings used as opaque lookup keys (e.g. filesystem paths) don't get altered when round-tripping between a legacy encoding and Unicode, regardless of the circumstances.



Rustaceans at the border

Posted Apr 20, 2022 1:16 UTC (Wed) by flussence (guest, #85566)

I can accept that first point, but regarding Latin, I believe macOS munges every filename through NFD, so it's already a lost cause.

Rustaceans at the border

Posted Apr 20, 2022 3:42 UTC (Wed) by ssokolow (guest, #94568)

That's fine. Programs are supposed to assume that the filesystem may change under them. That's the cost of a shared resource which doesn't use Microsoft Visual SourceSafe-style locking overkill.

What's important is that, if the OS APIs give you a string identifier, your internal string processing can round-trip what you were given without altering it.
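A minimal sketch of that round-trip property, assuming a made-up Latin-1 filename (`caf\xe9.txt`) as the opaque key and a small helper, `can_encode_latin1`, that exists only for this illustration. It only shows why a precomposed U+00E9 lets the bytes survive the trip while a decomposed form would not; it says nothing about how any particular OS API behaves.

```python
import unicodedata

# Hypothetical lookup key, as a legacy Latin-1 application might have stored it.
legacy_bytes = b"caf\xe9.txt"

# Latin-1 -> Unicode -> Latin-1 is lossless because U+00E9 exists precomposed.
as_unicode = legacy_bytes.decode("latin-1")
round_tripped = as_unicode.encode("latin-1")

# NFD splits U+00E9 into 'e' followed by U+0301 (combining acute accent).
decomposed = unicodedata.normalize("NFD", as_unicode)


def can_encode_latin1(text):
    """Illustrative helper: True if every code point fits in Latin-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


print(round_tripped == legacy_bytes)      # the key survived untouched
print(decomposed == as_unicode)           # a different code point sequence
print(can_encode_latin1(decomposed))      # U+0301 has no Latin-1 byte
```

The point is that the program never normalizes the key itself; the decomposed form only becomes usable again after an explicit NFC step, which is exactly the kind of alteration an opaque identifier shouldn't need.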

For comparison, I imagine that using the Windows version of Python's os.path.normcase for anything other than in-process equality comparisons would cause i18n issues. It uses Python's built-in .lower() method, and thus the Unicode case-conversion tables baked into that version of Python, while NTFS lookups use a case-folding table baked into the NTFS partition at the time it was formatted. Freezing the table that way ensures that Unicode updates can't introduce case-equivalence collisions for already-existing paths.
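To illustrate how two case tables can disagree, here's a rough sketch using only Python's own str.lower() and str.upper(). The names `fold_lower` and `fold_upper` are invented for this example, and the upper-casing variant is merely a stand-in for "some other table"; it is not a model of what NTFS actually stores.

```python
import unicodedata

def fold_lower(name):
    """Compare-by-lowercase, in the spirit of a .lower()-based normcase."""
    return name.lower()

def fold_upper(name):
    """Compare-by-uppercase, a stand-in for a different folding table."""
    return name.upper()

plain_s = "s"
long_s = "\u017f"  # LATIN SMALL LETTER LONG S

# Lowercasing leaves the long s alone, so the two names stay distinct.
same_when_lowered = fold_lower(plain_s) == fold_lower(long_s)

# Uppercasing maps both to 'S', so the two names collide.
same_when_uppered = fold_upper(plain_s) == fold_upper(long_s)

print(same_when_lowered, same_when_uppered)

# The tables themselves are tied to the interpreter build.
print(unicodedata.unidata_version)
```

Two pieces of code that each "ignore case" can therefore disagree about whether two names are the same file, and the answer can also shift when the Unicode data version changes.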

Rustaceans at the border

Posted Apr 22, 2022 10:24 UTC (Fri) by smurf (subscriber, #17840)

Assuming that there even is a legacy encoding that has composing codepoints *and* the corresponding composed characters.

And even if there is, you could mark the offenders, e.g. by placing a combining grapheme joiner U+034F between them.
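A quick sketch of that marking idea, using Python's unicodedata module. The variable names are just for illustration. It shows that U+034F, having combining class 0, sits between the base letter and the accent and keeps NFC from fusing them, so a marked sequence stays distinguishable from the precomposed character.

```python
import unicodedata

CGJ = "\u034f"      # COMBINING GRAPHEME JOINER
ACUTE = "\u0301"    # COMBINING ACUTE ACCENT

unmarked = "e" + ACUTE
marked = "e" + CGJ + ACUTE

# Without the marker, NFC composes the pair into precomposed U+00E9.
composed_unmarked = unicodedata.normalize("NFC", unmarked)

# With the marker in between, NFC leaves the sequence as it was.
composed_marked = unicodedata.normalize("NFC", marked)

print([hex(ord(c)) for c in composed_unmarked])
print([hex(ord(c)) for c in composed_marked])
print(unicodedata.combining(CGJ))  # canonical combining class of the joiner
```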

IMHO the real reason is that, at the time, font rendering engines were not clever enough to show alternate glyphs for composed characters where naïve superposition of the constituent parts simply doesn't work. (As in, all accented/umlauted/whatever'd capital letters.)

That, or the precedent of Latin-1 with its mountain of composed characters proved too strong and nobody even thought about solving the problem some other way until it was too late.

That, or the problem was deemed unfixable because instead of expanding Han-encoded texts by 50% (three-byte UTF-8 instead of two-byte words) you'd blow them up by >250% (two bytes for radical A, two for radical B, at least one for either marking the end of a glyph or a joiner; more if there's a radical C involved) which would not have been acceptable at the time. After all, at the time Weird Al chastised Microsoft that "in case you haven't noticed, four-gig drives don't grow on trees".
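The back-of-envelope numbers above can be checked roughly like this. Shift_JIS is used here only as one example of a two-byte legacy encoding, and the "decomposed" scheme (`decomposed_size`, `RADICAL_BYTES`, `TERMINATOR_BYTES`) is purely hypothetical, following the byte counts guessed at in the paragraph; no such encoding exists.

```python
text = "\u6f22\u5b57"  # the two characters of the word "kanji"

legacy = text.encode("shift_jis")   # two bytes per character
utf8 = text.encode("utf-8")         # three bytes per character

utf8_ratio = len(utf8) / len(legacy)

# Hypothetical decomposed encoding: two bytes per radical plus
# at least one byte to mark the end of the glyph (or a joiner).
RADICAL_BYTES = 2
TERMINATOR_BYTES = 1

def decomposed_size(n_chars, radicals_per_char=2):
    """Size in bytes under the made-up radical-by-radical scheme."""
    return n_chars * (radicals_per_char * RADICAL_BYTES + TERMINATOR_BYTES)

decomposed_ratio = decomposed_size(len(text)) / len(legacy)
three_radical_ratio = decomposed_size(len(text), radicals_per_char=3) / len(legacy)

print(len(legacy), len(utf8), utf8_ratio)
print(decomposed_ratio, three_radical_ratio)
```

So precomposed UTF-8 comes out at 1.5 times the legacy size, while the hypothetical scheme would need 2.5 times with two radicals per character and more with three.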

Rustaceans at the border

Posted Apr 22, 2022 13:48 UTC (Fri) by khim (subscriber, #9252)

> That, or the precedent of Latin-1 with its mountain of composed characters proved too strong and nobody even thought about solving the problem some other way until it was too late.

It's not even about the “precedent of Latin-1”. It's about the simple practical need to keep parts of your data in Unicode and parts in some other encoding, with constant conversions between them.

It took years (about 10 to 20 years, in fact) before people finally stopped using legacy encodings.

If Unicode had been impossible (or very hard and inefficient) to use in that fashion, it would never have taken off.

Size considerations were also quite real: Japan persisted for years with ISO-2022-JP, both because round-tripping through Unicode is not perfect there and because Unicode made documents 50% larger.

The only big issue with Unicode was the initial assumption that 16 bits would be enough after all: that prompted the useless and very costly trip to UCS-2, then UTF-16 and then, finally, to UTF-8.

UCS-2 made sense, but UTF-16 has all the problems of UTF-8 without giving you any benefits.

If people had realized earlier that UCS-2 wouldn't work, then all that hoopla with two kinds of functions in Java, endless bugs with UTF-16 in browsers and other such things could have been avoided.
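For anyone who hasn't run into it, here's a small sketch of why UTF-16 stopped being fixed-width once 16 bits turned out to be too few. The sample characters are arbitrary picks for illustration: one inside the 16-bit range and one outside it.

```python
import struct

bmp_char = "\u6f22"          # fits in 16 bits
astral_char = "\U0001F600"   # needs more than 16 bits

bmp_utf16 = bmp_char.encode("utf-16-le")
astral_utf16 = astral_char.encode("utf-16-le")

# One 16-bit unit for the first, two (a surrogate pair) for the second.
bmp_units = len(bmp_utf16) // 2
astral_units = len(astral_utf16) // 2

high, low = struct.unpack("<2H", astral_utf16)
is_surrogate_pair = 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF

print(bmp_units, astral_units, hex(high), hex(low), is_surrogate_pair)

# UTF-8 is variable-width too, but it never pretended otherwise.
print(len(bmp_char.encode("utf-8")), len(astral_char.encode("utf-8")))
```

Code that indexes by 16-bit unit, as if every unit were a character, can split such a pair in half, which is the class of bug being complained about here.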

But oh, well, we can't change the past; we can only adopt UTF-8 for the future.

