
Rustaceans at the border

Posted Apr 22, 2022 10:24 UTC (Fri) by smurf (subscriber, #17840)
In reply to: Rustaceans at the border by ssokolow
Parent article: Rustaceans at the border

Assuming that there even is a legacy encoding that has composing codepoints *and* the corresponding composed characters.

And even if there is, you could mark the offenders, e.g. by placing a combining grapheme joiner U+034F between them.
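To make the joiner idea concrete, here is a minimal sketch using Python's standard `unicodedata` module. It assumes the "offender" is a base letter followed by a combining accent; the variable names are made up for illustration. It shows that a U+034F between the two keeps NFC normalization from merging them into the precomposed character.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER

# 'e' followed by COMBINING ACUTE ACCENT (U+0301)
plain = "e\u0301"
# The same pair, marked by placing a CGJ between base and accent
marked = "e" + CGJ + "\u0301"

nfc_plain = unicodedata.normalize("NFC", plain)
nfc_marked = unicodedata.normalize("NFC", marked)

# The unmarked pair composes to U+00E9; the marked one is left alone,
# because the CGJ (combining class 0) blocks the accent from the base.
print([hex(ord(c)) for c in nfc_plain])   # ['0xe9']
print([hex(ord(c)) for c in nfc_marked])  # ['0x65', '0x34f', '0x301']
```

Whether a given legacy converter would actually emit such a marker is a separate question; this only shows that the marker survives normalization.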

IMHO the real reason is that, at the time, font rendering engines were not clever enough to show alternate glyphs for composed characters where a naïve superposition of their constituent parts simply doesn't work. (As in, all accented/umlauted/whatever'd capital letters.)

That, or the precedence of Latin-1 with its mountain of composed characters proved too strong and nobody even thought about solving the problem some other way until it was too late.

That, or the problem was deemed unfixable because instead of expanding Han-encoded texts by 50% (three-byte UTF-8 instead of two-byte words) you'd blow them up by >250% (two bytes for radical A, two for radical B, at least one for either marking the end of a glyph or a joiner; more if there's a radical C involved), which would not have been acceptable back then. After all, at the time Weird Al chastised Microsoft that "in case you haven't noticed, four-gig drives don't grow on trees".
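The size arithmetic can be checked with a quick sketch. EUC-JP stands in for "a two-byte legacy encoding" here, and the decomposed scheme is purely hypothetical (no such encoding is being quoted); its per-character cost is just the 2 + 2 + 1 bytes described above.

```python
text = "漢字"  # two Han characters

legacy = text.encode("euc_jp")  # two bytes per character
utf8 = text.encode("utf-8")     # three bytes per character

growth = (len(utf8) - len(legacy)) / len(legacy)
print(len(legacy), len(utf8), f"{growth:.0%}")  # 4 6 50%

# Hypothetical radical-based scheme (an assumption for illustration):
# two bytes for radical A, two for radical B, one terminator or joiner.
decomposed_per_char = 2 + 2 + 1
legacy_per_char = 2
ratio = decomposed_per_char / legacy_per_char
print(ratio)  # 2.5, i.e. 2.5 times the legacy size at minimum
```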



Rustaceans at the border

Posted Apr 22, 2022 13:48 UTC (Fri) by khim (subscriber, #9252)

> That, or the precedence of Latin-1 with its mountain of composed characters proved too strong and nobody even thought about solving the problem some other way until it was too late.

It's not even about the “precedence of Latin-1”. It's about the simple practical need to keep parts of your data in Unicode and parts in some other encoding with constant conversions between these.
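A rough sketch of what that constant conversion looks like, using Latin-1 as the legacy side (the sample string and variable names are only for illustration). The precomposed form goes back and forth losslessly; the decomposed form of the same text cannot be encoded to Latin-1 without an extra composition step.

```python
import unicodedata

legacy_bytes = b"caf\xe9"              # "café" as stored in Latin-1
text = legacy_bytes.decode("latin-1")  # U+00E9 is a precomposed code point

# Precomposed text converts back byte-for-byte.
assert text.encode("latin-1") == legacy_bytes

# The canonically equivalent decomposed form: 'e' + U+0301
decomposed = unicodedata.normalize("NFD", text)

try:
    decomposed.encode("latin-1")
    roundtrips = True
except UnicodeEncodeError:
    # U+0301 has no Latin-1 byte, so a plain 1:1 conversion fails.
    roundtrips = False

print(len(text), len(decomposed), roundtrips)  # 4 5 False
```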

It took years (about 10 to 20 years, in fact) before people, finally, stopped using legacy encodings.

If Unicode had been impossible (or very hard and inefficient) to use in that fashion, it would never have taken off.

Size considerations were also quite real: Japan persisted for years with ISO-2022-JP, both because the roundtrip to Unicode is not perfect and because Unicode made documents 50% larger.

The only big issue with Unicode was the initial assumption that 16 bits would be enough, after all: that prompted this useless and very costly trip to UCS-2, then UTF-16 and then, finally, to UTF-8.

UCS-2 made sense but UTF-16 has all the problems of UTF-8 without giving you any benefits.
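As a small illustration of the "all the problems" part: once a code point falls outside the Basic Multilingual Plane, UTF-16 is just as variable-width as UTF-8. The two sample characters below are arbitrary choices, not anything specific to the discussion.

```python
bmp = "\u00e9"         # U+00E9, inside the Basic Multilingual Plane
astral = "\U0001F600"  # U+1F600, outside it

for ch in (bmp, astral):
    units16 = len(ch.encode("utf-16-le")) // 2  # 16-bit code units
    units8 = len(ch.encode("utf-8"))            # 8-bit code units
    print(f"U+{ord(ch):04X}", units16, units8)

# U+00E9 1 2
# U+1F600 2 4
```

So code that indexes by 16-bit unit still has to cope with characters split across two units, which is the same class of bug people wanted to avoid with a fixed-width encoding.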

If people had realized earlier that UCS-2 wouldn't work, then all that hoopla with two kinds of functions in Java, endless bugs with UTF-16 in browsers and other such things could have been avoided.

But oh, well, we can't change the past, we can only adopt UTF-8 for the future.


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds