Is the current Unicode design impractical?

Posted Dec 20, 2015 19:37 UTC (Sun) by raiph (guest, #89283)
In reply to: Is the current Unicode design impractical? by butlerm
Parent article: Unicode, Perl 6, and You

Thanks, and 행운을 빕니다 [1]

Recapping, this exchange began with an assertion that EGC is problematic as a general basis for segmenting text into characters. The exemplar was sorting Hangul (Korean) text.

Aiui:

> So if you go to all the trouble to allow elements of a string to include more than one codepoint

Text processing systems that implement Unicode-compliant support for programmatically distinguishing "what a user thinks of as a character" (quoting Unicode.org) in arbitrary Unicode text must deal with characters that include more than one codepoint.[2]
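A minimal Python sketch of the point, for illustration only. The helper `naive_clusters` is hypothetical and deliberately crude: it just glues combining marks onto the preceding code point, which is an assumption made for brevity and NOT the extended grapheme cluster rules from Unicode's text segmentation annex.

```python
import unicodedata

def naive_clusters(text):
    """Group each code point with the combining marks that follow it.

    Hypothetical helper: a rough approximation only, not the Unicode
    extended grapheme cluster (EGC) algorithm.
    """
    clusters = []
    for ch in text:
        # A nonzero canonical combining class means "attaches to the previous base".
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# "café" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT.
cafe = "cafe\u0301"
# The Hangul syllable U+D55C, decomposed into its three conjoining jamo.
hangul_nfd = unicodedata.normalize("NFD", "\uD55C")

print(len(cafe), len(naive_clusters(cafe)))              # 5 4
print(len(hangul_nfd), len(naive_clusters(hangul_nfd)))  # 3 3
```

The first line shows a user-perceived character made of two code points. The second shows why the crude rule isn't enough: conjoining jamo are not combining marks in this sense, so the decomposed syllable stays split into three pieces, whereas a reader would see one character. Getting that right takes the real segmentation rules.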

> perhaps it would be much better to leave what is combined and what isn't up to the user on a string by string basis.

It needs to be finer-grained than that, in the sense that, according to the Unicode standard, the right way to segment any given string depends on what operation you're applying to that string as well as on the language/locale in effect.

So the same string might, at least in principle, be segmented one way for sorting and another for regexing under one locale setting, and then in a third and a fourth way with another locale setting in effect.

> [perhaps one should] Allow programs to convert a string to NFG form, NFC form, NFD form, NFP form, or an unspecified form and have the result be able to be processed as such.

Unicode calls this notion of arbitrary (unspecified) forms "tailored grapheme clusters" (quoting Unicode.org).

http://cldr.unicode.org/development/development-process/d... may provide more insight. Among the bugs listed at the end, note in particular http://unicode.org/cldr/trac/ticket/2142 for "Alternate Grapheme Clusters" that was filed 7 years ago and switched from new to accepted status 7 months ago.

NFG is intended to be a general conceptual and implementation mechanism for dealing with grapheme clustering in Perl 6 and Rakudo. If the current implementation of NFG doesn't already embrace customizable grapheme clustering similar to the aforementioned Alternate Grapheme Clusters, it surely will when the limits of EGC become clear and someone tackles the work that needs to be done.

I see your NFP idea[3] as an example of an alternate grapheme clustering which would be an NFG variant if it were implemented.

> I will think about trying this out, as a matter of curiosity if nothing else. Thanks.

If you find time I suggest you consider trying at least two text processing systems or programming languages whose developers *explicitly aspire to try to get grapheme clustering right* so you can compare them. Afaik the most mature ones are the CLDR / ICU projects and systems that build on them. The Rakudo Perl 6 compiler implementation[4] does *not* (directly) use CLDR/ICU; it may be a particularly interesting exercise to compare Rakudo results with those of another programming language which does use CLDR/ICU and claims to do grapheme clustering.

----

[1] Good luck in Korean according to translate.google.com :)

[2] One cannot support programmatic manipulation of "what a user thinks of as a character" (quoting Unicode.org) in a Unicode-compliant manner unless one does indeed go to all the trouble of allowing elements of a string to include more than one code point. No mainstream programming language has attempted to make this simple until very recently, but that should not be taken as evidence that it isn't a functional requirement for mature Unicode compliance.

(Btw, you earlier suggested "codepoints for every letter equivalent in actual use, so that NFC is complete". Unfortunately "normalization form NFC (the composed form favored for use on the Web) is frozen" (quoting Unicode.org). So while they've frozen in place a set of precomposed graphemes for contemporary Hangul making the NFC shortcut work out, this approach breaks down when applied generally for arbitrary Unicode text.)
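To make the NFC point concrete, here is a small Python sketch using the standard `unicodedata` module. The helper name `nfc_len` and the sample strings are just illustrative choices, not anything taken from Unicode.org or Rakudo.

```python
import unicodedata

def nfc_len(text):
    """Number of code points left after NFC normalization (illustrative helper)."""
    return len(unicodedata.normalize("NFC", text))

# Three conjoining jamo that compose to the single syllable U+D55C.
hangul_jamo = "\u1112\u1161\u11AB"
# 'e' + combining acute: a precomposed form (U+00E9) exists.
e_acute = "e\u0301"
# 'q' + combining acute: no precomposed form exists.
q_acute = "q\u0301"

print(nfc_len(hangul_jamo))  # 1
print(nfc_len(e_acute))      # 1
print(nfc_len(q_acute))      # 2
```

For contemporary Hangul and common Latin accents, NFC happens to give one code point per character. The last case stays at two code points, and because NFC is frozen that won't change, so counting code points after NFC can't serve as a general way of counting characters.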

[3] (For other readers, NFP is a mooted Normalization Form Phoneme for use when sorting text, at least Hangul text.) butlerm, my review of Unicode.org documents, triggered by this exchange, has increased my confidence that a "grapheme" per the Unicode spec is a completely general concept ("A minimally distinctive unit of writing in the context of a particular writing system"). The "formatting" unit that you've mentioned is just one example of a grapheme; other units, such as one corresponding to a phoneme, are, in Unicode parlance, just graphemes of different flavors.

[4] The reference Perl 6 compiler Rakudo implements grapheme clustering via its MoarVM (http://moarvm.org) backend.