
Unicode, Perl 6, and You

Day 7 in the ongoing Perl 6 advent calendar is concerned with how the language handles Unicode. "However, Perl 6 does this work for you, keeping track of these collections of codepoints internally, so that you just have to think in terms of what you would see the characters as. If you’ve ever had to dance around with substring operations to make sure you didn’t split between a letter and a diacritic, this will be your happiest day in programming."


Unicode, Perl 6, and You

Posted Dec 7, 2015 18:51 UTC (Mon) by b7j0c (subscriber, #27559) [Link]

Nice! I am finally ready to accept unicode in source code files, and perl6 looks like a good opportunity to start. It's been hard to tear myself away from learning perl6...there are just so many neat little nuggets. I don't care how it is derided or if anyone else uses it...I'm having a ton of fun with it so far.

Unicode, Perl 6, and You

Posted Dec 7, 2015 19:27 UTC (Mon) by troy.unrau (guest, #73654) [Link]

I've been using unicode as my source format for python for a while. Couldn't be happier! Glad perl is going down the same road, as it's totally worth it. I do a lot of math and physics, and there's always greek letters (think π, etc.). It's useful to refer to equations from texts verbatim in the code without having to munge everything into ASCII first - especially in comments!

Unicode, Perl 6, and You

Posted Dec 8, 2015 2:41 UTC (Tue) by pabs (subscriber, #43278) [Link]

It will be interesting to people entering those obfuscated/evil code programming contests too.

Unicode, Perl 6, and You

Posted Dec 8, 2015 7:31 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

Yeah. So we can have gems like this one: http://corejavaint.blogspot.com/2015/10/weired-questions-... (question one)

Unicode, Perl 6, and You

Posted Dec 13, 2015 1:05 UTC (Sun) by butlerm (guest, #13312) [Link]

Extended grapheme cluster seems like a remarkably impractical basis to form the foundation of a language's default text processing system. Besides the efficiency issues, no one spells or sorts that way in real life. If you ask anyone to spell one of these things - however they are formatted on the display - what you get in almost all cases is a sequence of generally phonetic elements, and almost any sane sorting scheme will proceed lexicographically along the same lines.

Phoneme bears a much closer relationship to Unicode code point (for most languages), with the exception of a number of combinations of the latter that probably should have been rendered as independent code points in the first place, the way they typically have been in earlier character sets. The letter 'a' plus diaeresis for example should probably never be split at all, and it is arguably some kind of mistake that Unicode insists that most of these things be represented as code point sequences rather than defining precomposed, phonetic equivalent, lexicographically sortable codepoints in the first place.

Then a language like Python that pragmatically insists on dealing with codepoints rather than grapheme clusters as the basis for string processing would be more straightforward to use for general purpose work, and the designers of languages like Perl 6 would perhaps not be so inclined to go to such extreme lengths to achieve a similar or more exotic effect.

Unicode, Perl 6, and You

Posted Dec 13, 2015 2:46 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link]

> The letter 'a' plus diaeresis for example should probably never be split at all, and it is arguably some kind of mistake that Unicode insists that most of these things be represented as code point sequences rather than defining precomposed, phonetic equivalent, lexicographically sortable codepoints in the first place.
How would you sort CJK characters, then? By the number of strokes, by radical, by their four-corners string?

Unicode, Perl 6, and You

Posted Dec 13, 2015 18:50 UTC (Sun) by butlerm (guest, #13312) [Link]

Chinese characters are single codepoints that are easily sorted in any order you care to adopt using a lookup table of some sort. Naturally there are local standards for that sort of thing based on radicals, number of strokes, etc. The problem is sorting or processing graphemes that are a series of codepoints - it is almost impossible to do efficiently. Normalization Form C where common combinations are reduced to single codepoints is incredibly useful for that reason. Processing something in Normalization Form D where everything is broken out is remarkably unnatural for the sort of general purposes things like regular expressions were designed for.

It is hard to imagine that treating extended graphemes consisting of a series of codepoints as the base unit of string processing in a language is ever going to be successful, whether on efficiency grounds or on awkwardness grounds or both. Extended graphemes in many cases are more like words rather than letters.

Hangul (Korean) extended graphemes are typical. They are composed of a series of codepoints each of which corresponds to a phoneme. They sort lexicographically by the component phonemes or 'letters' not by extended grapheme (or syllable). So if you have a choice, why in the world would you want to use syllables as the base unit of string processing rather than codepoints that correspond to simple letters or phonemes?

Unicode, Perl 6, and You

Posted Dec 14, 2015 9:07 UTC (Mon) by raiph (guest, #89283) [Link]

> The problem is sorting or processing graphemes that are a series of codepoints - it is almost impossible to do efficiently.

Perl 6 devs (and presumably Swift devs) are treating this as a problem that programming languages can overcome rather than a fundamental mistake in Unicode. (A related Perl 6 motto is "Torture the implementors on behalf of the users.".)

> Normalization Form C where common combinations are reduced to single codepoints is incredibly useful for that reason.

If you rely on the fact that common combinations NFC normalize to a single codepoint per grapheme then your code will go wrong when it encounters uncommon combinations.
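
The point about uncommon combinations can be checked outside Perl 6. Below is a minimal Python sketch using the standard `unicodedata` module. The two sample strings are illustrative choices that appear nowhere else in this thread: 'a' plus a combining diaeresis has a precomposed form, while 'x' plus a combining acute accent does not.

```python
import unicodedata

# 'a' + COMBINING DIAERESIS (U+0308) has a precomposed form, U+00E4,
# so NFC reduces it to a single code point.
common = unicodedata.normalize("NFC", "a\u0308")

# 'x' + COMBINING ACUTE ACCENT (U+0301) has no precomposed form,
# so NFC has to leave it as two code points.
uncommon = unicodedata.normalize("NFC", "x\u0301")

print(len(common))    # 1
print(len(uncommon))  # 2
```

Code that assumes one code point per user-perceived character after NFC would handle the first string correctly and split the second between the letter and its accent.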

> Processing something in Normalization Form D where everything is broken out is remarkably unnatural for the sort of general purposes things like regular expressions were designed for.

I'm not sure why you raised this, but fwiw Perl 6 does not make odd bedfellows like NFD and regexes do anything unnatural with each other. (Perl 6 substring processing is impressively simple overall UI/api wise even when mixing and matching strings using the default string type (Str/graphemes), Unicode strings (Uni/codepoints), buffers (Buf/bytes), NFD and NFC normalizations, regexes, and so on.)

> It is hard to imagine that treating extended graphemes consisting of a series of codepoints as the base unit of string processing in a language is ever going to be successful

I think you aren't imagining hard enough. :)

> whether on efficiency grounds or on awkwardness grounds or both.

I can see leveling a memory efficiency complaint at Rakudo Perl 6. It incorporates a Perl 6 invention, an internal mechanism called NFG, short for "Normalization Form Grapheme". The current version of NFG results in 4 bytes per character in RAM. (Iirc an optimization that stored Latin-1 text in compact form was recently removed but may return.)

I can see leveling a speed complaint at both Rakudo Perl 6 and Swift. But in the Perl 6 case this only rings true because Rakudo is so slow in so many regards. The NFG design is actually an exception -- its speed is a strength, not a weakness, as it has an O(1) string indexing design (unlike Swift's).

Your awkwardness complaint doesn't ring true at all for Perl 6. The whole point of NFG (and the Uni and Buf types etc.) was to contribute toward making basic Perl 6 substring processing simple. This plan has worked out very nicely (imo) so far.

> Extended graphemes in many cases are more like words rather than letters.

The Unicode standard specifies use of custom (tailored) grapheme clustering instead of general purpose EGC when EGC isn't appropriate.

That said, I'm pretty sure that the upcoming 6.c release won't have any custom grapheme clustering.

In the meantime one might use NFC to process strings at that level instead. (Or perhaps one might use Inline::Perl5 to run Perl 5 code within Perl 6. Perl 5 has really good Unicode functionality even if it's relatively complicated UI/api wise.)

> So if you have a choice, why in the world would you want to use syllables as the base unit of string processing rather than codepoints that correspond to simple letters or phonemes?

Imo this conclusion/question represents a misunderstanding of Unicode; of graphemes, grapheme clustering, EGC, and tailored grapheme clusters; of the role of NFC; and perhaps of Perl 6 too.

Unicode, Perl 6, and You

Posted Dec 14, 2015 18:54 UTC (Mon) by butlerm (guest, #13312) [Link]

> This plan has worked out very nicely (imo) so far.

I suggest that the Perl 6 devs ask enough people that use relevant foreign languages their opinion about it first. For example, in Hangul (Korean), an extended grapheme is made of a sequence of up to five 'letters', and sorting is done by letter, not by grapheme. Grapheme (or syllable) is just a formatting thing, having essentially nothing to do with the way words are entered, spelled, sorted, or otherwise processed.

'Hangul' in Hangul contains six letter equivalents, an 'h', an 'a', an 'n', a 'g', a 'u', an 'l', is sorted that way in a Korean dictionary, and is only written as two graphemes for historical reasons - basically to match the size of inline Chinese characters. If you go around treating arbitrary combinations of letters as the base unit of string processing, you will make everything unnecessarily complicated, comparable to the adoption of several thousand English syllables as base units that everyone is forbidden to break into component letters. It is a disaster waiting to happen.
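
The two views of the word in question can be seen side by side with a short Python sketch using the standard `unicodedata` module. This is only an illustration of the code point structure, not a statement about how Perl 6 or any collation library handles Korean. The word is written with escapes so that it is certain to be the precomposed form.

```python
import unicodedata

# "Hangul" written as two precomposed syllable blocks (U+D55C, U+AE00).
word = "\ud55c\uae00"

# NFD breaks each syllable block into its conjoining jamo (the 'letters').
jamo = unicodedata.normalize("NFD", word)

print(len(word))  # 2
print(len(jamo))  # 6

# Print the Unicode name of each jamo.
for ch in jamo:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# NFC recomposes the jamo into the original two syllable blocks.
print(unicodedata.normalize("NFC", jamo) == word)  # True
```

So both the two-unit view and the six-unit view can be recovered from the same text; the disagreement in this thread is about which one a language should expose by default.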

Unicode, Perl 6, and You

Posted Dec 14, 2015 21:21 UTC (Mon) by Fowl (subscriber, #65667) [Link]

> 'Hangul' in Hangul contains six letter equivalents, an 'h', an 'a', an 'n', a 'g', a 'u', an 'l', is sorted that way in a Korean dictionary, and is only written as two graphemes for historical reasons

I may be being dense, but why is this a problem? If it is exactly equivalent to 'letters' then it wouldn't sort any differently to those letters. The advantage being that a string splitting/truncating/concatenation operation would operate visually/word wise instead of letter wise (and arguably something that would benefit Latin scripted text).

> words are entered, spelled, sorted, or otherwise processed

Ah, I think I see the conflict here - between two classes of operations: display/opaque processing (ie. split/concat/truncate) and more language specific operations like you've mentioned.

> unnecessarily complicated, comparable to the adoption of several thousand English syllables as base units

The implementation can still see the components ;)

> that everyone is forbidden to break into component letters.

And so can everyone else, if they need/want to.

---

I don't know anything about anything, but this seems very 'perly' at least.

Unicode, Perl 6, and You

Posted Dec 15, 2015 17:19 UTC (Tue) by butlerm (guest, #13312) [Link]

> I may be being dense, but why is this a problem?

Per the article Perl 6 is going to process text using Extended Grapheme Cluster as the base unit of string processing. In a language like Korean (hangul) that means treating a series of up to five letters as a single element. That is a rather unnatural way of going about anything of the kind in a large number of languages, comparable to making syllables rather than letters in Latin languages the base unit of string processing.

To take a simple example, how do you compare two strings in such a representation? You can't do it strcmp style on EGCs - you have to break it back down into letters or rather "collation graphemes" and then you can do the comparison. For many languages you would be better off not having combined the codepoints into Extended Grapheme Clusters at all, because EGCs have nothing to do with sorting. They are a formatting construct.

Unicode, Perl 6, and You

Posted Dec 15, 2015 1:58 UTC (Tue) by raiph (guest, #89283) [Link]

> I suggest that the Perl 6 devs ask enough people that use relevant foreign languages their opinion about it

I'm not sure what "it" you are talking about but they've been talking through Unicode issues for 15 years, and will presumably do so for another 15, so I'm sure they'd love to hear more such views (though not this side of christmas). Indeed, one of the things I am hoping will emerge from this exchange is something useful to bring to #perl6 attention.

> For example, in Hangul (Korean), an extended grapheme is made of a sequence of up to five 'letters', and sorting is done by letter, not by grapheme.

Aiui, if sorting is done by letter, then the letter is a grapheme for that purpose. A different grapheme than one that visually comprises five letters, but a grapheme nevertheless. That's what tailored grapheme clusters are about. At least, that's what my understanding is.

> Grapheme (or syllable) is just a formatting thing

The cool thing is that this should be simple to resolve. Either you're right about the Unicode standard definition of grapheme and the Unicode recommendations are broken and I've misunderstood how broken Unicode is, or you have misunderstood the Unicode standard's use of the term "grapheme", the phrase "grapheme cluster", and the respective roles of EGCs, tailored grapheme clusters, and so on.

I could well believe the former. Perhaps you can point me to discussions explaining how this aspect of Unicode's design and recommendations are problematic in the manner you describe; that would be very helpful.

Unicode, Perl 6, and You

Posted Dec 15, 2015 2:40 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

Do you want to see something completely different?

Try https://en.wikipedia.org/wiki/Devanagari#Conjuncts - clusters of multiple individual characters, accent marks, all of the fun.

Personally, I don't see how grapheme-based decomposition is superior to simple plain UTF-8.

Unicode, Perl 6, and You

Posted Dec 16, 2015 6:13 UTC (Wed) by raiph (guest, #89283) [Link]

> Try https://en.wikipedia.org/wiki/Devanagari#Conjuncts - clusters of multiple individual characters, accent marks, all of the fun.

Yeah. It gets crazy. I've been trying to grok Unicode since the late 90s and really digging in to the Unicode.org documents and fora in recent years. The almost infinite complexity is *still* almost overwhelming.

> Personally, I don't see how grapheme-based decomposition is superior to simple plain UTF-8.

That's like comparing apples and oranges, or maybe atoms and molecules.

A UTF-8 string is a sequence of code points. Sticking with Devanagari:

say "षि".encode("utf8") # displays utf8:0x<e0 a4 b7 e0 a4 bf>
say "षि".encode("utf8").elems # displays 6, the count of elements

The grapheme clustering corresponding to a single displayed character finds just 1 character:

say "षि".chars # displays 1, the count of characters

So grapheme clustering is not superior. It's different. UTF-8 is an encoding expressed as code points. Grapheme clustering clusters code points in to "what a user thinks of as a character" or "user-perceived characters" for various definitions of "user" and "perceived" and "character".

For some languages (or rather strings) and string operations there's no difference:

say "Foo".encode("utf8") # displays utf8:0x<46 6f 6f>
say "Foo".encode("utf8").elems # displays 3, the count of elements
say "Foo".chars # displays 3, the count of characters
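
For comparison, here is a rough Python sketch of the same three counts: bytes, code points, and user-perceived characters. Python's `len` counts code points and the standard library has no grapheme segmentation, so `approx_grapheme_count` is a hypothetical helper that attaches every mark (general category M*) to the preceding code point. That is a deliberate simplification and not the full extended grapheme cluster rules.

```python
import unicodedata

def approx_grapheme_count(text):
    """Count code points that are not marks (category M*).

    A crude stand-in for grapheme clustering: each mark is assumed to
    belong to the code point before it. Real EGC rules cover far more.
    """
    return sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))

for s in ("\u0937\u093f", "Foo"):  # the Devanagari example above, then "Foo"
    print(len(s.encode("utf-8")), len(s), approx_grapheme_count(s))
# prints "6 2 1" and then "3 3 3"
```

The middle number is the code point count, which is what Python exposes by default and sits between the byte count and the character count for the Devanagari string.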

But, as the Unicode standard says:

> It is important to recognize that what the user thinks of as a “character”—a basic unit of a writing system for a language—may not be just a single Unicode code point.

and

> Grapheme cluster boundaries are important for collation, regular expressions, UI interactions (such as mouse selection, arrow key movement, backspacing), segmentation for vertical text, identification of boundaries for first-letter styling, and counting “character” positions within text.

--------

For many programming tasks there's no need to collate strings, or apply a regex, or count characters. And some languages are lucky enough to have their characters be mapped 1-to-1 to assigned code points so collation, regexing, and counting characters just work as one would expect anyway.

But when that isn't the case, either the programming language or the programmer (possibly by just `use`ing a module) has to do the work to calculate the "characters" in a string that correspond to a given string operation, or use relatively clumsy regex extensions, or whatever. The Perl 6 language design intent is to relieve the programmer of the burden of doing this extra work, delivering string operations that are about the same complexity as they are in most programming languages for ASCII.

Unicode, Perl 6, and You

Posted Dec 16, 2015 18:29 UTC (Wed) by butlerm (guest, #13312) [Link]

> say "षि".encode("utf8") # displays utf8:0x<e0 a4 b7 e0 a4 bf>

Is that the UTF8 NFC encoding or the UTF8 NFD encoding? Surely there is a standard when converting a NFG string into anything Uni or UTF ish?

Unicode, Perl 6, and You

Posted Dec 17, 2015 2:36 UTC (Thu) by raiph (guest, #89283) [Link]

> Is that the UTF8 NFC encoding or the UTF8 NFD encoding?

Unless one specifies otherwise, Perl 6 is *supposed* to normalize a text string to NFC when it's not NFG.

A normalization reporting tool I use to check things (when I'm not being too lazy) gives very different results http://demo.icu-project.org/icu-bin/nbrowser?t=&s=093... Other sources confirm the ICU project's tool's result. I plan to investigate further and report back in a day or three when I get to the bottom of this but didn't want to delay letting you know that Rakudo's encoded result in my example currently looks to me like gobbledygook.

---

I'm hoping this hasn't invalidated the point I intended to make. What I was speaking of is my understanding of Unicode's design. This has nothing to do with either the Perl 6 language design, or the Rakudo implementation, bugs or no bugs.

Unicode, Perl 6, and You

Posted Dec 17, 2015 17:12 UTC (Thu) by raiph (guest, #89283) [Link]

> Is that the UTF8 NFC encoding or the UTF8 NFD encoding?

For this particular character the four Unicode normalizations all result in the same codepoints prior to the encoding into UTF-8:

< षि > .NFC .say
NFC:0x<0937 093f>

< षि > .NFKC .say
NFKC:0x<0937 093f>

< षि > .NFD .say
NFD:0x<0937 093f>

< षि > .NFKD .say
NFKD:0x<0937 093f>

See my previous comment for verification via another tool. (And ignore my bogus "gobbledygook" comment.)

Unless one specifies otherwise, Perl 6 normalizes a text string to NFC when it's not NFG.
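
The same check can be reproduced independently of Rakudo. This is a small Python sketch using the standard `unicodedata` module; it only confirms the code points for this one string and says nothing about which form Perl 6 applies before encoding.

```python
import unicodedata

s = "\u0937\u093f"  # the Devanagari example from this thread

# Normalize to each of the four standard forms and list the code points in hex.
forms = {
    form: [f"{ord(ch):04x}" for ch in unicodedata.normalize(form, s)]
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, code_points in forms.items():
    print(form, code_points)
# each of the four lines shows ['0937', '093f']
```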

Unicode, Perl 6, and You

Posted Dec 18, 2015 2:41 UTC (Fri) by raiph (guest, #89283) [Link]

> A UTF-8 string is a sequence of code points.
> UTF-8 is an encoding expressed as code points

I'll correct myself in case anyone else ever reads this thread. A UTF-8 string is an encoding in to bytes of a sequence of code points.

It's clearly time for me to stop posting in this thread. :)

Is the current Unicode design impractical?

Posted Dec 14, 2015 5:41 UTC (Mon) by raiph (guest, #89283) [Link]

(All "quoted text" below comes from Unicode.org docs. The rest is just aiui.)

> Extended grapheme cluster seems like a remarkably impractical basis to form the foundation of a language's default text processing system.

The Unicode standard uses the term grapheme to refer to a "user-perceived character". More concretely, if you can see the letter `a` (or `षि` to pick a fancier language) in this sentence, then, in Unicode terminology, you are seeing a grapheme. Graphemes are the atoms of written human language.

The phrase "grapheme cluster" was coined to reflect the fact that a grapheme "may not be just a single Unicode code point." This is fundamental to Unicode's design.

Extended grapheme clustering (EGC) is a particular set of rules for programmatically identifying grapheme clusters in arbitrary Unicode text. It is not adequate for all forms of text processing but it is both "recommended for general processing" and the simplest path to basic Unicode compliance in regard to graphemes.

> Besides the efficiency issues

Are you referring here to theoretical efficiency issues related to grapheme clusters? Or specific efficiency problems observed with actual existing software implementing grapheme clustering?

> no one spells or sorts that way in real life

?

If I'm sorting two strings "a" and "b" then I perceive the two distinct graphemes "a" and "b" and sort the former ahead of the latter because I use something akin to a rule that "a" comes before "b". It's all about identifying individual characters and sorting by first characters, then the second ones, and so on.

Relating this back to EGC:

* Depending on the language/locale, EGC may be sufficient for some higher level text processing functions such as defining an appropriate default lexicographic order for collation.

* The recommended path when this isn't the case is to create tailored grapheme clustering, which should be a customization of EGC, and/or add CLDR driven processing.

> Phoneme bears a much closer relationship to Unicode code point (for most languages)

A codepoint is just "Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF" (hex). This covers everything that goes in to "text" including control characters, diacritics, emojis, flags, you name it.

Some code points would work for establishing sort order. But in most such cases that's because the codepoint maps to a grapheme. Other code points such as a diacritic would be entirely inappropriate. More specifically, ones that don't correspond to a grapheme.

> The letter 'a' plus diaeresis for example should probably never be split at all, and it is arguably some kind of mistake that Unicode insists that most of these things be represented as code point sequences rather than defining precomposed, phonetic equivalent, lexicographically sortable codepoints in the first place.

When I first explored Unicode I too thought "why not just have code point = grapheme?"

One thing is for sure: the decision to distinguish code points and graphemes is a central characteristic of the Unicode standard.

> Then a language like Python that pragmatically insists on dealing with codepoints rather than grapheme clusters as the basis for string processing would be more straightforward to use for general purpose work, and the designers of languages like Perl 6 would perhaps not be so inclined to go to such extreme lengths to achieve a similar or more exotic effect.

Sure. But even if the Unicode consortium were open to this fundamental change (I don't think they are), technical limitations (code point range is limited to 0 to 10FFFF hex) or at least political limitations (which languages should fail to get graphemes for all their characters?) all but guarantee that this isn't going to happen.

Is the current Unicode design impractical?

Posted Dec 14, 2015 19:39 UTC (Mon) by butlerm (guest, #13312) [Link]

> Graphemes are the atoms of written human language.

This isn't true, as a cursory examination of a handful of relevant languages will establish. The atoms of any written phonetic language are phonemes - letters or letter equivalents. In a dozen or more languages, a grapheme is composed of multiple letters, is entered that way, sorted that way, and processed that way. If a series of letters happen to combine into a cluster for formatting purposes that is no more significant than a stylistic abbreviation or a ligature.

> Are you referring here to theoretical efficiency issues related to grapheme clusters? Or specific efficiency problems observed with actual existing software implementing grapheme clustering?

Suppose you want element n of a long string. Most languages use an index lookup for that sort of thing. If the elements are variable width, the entire string needs to be scanned or pre-processed to look up element n. If you pre-process, your elements are potentially arbitrary length strings in and of themselves, unless you impose a limit somewhere, of the sort that might lead to loss of information. With a limit your graphemes might be six or seven 'letters' max. Your string is then an array of variable length length-limited strings. Not exactly a prescription for efficiency.

> If I'm sorting two strings "a" and "b" then I perceive the two distinct graphemes "a" and "b" and sort the former ahead of the latter because I use something akin to a rule that "a" comes before "b".

That isn't true in a large number of languages, where graphemes are composed of multiple distinctly identifiable letters. 'ip' (입) and 'mal' (말) in a language like Korean for example are each one grapheme composed of three 'letters'. They are entered on a keyboard by letter, they sort in a dictionary by letter. No one sorts by grapheme, that is just a formatting convention. There is no point. Text processing is done by letter. The hangul equivalents of 'i', 'p', 'm','a', and 'l' respectively. There probably isn't a way to sort by grapheme without breaking the grapheme back up into letters because no one does it.

>Sure. But even if the Unicode consortium were open to this fundamental change (I don't think they are), technical limitations (code point range is limited to 0 to 10FFFF hex) or at least political limitations (which languages should fail to get graphemes for all their characters?) all but guarantee that this isn't going to happen.

I am not suggesting that anyone adopt codepoints for every possible grapheme. That would be comparable to suggesting that codepoints be adopted for every possible word. Sheer insanity. What would be useful is codepoints for every letter equivalent in actual use, so that NFC is complete. One codepoint per 'collation grapheme' is basically the same idea, albeit a bit of a misnomer.

That is probably mostly done anyway. It can't be done for ideographs, it has been done for Latin languages, it generally doesn't need to be done for languages that are already coded by letter-equivalent anyway, and so on.

Is the current Unicode design impractical?

Posted Dec 15, 2015 2:12 UTC (Tue) by raiph (guest, #89283) [Link]

> > Graphemes are the atoms of written human language.
>
> This isn't true, as a cursory examination of a handful of relevant languages will establish. The atoms of any written phonetic language are phonemes - letters or letter equivalents.

Aiui the Unicode standard's definition of grapheme is based on the normal definition. Quoting wikipedia:

> A grapheme is the smallest unit used in describing the writing system of any given language, originally coined by analogy with the phoneme of spoken languages.

I can't tell which one of us has misunderstood Unicode's use of grapheme but at least one of us has. :)

> Suppose you want element n of a long string.

Sure. Perl 6 indexing for such things (whether grapheme or codepoints or bytes) is O(1).

> If the elements are variable width

So don't do that.

> If you pre-process, your elements are potentially arbitrary length strings in and of themselves, unless you impose a limit somewhere, of the sort that might lead to loss of information. With a limit your graphemes might be six or seven 'letters' max.

The only limits I know of with MoarVM's NFG implementation are related to fairly extreme buffering scenarios and avoidance of totally filling RAM. Neither are as you suggest.

> Your string is then an array of variable length length-limited strings. Not exactly a prescription for efficiency.

Nah. NFG doesn't make that mistake.

> What would be useful is codepoints for every letter equivalent in actual use, so that NFC is complete. One codepoint per 'collation grapheme' is basically the same idea, albeit a bit of a misnomer.

But isn't that pretty much precisely what NFG is?

Is the current Unicode design impractical?

Posted Dec 15, 2015 18:16 UTC (Tue) by butlerm (guest, #13312) [Link]

Sorry, I should have said "Extended Grapheme Cluster" here. Per the article Perl 6 is slated to use EGCs as the base unit of string processing. That is a sequence of up to five letters in a language like Korean.

> Sure. Perl 6 indexing for such things (whether grapheme or codepoints or bytes) is O(1).

Glad to hear that.

> But isn't that pretty much precisely what NFG is?

Not really. EGCs are not letter-equivalents. They are more like a mixture of letter equivalents and syllable equivalents. In the latter case I believe that is a mistake in all cases where the syllables are naturally understood, processed, and coded as a series of independent letters. Perhaps the users of the languages in question feel differently.

Is the current Unicode design impractical?

Posted Dec 16, 2015 4:15 UTC (Wed) by raiph (guest, #89283) [Link]

OK. So, to repeat your earlier description of a recipe for success:

>>> What would be useful is codepoints for every letter equivalent in actual use, so that NFC is complete. One codepoint per 'collation grapheme' is basically the same idea, albeit a bit of a misnomer.

As I think I've explained already, I long ago concluded that something like that (but it may need to be more realistic; see next paragraph) wasn't technically and politically viable. But I'd love to discover that I'm wrong. Is there a credible movement within the Unicode consortium, or outside of it, to make something like that happen?

Aiui, there are other graphemes than graphemes for collation. Aiui the Unicode notion of grapheme ("minimally distinctive unit of writing") embodies the notion that the unit varies according to which string operation is being done. Collation is one operation, and implies collation units, but there are other operations that work on different units. So you'd need a set of 'collation grapheme' codepoints and a set of 'UI selection grapheme' codepoints and so on for various operations.

Additionally, these can vary per language and locale.

----------------------

If we focus on the desirability of turning all graphemes (just all 'collation graphemes' if that is sufficient and desirable) in to code points, the big two questions are:

Q1. Is enough unused (or perhaps poorly used and considered reclaimable) code point space (with some breathing room) still available to assign these various grapheme codepoints for all pertinent graphemes? (Perhaps just 'collation graphemes'. Or 'UI selection graphemes' etc. also if deemed necessary/desirable.)

(I thought the answer to this first question was no but I drew that conclusion many years ago when I was even more confused about Unicode than I am now and relied on the Unicode specification documents themselves and on what others said in the Unicode.org fora.)

Q2. Is there sufficient goodwill and motivation among Unicode consortium members and technical participants to successfully carry out the (perhaps arduous?) process of deciding which graphemes are pertinent (some graphemes not yet in use could reasonably be argued to be pertinent etc.)?

----------------------

Aiui NFG (Normalization Form Grapheme) is intended to (one day, when enough grapheme clustering work has been done) deliver pretty much exactly the same results (modulo implementation complexity and performance challenges) but without requiring the additional code point assignments you propose or any other change to the Unicode standard.

----------------------

>> But isn't that pretty much precisely what NFG is?

> Not really. EGCs are not letter-equivalents.

Aiui the Perl 6 NFG design concept is more about creating a simple user experience for dealing with graphemes and grapheme clustering for all string and substring operations, not so much specifically UI-selection-units, say, or EGC. But, aiui, the Unicode standard recommends EGC as the starting point for "general processing" and that's the starting point chosen by Rakudo devs, even if it means one needs to use the `.NFC` method on strings in the interim to get desirable sorting of Hangul text.

> Perhaps the users of the languages in question feel differently.

Fwiw I'd say Larry is a champion of humans over tech and of listening to the needs and wants of ordinary folk. He especially cares a lot about written human languages. Iirc, before he settled on doing stuff like inventing Perl he was planning to go to Africa to transcribe into written form some native languages that had no written form for christ's sake. Have you chatted with Larry about this stuff? He's friendly and fun to talk to. Why not visit the freenode IRC channel #perl6 and use the online evalbots to demonstrate the problems with sorting Hangul text?

If the Perl 6 project decides/realizes that NFG is a mistake, either because it reflects a deep misunderstanding of Unicode, or because Unicode changes direction, or because people en masse reject the grapheme architecture of full Unicode, Perl 6 will presumably recover but it'll be quite the blow.

(Swift, which has also adopted character=grapheme, will presumably be in the same boat. Indeed, I'm curious if you have pretty much the same view of Swift's string/character types.)

Conversely if NFG is basically the right idea, then it's plausible it will mature into one of several anticipated advantages of Perl 6 over older programming languages.

----------

PS. I apologize for what looks like a mistake in my previous comment in this thread branch:

>> Your string is then an array of variable length length-limited strings. Not exactly a prescription for efficiency.

> Nah. NFG doesn't make that mistake.

'Nah.'? Apologies for my hypnagogic vernacular. :)

I think you were right. Aiui an NFG string is an array of 32-bit values. Each value is either a (positive integer) single codepoint or a (negative integer) synthetic that encodes a pointer (and length?) indirection to a string (code point sequence). Which I think means more or less the same as you were saying. My apologies.
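
To make the layout described above concrete, here is a toy Python sketch. It is not MoarVM's implementation: the names `to_toy_nfg` and `from_toy_nfg` are hypothetical, the side table indexed by negative numbers is an assumption about how the indirection might work, and the clustering rule (a base character plus any following combining marks) is a deliberate simplification of real extended grapheme clustering.

```python
import unicodedata

def to_toy_nfg(text):
    """Encode text as a list of ints: code points >= 0, synthetics < 0."""
    text = unicodedata.normalize("NFC", text)
    # Simplified clustering: a base character plus following combining marks.
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    table = []    # synthetic number -1, -2, ... maps to table[0], table[1], ...
    lookup = {}   # code point sequence -> synthetic number already assigned
    encoded = []
    for cluster in clusters:
        if len(cluster) == 1:
            encoded.append(ord(cluster))       # plain code point
        else:
            if cluster not in lookup:
                table.append(cluster)
                lookup[cluster] = -len(table)  # new negative synthetic
            encoded.append(lookup[cluster])
    return encoded, table

def from_toy_nfg(encoded, table):
    """Expand synthetics back into their code point sequences."""
    return "".join(chr(v) if v >= 0 else table[-v - 1] for v in encoded)

encoded, table = to_toy_nfg("n\u0305o")  # n + combining overline, then o
print(encoded, [len(entry) for entry in table])
```

Every element is a fixed-width integer, so indexing by "character" stays O(1); the variable-length part lives only in the side table, which is more or less the trade-off discussed above.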

Is the current Unicode design impractical?

Posted Dec 16, 2015 21:28 UTC (Wed) by butlerm (guest, #13312) [Link]

I appreciate the suggestion and the explanations, thanks. It helps to have some idea of how this is going to be implemented internally.

My general point is that NFD, NFC, and NFG are awkward forms for operations like string matching. NFD combines too little, NFC combines too much in some cases and not enough in others, and NFG combines too much. As a consequence a simple regular expression in NFC, NFD, or NFG cannot tell which strings in an arbitrary language start with a separable letter-equivalent.

NFC and NFD can tell you which words start with "ä" but they cannot easily tell you which words start with 'ㄱ' (Hangul kiyok, letter K roughly). NFG cannot directly answer the latter question either.
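
A small Python sketch of the point, using only the standard `unicodedata` module. The helper name `starts_with_jamo` is hypothetical, and matching through NFKD is just one possible workaround, not something the normalization forms are meant for. Note that the 'ㄱ' most people type is the compatibility jamo U+3131, while NFD produces the conjoining jamo U+1100.

```python
import unicodedata

def starts_with_jamo(word, jamo):
    """Compare after NFKD, which maps compatibility jamo to conjoining jamo."""
    decomposed_word = unicodedata.normalize("NFKD", word)
    decomposed_jamo = unicodedata.normalize("NFKD", jamo)
    return decomposed_word.startswith(decomposed_jamo)

ga = "\uAC00"      # 가, one code point in NFC
kiyeok = "\u3131"  # ㄱ, Hangul Compatibility Jamo

nfd = unicodedata.normalize("NFD", ga)
print([hex(ord(c)) for c in nfd])     # the conjoining jamo the syllable splits into
print(nfd.startswith(kiyeok))         # plain NFD does not match U+3131
print(starts_with_jamo(ga, kiyeok))   # the NFKD comparison does
```

NFKD also folds many unrelated compatibility characters, so this is a blunt instrument for anything beyond a demonstration.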

A hypothetical "NFP" would work better where separable letters and diacritical marks were combined and separable letters in syllabic forms were split out. Of course to do that accurately one would either need custom internal code points or variable length elements, just like NFG. Doing it inaccurately could be done with a simple hybrid of NFC and NFD.
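
The "simple hybrid" could look something like the following Python sketch. The function name `nfp_approx` is hypothetical, and the rule used here (compose everything with NFC, then split only precomposed Hangul syllables back into jamo) is just one guess at what an inaccurate first cut might do; it says nothing about other scripts.

```python
import unicodedata

HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3  # precomposed Hangul syllable block

def nfp_approx(text):
    """Keep letter+diacritic composed, but split Hangul syllables into jamo."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if HANGUL_FIRST <= ord(ch) <= HANGUL_LAST:
            out.append(unicodedata.normalize("NFD", ch))  # syllable -> jamo
        else:
            out.append(ch)                                # e.g. ä stays one unit
    return "".join(out)

mixed = nfp_approx("a\u0308\uAC00")  # a + combining diaeresis, then 가
print([hex(ord(c)) for c in mixed])
```

The result is not a standard normalization form, so any later NFC pass would recompose the jamo; a real design would need the string to remember which form it is in.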

I also wonder whether it would be worthwhile for Perl 6 to have an NFC mode, despite the limitations, if not something like "NFP" in either limited or extended form.

Is the current Unicode design impractical?

Posted Dec 17, 2015 7:59 UTC (Thu) by raiph (guest, #89283) [Link]

> NFC and NFD can tell you which words start with "ä" but they cannot easily tell you which words start with 'ㄱ' (Hangul kiyok, letter K roughly).
>
> NFG cannot directly answer the latter question either.

say "ㄱfoo".starts-with("ㄱ")
True

say "ㄱbc def gㄱi ㄱkl".words.grep(*.starts-with("ㄱ"))
(ㄱbc ㄱkl)

say "0123ㄱ56789".index("ㄱ")
4

say "01234fooㄱbar56789".comb
(0 1 2 3 4 f o o ㄱ b a r 5 6 7 8 9)

say S[ㄱ] = 4 with "0123ㄱ56789"
0123456789

----

This would be a good time to ask: Are you still 100% sure EGC is 100% wrong for sorting Hangul text?

----

Your NFP sounds to me like Perl 6's NFG by another name.

----

Thank you for this exchange so far.

Would you be willing to join the public freenode IRC #perl6 channel for a few minutes to discuss processing of Hangul text and to try to nail down what we've discussed in this thread, e.g. sorting of notably awkward Hangul text strings?

Here's the web page I use to join #perl6:
https://kiwiirc.com/client/irc.freenode.net/perl6

You or others can type simple statements like `m: say "0123ㄱ56789".index("ㄱ")` to be evaluated by the in-channel "Camelia" evalbot. There are bound to be bugs and other issues that you can unearth. It would be wonderful to see someone appear in the IRC logs doing Hangul text evals and to see appropriate tickets getting added to the bug tracker. :)

Here's the #perl6 log if you want to see examples of evalbot use and to see what sort of crazies you'd be joining: http://irclog.perlgeek.de/perl6/today

If there is anything I can do to make it more comfortable and convenient for you to join #perl6, please let me know.

Is the current Unicode design impractical?

Posted Dec 18, 2015 21:09 UTC (Fri) by butlerm (guest, #13312) [Link]

> This would be a good time to ask: Are you still 100% sure EGC is 100% wrong for sorting Hangul text?

As it happens, NFC will sort most Hangul text correctly because it is designed that way. They allocated 11,172 codepoints for the purpose. And Korean users these days really do collate syllable by syllable rather than as a string of letters or jamo, which makes that sort of thing easy as long as you are dealing with a modern combination. Older, historical combinations do not have their own code points - they are encoded as a series of letters, which is a little bit tricky when it comes time to determine the boundaries between syllables.
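
For reference, the modern syllable block is laid out arithmetically: 19 leading consonants × 21 vowels × 28 trailing slots (including "no trailing consonant") gives 11,172 code points starting at U+AC00. A Python sketch of that arithmetic, with hypothetical helper names `compose` and `decompose`:

```python
S_BASE = 0xAC00                          # first precomposed syllable, 가
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28   # leading, vowel, trailing (0 = none)

def compose(l, v, t=0):
    """Build a precomposed syllable from leading/vowel/trailing indices."""
    return chr(S_BASE + (l * V_COUNT + v) * T_COUNT + t)

def decompose(syllable):
    """Recover the (leading, vowel, trailing) indices of a modern syllable."""
    index = ord(syllable) - S_BASE
    l, rest = divmod(index, V_COUNT * T_COUNT)
    v, t = divmod(rest, T_COUNT)
    return l, v, t

print(L_COUNT * V_COUNT * T_COUNT)   # size of the block
print(decompose("\uD55C"))           # indices for 한
print(sorted(["\uB2E4", "\uAC00", "\uB098"]))  # plain code point sort
```

Because the leading consonant is the most significant part of the index, a plain code point sort of NFC text orders modern syllables by leading consonant, then vowel, then trailing consonant. Historical combinations fall outside the block, so this shortcut does not cover them.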

The issue is that for regular expression purposes it often makes more sense to match on individual letters, more like NFD, rather than on syllables, more like NFC / NFG. And then you practically need two Unicode string subtypes - combined and uncombined. Or even better three: combined, uncombined, and partially combined. NFP here is "partially combined" Unicode-wise, although I was thinking more along the lines of phoneme by phoneme.

The idea is to combine things like diacritics that you really don't want to split out and not combine things like syllables composed of a separate series of letters. And that is the major issue with saying all Unicode strings are sequences of EGCs - it is not always convenient or what you want. Sometimes the division is important.

So if you go to all the trouble to allow elements of a string to include more than one codepoint, perhaps it would be much better to leave what is combined and what isn't up to the user on a string by string basis. Allow programs to convert a string to NFG form, NFC form, NFD form, NFP form, or an unspecified form and have the result be able to be processed as such. That may already be the plan, but it is important.

I will think about trying this out, as a matter of curiosity if nothing else. Thanks.

Is the current Unicode design impractical?

Posted Dec 20, 2015 19:37 UTC (Sun) by raiph (guest, #89283) [Link]

Thanks, and 행운을 빕니다 [1]

Recapping, this exchange began with an assertion that EGC is problematic as a general basis for segmenting text into characters. The exemplar was sorting Hangul (Korean) text.

Aiui:

> So if you go to all the trouble to allow elements of a string to include more than one codepoint

Text processing systems implementing Unicode compliant support for programmatic distinction of "what a user thinks of as a character" (quoting Unicode.org) in arbitrary Unicode text must deal with characters that include more than one codepoint.[2]

> perhaps it would be much better to leave what is combined and what isn't up to the user on a string by string basis.

It needs to be more fine grained than that in the sense that, according to the Unicode standard, the right way to segment any given string depends on what operation you're applying to that string as well as in what language/locale.

So the same string might, at least in principle, be segmented in one way for sorting and another for regexing under one locale setting and then in third and fourth ways with another locale setting in effect.

> [perhaps one should] Allow programs to convert a string to NFG form, NFC form, NFD form, NFP form, or an unspecified form and have the result be able to be processed as such.

Unicode calls this notion of arbitrary (unspecified) forms "tailored grapheme clusters" (quoting Unicode.org).

http://cldr.unicode.org/development/development-process/d... may provide more insight. Among the bugs listed at the end, note in particular http://unicode.org/cldr/trac/ticket/2142 for "Alternate Grapheme Clusters" that was filed 7 years ago and switched from new to accepted status 7 months ago.

NFG is intended to be a general conceptual and implementation mechanism dealing with grapheme clustering in Perl 6 and Rakudo. If the current implementation of NFG doesn't already embrace customizable grapheme clustering similar to the aforementioned Alternate Grapheme Clusters, it surely will when the limits of EGC become clear and someone tackles the work that needs to be done.

I see your NFP idea[3] as an example of an alternate grapheme clustering which would be an NFG variant if it were implemented.

> I will think about trying this out, as a matter of curiosity if nothing else. Thanks.

If you find time I suggest you consider trying at least two text processing systems or programming languages whose developers *explicitly aspire to try to get grapheme clustering right* so you can compare them. Afaik the most mature ones are the CLDR / ICU projects and systems that build on them. The Rakudo Perl 6 compiler implementation[4] does *not* (directly) use CLDR/ICU; it may be a particularly interesting exercise to compare Rakudo results with those of another programming language which does use CLDR/ICU and claims to do grapheme clustering.

----

[1] Good luck in Korean according to translate.google.com :)

[2] One cannot support programmatic manipulation of "what a user thinks of as a character" (quoting Unicode.org) in a Unicode compliant manner unless one does indeed go to all the trouble of allowing elements of a string to include more than one code point. No mainstream programming language has attempted to make this simple until very recently but that should not be taken as evidence that it isn't a functional requirement for mature Unicode compliance.

(Btw, you earlier suggested "codepoints for every letter equivalent in actual use, so that NFC is complete". Unfortunately "normalization form NFC (the composed form favored for use on the Web) is frozen" (quoting Unicode.org). So while they've frozen in place a set of precomposed graphemes for contemporary Hangul making the NFC shortcut work out, this approach breaks down when applied generally for arbitrary Unicode text.)

[3] (For other readers, NFP is a mooted Normalization Form Phoneme for use when sorting text, at least Hangul text.) butlerm, my review of Unicode.org documents triggered by this exchange has increased my confidence that a "grapheme" per the Unicode spec is a completely general concept ("A minimally distinctive unit of writing in the context of a particular writing system") and that the notion of a "formatting" unit that you've mentioned is just one example of a grapheme and that other units such as one corresponding to a phoneme are, in Unicode parlance, just graphemes of different flavors.

[4] The reference Perl 6 compiler Rakudo implements grapheme clustering via its MoarVM (http://moarvm.org) backend.

Unicode, Perl 6, and You

Posted Dec 21, 2015 13:42 UTC (Mon) by oldtomas (guest, #72579) [Link]

> Phoneme bears a much closer relationship to Unicode code point (for most languages) [...]

Oh, au contraire... (try to decompose that into phonemes and you'll see ;-) And no, English is no better. It's just that we offloaded all that processing to some part of our brain where we are barely aware of it.

> The letter 'a' plus diaeresis for example should probably never be split at all [...]

Let's take the letter u with diaeresis. It exists in several languages living in the tiny West appendix of Europe: German, French and Spanish (and some others, like Catalan). But it has a very different meaning!

Whereas in German it's a shorthand for a diphthong "ue" (think of it as a funky ligature), that is: it collates (traditionally) like "ue"; you transcribe it as "ue" whenever the little dots aren't available, etc. -- in French and Spanish (and Catalan...) it's just a "decorated u", meaning "this is just a u, but you gotta pronounce it loudly although the context would call for a mute letter". You'd collate it as a "u" and transcribe it as a "u", should the little dots be missing in your typesetter's box.

This is especially nasty for iconv when converting to ASCII, since it would have to know whether it's dealing with a German "ü" or a French "ü".
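
A Python sketch of why the converter needs to be told the language. The function `transliterate`, the language tags, and the small German table are all hypothetical illustrations, not how iconv works; the fallback simply drops combining marks after NFD.

```python
import unicodedata

# Assumed, incomplete table of traditional German transcriptions.
GERMAN_EXPANSIONS = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss",
}

def transliterate(text, lang):
    """Expand umlauts for German; otherwise just strip the diacritics."""
    text = unicodedata.normalize("NFC", text)
    if lang == "de":
        text = "".join(GERMAN_EXPANSIONS.get(ch, ch) for ch in text)
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks so a "decorated u" becomes a plain u.
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)

print(transliterate("Müller", "de"))
print(transliterate("pingüino", "es"))
```

The same code point U+00FC gets two different treatments, and nothing in the string itself says which one applies; the result is also not guaranteed to be ASCII for letters that have no decomposition.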

And this is just a little, trivial example which fits totally within the Latin-1 character set...

Unicode, Perl 6, and You

Posted Dec 25, 2015 0:24 UTC (Fri) by micka (subscriber, #38720) [Link]

> Oh, au contraire... (try to decompose that into phonemes and you'll see ;-)

(Not saying anything about whether you're right or wrong...)
"Oh" and "Au" are different sounds in French, but, strangely, Wiktionary and Larousse give the same phonetic representation. I have to reconsider a prejudice I had, that phonetic representation was in bijection with actual sounds.


Copyright © 2015, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds