Fedora opens up to bundling

Posted Oct 22, 2015 9:34 UTC (Thu) by Wol (subscriber, #4433)
In reply to: Fedora opens up to bundling by cry_regarder
Parent article: Fedora opens up to bundling

And then there's Unicode ... :-)

I use lilypond. Which is still stuck on Guile v1.

"cat input | guile > output"

Assume all guile does here is read from stdin and copy to stdout. Unfortunately, Guile v2 breaks the expectation that "input == output", and this breaks lilypond :-( in very subtle and hard-to-fix ways :-(

So this is another, pretty classic, example of a program with bundled dependencies because mainstream does not (or no longer does) provide features on which it relies.

For info, as I understand it, Guile v2 converts deprecated code points on the fly to up-to-date ones. All composite characters (e.g. a-umlaut, e-acute, etc.) are now deprecated and should be <compose><accent><letter> or whatever the correct version is. Lilypond assumes that a character offset in the input will point to the same character in the output, and of course if any of these conversions has happened, that assumption breaks :-(

Cheers,
Wol

guile and unicode

Posted Oct 22, 2015 13:26 UTC (Thu) by wingo (guest, #26929)

Not sure why Guile is getting the blame here, but to clear things up:

Before Guile 2.0, Guile's strings were byte strings -- like in Python 2. Guile 2 changed this so that strings are composed of characters. To Guile, a character is a Unicode codepoint.
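
To make the byte-string versus character-string distinction concrete, here is a minimal sketch in Python 3 rather than Guile. It is only an analogue: `bytes` stands in for a Guile 1.x string and `str` for a Guile 2.x string, and the variable names are made up for illustration.

```python
# Python analogue only: bytes plays the role of a Guile 1.x string,
# str the role of a Guile 2.x string.
text = "\u00e9"                    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
utf8_bytes = text.encode("utf-8")  # the same text as a UTF-8 byte sequence

byte_length = len(utf8_bytes)      # counts bytes
codepoint_length = len(text)       # counts codepoints

print(byte_length, codepoint_length)
```

The same text has different lengths (and therefore different offsets) depending on whether it is counted in bytes or in codepoints.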

When reading strings from a byte stream, as from an fd, those characters have an encoding, which is usually taken from your locale. In some encodings, like ISO-8859-1, all byte sequences are valid, so reading data in and writing it out will produce the same byte sequence. In others, like UTF-8, Guile might read an invalid sequence. In that case it can error, or replace the character with "?", depending on what the application chose to do. Likewise when writing, Guile might try to write a codepoint that can't be expressed in the desired encoding, at which point it can error or write a "?". The application decides what strategy to take.
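
The decode and encode strategies described above can be sketched as follows. This is Python 3, not Guile, so it only illustrates the general idea; the helper names are hypothetical, and note that Python's "replace" strategy substitutes U+FFFD when decoding rather than "?".

```python
# Python analogue of the decode/encode behaviour described above (not Guile code).
raw = bytes(range(256))  # every possible byte value

# ISO-8859-1 maps each byte to exactly one codepoint, so a round trip is lossless.
latin1_roundtrip = raw.decode("iso-8859-1").encode("iso-8859-1")

invalid_utf8 = b"abc\xff"  # the byte 0xFF never appears in valid UTF-8


def decode_strict(data: bytes) -> str:
    """Strategy 1: raise an error on an invalid sequence."""
    return data.decode("utf-8", errors="strict")


def decode_replace(data: bytes) -> str:
    """Strategy 2: substitute a replacement character (U+FFFD in Python)."""
    return data.decode("utf-8", errors="replace")


# Writing direction: U+20AC EURO SIGN cannot be expressed in ISO-8859-1,
# so with the "replace" strategy the encoder writes "?" instead.
encoded_replace = "\u20ac".encode("iso-8859-1", errors="replace")
```

With valid input and a matching encoding, neither strategy is triggered and the bytes written equal the bytes read.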

There is no such thing as a deprecated codepoint.

guile and unicode

Posted Oct 22, 2015 17:53 UTC (Thu) by Wol (subscriber, #4433)

> There is no such thing as a deprecated codepoint.

AIUI, there is a Unicode character for a-acute. There is also the sequence <compose><acute><a>. What are you going to do when one string uses one encoding, and another string uses the other? Apparently, the Unicode spec now says you are supposed to use the <compose> sequence, which Guile v2 implements.

Hence lilypond blowing up when what it thinks is a string COPY is turned by v2 into a string TRANSLATION :-( Please note that BOTH the input AND the output in this case are not some random encoding, but are quite explicitly Unicode character strings.

Cheers,
Wol

guile and unicode

Posted Oct 22, 2015 20:18 UTC (Thu) by wingo (guest, #26929)

This is a bit far afield of the original article, but you persist in a misunderstanding about a project that I maintain :) To Guile 1.x, a character is a byte. To Guile 2.x, a character is a Unicode codepoint: not a grapheme.

So when Guile reads a byte sequence which, according to the given locale, it decodes as U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT, those are the code points it stores internally. It does not normalize the codepoint sequence, although there are the string-normalize-nfc, string-normalize-nfd, string-normalize-nfkc, and string-normalize-nfkd procedures if the application chooses to normalize, for whatever reason.
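
For readers who want to see the codepoint/grapheme distinction in action, here is a small sketch using Python 3's `unicodedata` module as a stand-in for Guile's string-normalize-* procedures. It is an analogue, not Guile code, and the sample string is invented for illustration.

```python
import unicodedata

decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE

# One grapheme on screen, but two codepoints in the string.
codepoints = [f"U+{ord(c):04X}" for c in decomposed]

# A plain decode/encode round trip stores and emits the sequence unchanged.
roundtrip = decomposed.encode("utf-8").decode("utf-8")

# Normalization only happens when the application asks for it.
nfc = unicodedata.normalize("NFC", decomposed)   # composes to U+00E9
nfd = unicodedata.normalize("NFD", precomposed)  # decomposes to U+0065 U+0301

# Explicit normalization does shift codepoint offsets.
sample = "caf" + decomposed + "s"
offset_before = sample.index("s")
offset_after = unicodedata.normalize("NFC", sample).index("s")
```

Offsets move only after the explicit normalize call; a plain read and write leaves the codepoint sequence, and so the offsets, untouched.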