
The future of Emacs, Guile, and Emacs Lisp

Posted Oct 9, 2014 8:23 UTC (Thu) by ncm (guest, #165)
Parent article: The future of Emacs, Guile, and Emacs Lisp

Migrating Guile to a UTF-8 internal string representation reads like an overwhelmingly good idea, whatever the impetus for considering it. It is possible, in principle, that the current representation is better, but that seems exceedingly unlikely.



The future of Emacs, Guile, and Emacs Lisp

Posted Oct 9, 2014 22:34 UTC (Thu) by smurf (subscriber, #17840)

UTF-8 has one disadvantage: it is slightly more complex to find the nth next (or previous) character, which matters for the speed of pattern matching in some cases.
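
For concreteness, here is a minimal sketch in C (Guile's implementation language) of what that lookup costs; the helper name is my own invention, not anything from Guile's code. With a fixed-width encoding the same step is a single pointer addition; with UTF-8 you have to walk the bytes and skip continuation bytes:

    #include <stddef.h>

    /* Illustrative sketch, not Guile code: step forward n codepoints in a
     * NUL-terminated UTF-8 string by skipping 0b10xxxxxx continuation bytes. */
    static const char *utf8_skip_codepoints(const char *s, size_t n)
    {
        while (n > 0 && *s != '\0') {
            s++;                            /* past the lead (or ASCII) byte */
            while ((*s & 0xC0) == 0x80)     /* skip continuation bytes */
                s++;
            n--;
        }
        return s;
    }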

However, it has the distinct advantage that your large ASCII text does not suddenly need four times the storage space just because you insert a character with a smiling kitty face.

The future of Emacs, Guile, and Emacs Lisp

Posted Oct 9, 2014 22:49 UTC (Thu) by mjg59 (subscriber, #23239)

Combining characters mean you're going to take a hit with Unicode whatever the representation.
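
A small illustration of that (my own, not from the thread): "é" can be stored precomposed as U+00E9 or decomposed as "e" plus the combining acute U+0301. The two forms render identically but differ in length and compare unequal in any encoding unless the program normalizes first:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *precomposed = "\xC3\xA9";   /* U+00E9 in UTF-8        */
        const char *decomposed  = "e\xCC\x81";  /* U+0065 U+0301 in UTF-8 */

        /* Prints "bytes: 2 vs 3, equal: 0" -- same visible character,
         * different code units and codepoints in either form. */
        printf("bytes: %zu vs %zu, equal: %d\n",
               strlen(precomposed), strlen(decomposed),
               strcmp(precomposed, decomposed) == 0);
        return 0;
    }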

The future of Emacs, Guile, and Emacs Lisp

Posted Oct 10, 2014 16:08 UTC (Fri) by lambda (subscriber, #40735)

Except that when pattern matching UTF-8, you can generally just match on the bytes (code units) directly rather than on the characters (codepoints); the algorithms that need to skip ahead by a fixed number of characters are mostly the exact string matching algorithms like Boyer-Moore and Knuth-Morris-Pratt, and there is no reason to require that those run on codepoints instead of bytes.

If you're doing regular expression matching with Unicode data, then even with UTF-32 you will need to consume variable-length sequences as single characters, since decomposed characters have to match as a single character.

People always bring up the lack of constant-time codepoint indexing when UTF-8 is mentioned, but I have never seen an example where you actually need to index by codepoint that doesn't either break in the face of other issues, like combining sequences, or turn out to be solvable with code-unit indexing instead.
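
As a concrete sketch of that point (my own example, nothing Guile-specific): because UTF-8 is self-synchronizing, a byte-level match of a valid UTF-8 needle can only start on a codepoint boundary, so plain code-unit search already does the right thing without decoding anything:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *haystack = "na\xC3\xAFve caf\xC3\xA9";  /* "naïve café" */
        const char *needle   = "caf\xC3\xA9";               /* "café"       */

        /* Byte-oriented search; the hit is necessarily codepoint-aligned.
         * Prints "found at byte offset 7". */
        const char *hit = strstr(haystack, needle);
        printf("found at byte offset %td\n", hit ? hit - haystack : -1);
        return 0;
    }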

The future of Emacs, Guile, and Emacs Lisp

Posted Oct 12, 2014 6:12 UTC (Sun) by k8to (guest, #15413)

This view dates back to the time when UCS-2 (whatever its name was then) was a fixed-size encoding, or to when the predecessor of UTF-32 was. As you point out, both of those eras have passed.

It's a little more tedious to cut a UTF-8 string safely at a size computed in bytes than in some other encodings, but not much more, and that's very rarely a fast path.
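
For what it's worth, a minimal sketch of that cut in C, assuming a plain byte buffer (the helper is hypothetical, not a Guile API): clamp to the byte budget, then back up over continuation bytes so no codepoint is split. Combining sequences, as noted elsewhere in the thread, would still need separate handling.

    #include <stddef.h>

    /* Illustrative sketch: largest cut point <= max_bytes that does not
     * split a UTF-8 sequence in the buffer s of length len. */
    static size_t utf8_safe_cut(const char *s, size_t len, size_t max_bytes)
    {
        if (max_bytes >= len)
            return len;                              /* nothing to cut */
        size_t cut = max_bytes;
        while (cut > 0 && (s[cut] & 0xC0) == 0x80)   /* cutting here would split a sequence */
            cut--;                                   /* back up to the previous lead byte */
        return cut;
    }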

The future of Emacs, Guile, and Emacs Lisp

Posted Oct 14, 2014 17:29 UTC (Tue) by Trelane (subscriber, #56877)

To my mind, the biggest problem is not multibyte characters but rather combining characters.

