The future of Emacs, Guile, and Emacs Lisp

Posted Oct 10, 2014 16:08 UTC (Fri) by lambda (subscriber, #40735)
In reply to: The future of Emacs, Guile, and Emacs Lisp by smurf
Parent article: The future of Emacs, Guile, and Emacs Lisp

Except, when pattern matching UTF-8, you can generally just match on the bytes (code units) directly, rather than on the characters (codepoints); the algorithms that need to skip ahead by a fixed n characters are generally the exact string matching algorithms like Boyer-Moore and Knuth-Morris-Pratt. There's no reason to require that those be run on the codepoints instead of on the bytes.

If you're doing regular expression matching with Unicode data, even if you use UTF-32, you will need to consume variable length strings as single characters, as you can have decomposed characters that need to match as a single character.

People always bring up lack of constant codepoint indexing when UTF-8 is mentioned, but I have never seen an example in which you actually need to index by codepoint, that doesn't either break in the face of other issues like combining sequences, or can't be solved by just using code unit indexing.

The future of Emacs, Guile, and Emacs Lisp

Posted Oct 12, 2014 6:12 UTC (Sun) by k8to (guest, #15413) [Link]

This view dates back to a time when UCS-2 was fixed size (whatever its name was then) or then when the predecessor for UTF32 was fixed size. As you point out, both of those eras passed.

It's a little more tedious to CUT a UTF8 string safely based on a size computed in bytes than in some other encodings, but not much more, and that's very rarely a fast path.