
"Unicode"

Posted Oct 31, 2009 0:50 UTC (Sat) by foom (subscriber, #14868)
In reply to: "Unicode" by nix
Parent article: Proposal: Moratorium on Python language changes

> Go from O(1) to O(n) on individual codepoint access and suddenly O(n)
> stuff on strings goes to O(n^2) and so on: not remotely good.

That clearly only happens if your language doesn't have such a thing as iterators.
"increment(character_iterator)" is still O(1) even if your underlying representation is
UTF-8. The ability to access an arbitrarily numbered Unicode codepoint in a string in
constant time isn't really all that useful.
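
To make the iterator point concrete, here is a minimal sketch in Python, assuming the input is valid UTF-8. The names `next_offset` and `iter_codepoints` are made up for illustration. Each step reads only the lead byte, so advancing costs the same no matter how long the string is.

```python
def next_offset(buf: bytes, i: int) -> int:
    """Return the byte offset just past the code point that starts at offset i."""
    lead = buf[i]
    if lead < 0x80:            # 0xxxxxxx: 1-byte sequence (ASCII)
        return i + 1
    if lead >> 5 == 0b110:     # 110xxxxx: 2-byte sequence
        return i + 2
    if lead >> 4 == 0b1110:    # 1110xxxx: 3-byte sequence
        return i + 3
    if lead >> 3 == 0b11110:   # 11110xxx: 4-byte sequence
        return i + 4
    # A continuation byte (10xxxxxx) or invalid lead byte lands here.
    raise ValueError(f"offset {i} is not the start of a code point")


def iter_codepoints(buf: bytes):
    """Yield each code point of a valid UTF-8 buffer as a one-character str."""
    i = 0
    while i < len(buf):
        j = next_offset(buf, i)   # constant work per step
        yield buf[i:j].decode("utf-8")
        i = j
```

Walking the whole string this way stays O(n) overall, which is the case the quoted complaint worried about.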

Unicode codepoints don't really correspond to anything humans care about... splitting
a string in the middle of a grapheme is really just as bad as splitting it in the middle of
a UTF-8 byte sequence.
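
A small Python sketch of both kinds of bad split, for illustration only. The helper names are hypothetical, and the `unicodedata.combining` check is a rough stand-in for real grapheme cluster boundaries (it catches combining marks but not every case a full segmentation algorithm handles).

```python
import unicodedata


def splits_grapheme(text: str, k: int) -> bool:
    """Rough check: does cutting text at code point index k separate a
    combining mark from the base character before it?"""
    return 0 < k < len(text) and unicodedata.combining(text[k]) != 0


def splits_utf8_sequence(buf: bytes, k: int) -> bool:
    """Does cutting buf at byte offset k land on a continuation byte (10xxxxxx)?"""
    return 0 < k < len(buf) and (buf[k] & 0xC0) == 0x80


word = "cafe\u0301"                    # 'e' followed by COMBINING ACUTE ACCENT
encoded = "caf\u00e9".encode("utf-8")  # precomposed e-acute is two bytes: C3 A9
```

Here `word[:4]` is a perfectly valid string of codepoints, yet it silently drops the accent, while `encoded[:4]` is not even valid UTF-8. Indexing by codepoint only protects against the second problem.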



"Unicode"

Posted Oct 31, 2009 1:11 UTC (Sat) by spitzak (guest, #4593)

You are exactly the sort of misguided person who is destroying UTF-8.

Please explain EXACTLY where the "N" comes from that you are passing to your "go to the Nth UTF-16 code point" function. Answer: it is calculated by looking at all the preceding N-1 "characters". It is therefore a misguided attempt to store an iterator in an integer, and it can be trivially replaced by a real iterator that uses a byte offset or pointer.
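
As a hedged illustration of that argument in Python: the hypothetical helper `codepoint_index_to_offset` below has to scan from the start to turn an index N into a position, whereas a search on the encoded bytes already hands back a byte offset that can be used directly. Valid UTF-8 input is assumed.

```python
def codepoint_index_to_offset(buf: bytes, n: int) -> int:
    """O(n) scan: return the byte offset where the n-th code point (0-based) starts."""
    count = 0
    for offset, byte in enumerate(buf):
        if (byte & 0xC0) != 0x80:      # not a continuation byte: a code point starts here
            if count == n:
                return offset
            count += 1
    if count == n:                     # n equal to the total count means end of buffer
        return len(buf)
    raise IndexError("code point index out of range")


text = "na\u00efve caf\u00e9".encode("utf-8")

# The search result is already a position in the buffer: no counting needed.
hit = text.find("caf\u00e9".encode("utf-8"))
rest = text[hit:].decode("utf-8")
```

The byte offset `hit` plays the role of the iterator; converting it to and from a codepoint index would only add a linear scan in each direction.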

"Unicode"

Posted Oct 31, 2009 1:29 UTC (Sat) by nix (subscriber, #2304)

You're just repeating what foom said in different words, I think.

"Unicode"

Posted Oct 31, 2009 1:29 UTC (Sat) by nix (subscriber, #2304)

Sorry, I misinterpreted you. Of course it's more complicated to iterate
over strings now, but really not much more, and UTF-8 (unlike the
fixed-width multibyte encodings) is easy to resync to if you start from an
arbitrary byte, so things like binary searches in long strings are still
possible with a tiny bit of extra tweaking.
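
The resync step might look like this in Python; `resync` is a hypothetical name, and the sketch assumes valid UTF-8 and an offset inside the buffer. Continuation bytes always have the bit pattern 10xxxxxx, so backing up past them finds the start of the enclosing code point in at most three steps.

```python
def resync(buf: bytes, i: int) -> int:
    """Back up from an arbitrary byte offset i to the start of the code point
    that contains it. Assumes valid UTF-8 and 0 <= i < len(buf)."""
    while i > 0 and (buf[i] & 0xC0) == 0x80:   # 10xxxxxx: continuation byte
        i -= 1
    return i


sample = "a\u20acb".encode("utf-8")    # bytes: 61 | E2 82 AC | 62
midpoint = resync(sample, len(sample) // 2)
```

A binary search would compute its midpoint as a raw byte offset, call something like `resync` on it, and then compare from there, which is the extra tweaking mentioned above.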

And, agreed, the ability to treat a string as a fixed-width array is
really quite unimportant: generally people iterate over strings rather
than leaping to position N. (You meant 'position' or 'offset', though,
not 'codepoint', which is entirely different. Codepoint 'access' isn't
even a particularly meaningful concept: what does it mean to 'access'
ASCII codepoint 65? Codepoints just *are*.)

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds