> Go from O(1) to O(n) on individual codepoint access and suddenly O(n)
> stuff on strings goes to O(n^2) and so on: not remotely good.
That clearly only happens if your language doesn't have such a thing as iterators.
"increment(character_iterator)" is still O(1) even if your underlying representation is
UTF-8. The ability to access an arbitrary numbered Unicode codepoint in a string in
constant time isn't really all that useful.
Unicode codepoints don't really correspond to anything humans care about: splitting
a string in the middle of a grapheme is really just as bad as splitting it in the middle of
the UTF-8 byte sequence for a single codepoint.
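To make the O(1) increment concrete, here is a minimal sketch in Python. The names next_codepoint and codepoints are made up for illustration, and the code assumes the buffer is already valid UTF-8 and that the offset points at a lead byte. The length of each sequence is read straight off its first byte, so stepping forward never rescans the string.

```python
def next_codepoint(buf: bytes, i: int) -> int:
    """Return the byte offset of the codepoint after the one starting at i.

    Hypothetical helper: assumes buf is valid UTF-8 and buf[i] is a lead byte.
    """
    lead = buf[i]
    if lead < 0x80:   # 0xxxxxxx: 1-byte sequence (ASCII)
        return i + 1
    if lead < 0xE0:   # 110xxxxx: 2-byte sequence
        return i + 2
    if lead < 0xF0:   # 1110xxxx: 3-byte sequence
        return i + 3
    return i + 4      # 11110xxx: 4-byte sequence


def codepoints(buf: bytes):
    """Yield the encoded bytes of each codepoint, left to right, O(n) in total."""
    i = 0
    while i < len(buf):
        j = next_codepoint(buf, i)
        yield buf[i:j]
        i = j
```

Walking the whole string this way costs O(n) overall, the same as walking a fixed-width array.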
Posted Oct 31, 2009 1:11 UTC (Sat) by spitzak (guest, #4593)
You are exactly the sort of misguided person who is destroying UTF-8.
Please explain EXACTLY where the "N" comes from that you are passing to your "go to the Nth UTF-16 code point" function. Answer: it is calculated by looking at all of the preceding N-1 "characters". It is therefore a misguided attempt to store an iterator in an integer, and it can be trivially replaced by a real iterator that uses a byte offset or pointer.
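A rough sketch of what that replacement looks like, in Python. The function name split_at is hypothetical, and the code assumes the buffer is valid UTF-8. The search hands back a byte offset, and that offset is used directly for slicing, so no character index is ever computed or converted back.

```python
def split_at(buf: bytes, delimiter: str):
    """Split UTF-8 bytes at the first occurrence of delimiter.

    Hypothetical helper: the position is kept as a byte offset throughout.
    """
    needle = delimiter.encode("utf-8")
    offset = buf.find(needle)      # byte offset, or -1 if not found
    if offset < 0:
        return buf, b""
    # Slicing at a byte offset needs no recount of the preceding characters.
    return buf[:offset], buf[offset + len(needle):]
```

Because UTF-8 lead bytes and continuation bytes use disjoint ranges, a valid encoded needle can only match at a codepoint boundary, so both halves remain valid UTF-8.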
"Unicode"
Posted Oct 31, 2009 1:29 UTC (Sat) by nix (subscriber, #2304)
You're just repeating what foom said in different words, I think.
"Unicode"
Posted Oct 31, 2009 1:29 UTC (Sat) by nix (subscriber, #2304)
Sorry, I misinterpreted you. Of course it's more complicated to iterate
over strings now, but really not much more, and UTF-8 (unlike many
legacy multibyte encodings) is easy to resync if you start from an
arbitrary byte, so things like binary searches in long strings are still
possible with a tiny bit of extra tweaking.
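As a sketch of that extra tweaking, in Python: the helper name resync is invented here, and it assumes valid UTF-8 and an offset inside the buffer. Continuation bytes always look like 10xxxxxx, so from any byte you step back at most three bytes to reach the start of the codepoint.

```python
def resync(buf: bytes, i: int) -> int:
    """Return the offset of the first byte of the codepoint containing byte i.

    Hypothetical helper: assumes buf is valid UTF-8 and 0 <= i < len(buf).
    """
    # 10xxxxxx marks a continuation byte; anything else starts a codepoint.
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i
```

A binary search would call something like resync(buf, (lo + hi) // 2) to turn an arbitrary midpoint into a usable boundary before comparing.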
And, agreed, the ability to treat a string as a fixed-width array is
really quite unimportant: generally people iterate over strings rather
than leaping to position N. (You meant 'position' or 'offset', though,
not 'codepoint', which is entirely different. Codepoint 'access' isn't
even a particularly meaningful concept: what does it mean to 'access'
ASCII codepoint 65? Codepoints just *are*.)