LWN.net Logo

"Unicode"

"Unicode"

Posted Oct 31, 2009 0:27 UTC (Sat) by nix (subscriber, #2304)
In reply to: "Unicode" by foom
Parent article: Proposal: Moratorium on Python language changes

O(1) *anything* is a good property to retain, IMO, especially with regard
to things as fundamental as characters.

Go from O(1) to O(n) on individual codepoint access and suddenly O(n)
stuff on strings goes to O(n^2) and so on: not remotely good.


(Log in to post comments)

"Unicode"

Posted Oct 31, 2009 0:50 UTC (Sat) by foom (subscriber, #14868) [Link]

> Go from O(1) to O(n) on individual codepoint access and suddenly O(n)
> stuff on strings goes to O(n^2) and so on: not remotely good.

That clearly only happens if your language doesn't have such a thing as iterators.
"increment(character_iterator)" is still O(1) even if your underlying representation is
UTF-8. The need to access an arbitrary numbered unicode codepoint in a string in
constant time isn't really all that useful.

Unicode codepoints don't really correspond to anything humans care about...splitting
a string in the middle of a Grapheme is really just as bad as splitting it in the middle of
a UTF-8 codepoint-sequence.

"Unicode"

Posted Oct 31, 2009 1:11 UTC (Sat) by spitzak (guest, #4593) [Link]

You are exactly the sort of misguided person who is destroying UTF-8.

Please explain EXACTLY where the "N" comes from that you are passing to your "go to the Nth UTF-16 code point" function. Answer: it is calculated by looking at all the preceeding N-1 "characters" and therefore it is a misguided attempt to store an iterator in a integer, and that it can be trivially replaced by a real iterator that uses a byte offset or pointer.

"Unicode"

Posted Oct 31, 2009 1:29 UTC (Sat) by nix (subscriber, #2304) [Link]

You're just repeating what foom said in different words, I think.

"Unicode"

Posted Oct 31, 2009 1:29 UTC (Sat) by nix (subscriber, #2304) [Link]

Sorry, I misinterpreted you. Of course it's more complicated to iterate
over strings now, but really not much more, and UTF-8 (unlike the
fixed-width multibyte encodings) is easy to resync to if you start from an
arbitrary byte, so things like binary searches in long strings are still
possible with a tiny bit of extra tweaking.

And, agreed, the ability to treat a string as a fixed-width array is
really quite unimportant: generally people iterate over strings rather
than leaping to position N. (You meant 'position' or 'offset', though,
not 'codepoint', which is entirely different. Codepoint 'access' isn't
even a particularly meaningful concept: what does it mean to 'access'
ASCII codepoint 65? Codepoints just *are*.)

"Unicode"

Posted Nov 2, 2009 9:46 UTC (Mon) by njs (subscriber, #40338) [Link]

You can store text in UTF-8 and still have O(log n) random access (and insertion!) by storing your string in an charpoint-offset indexed tree. No-one seems to bother implementing this, though, presumably because it's complicated, and the overhead only amortizes out when you need to do random access/insertion on large chunks of text, which turns out to be a fairly rare need in practice. Also, converting from UCS-4 to UTF-8 is a pessimization for CJK.

"Unicode"

Posted Nov 2, 2009 18:49 UTC (Mon) by spitzak (guest, #4593) [Link]

UTF-8 is self-synchronizing, meaning that you can find character boundaries in O(1) time and thus you can do random access/insertion of large chunks of text, even if the insertion point is somehow generated at random.

What you can't do is define the insertion point as "after N repetitions of this regexp" which is really what is wanted when people say "characters". But no other data structure would dare to require this as the basic iterator. For some reason though text makes otherwise intelligent programmers into morons and they are just blind to the obvious solution.

As for CJK, I think you meant UTF-16, not UCS-4. Converting from UCS-4 to anything will save memory, as it uses 4 bytes per Unicode code point. UTF-16 does use only 2 bytes for 0x800-0xFFFF while UTF-8 uses 3. However UTF-8 uses one byte for 0x00-0x7F and UTF-16 uses 2, so in fact the text will be smaller if there are more of these. In real CJK texts (ie not just single words) there are, because the ASCII range has spaces, newlines, numbers, all english quoted words, and all the XML markup, while the 3-byte ones are generally one-per-word and thus outnumbered. I believe however that some lengthy east-Indian texts, which use a phonetic alphabet that unfortunately requires 3-bytes per letter in UTF-8, is less efficient in UTF-8 than in UTF-16.

In any case talking about compression is silly, as you can use ZIP to turn any large document into far less than 1 byte per Unicode code point.

"Unicode"

Posted Nov 2, 2009 20:31 UTC (Mon) by njs (subscriber, #40338) [Link]

> UTF-8 is self-synchronizing, meaning that you can find character boundaries in O(1) time and thus you can do random access/insertion of large chunks of text, even if the insertion point is somehow generated at random.

Yes, I know. I thought we were talking about random access using character offsets, rather than byte offsets, though -- at least, that's what I was talking about in my comment. My point is that you can still do better than O(n) for arbitrary character access.

I don't really understand what point you're making about regexps -- all the utf-8 apis I know provide character iterators. I am, though, skeptical that the authors are really all morons, and not sure that claiming they are really adds anything to the conversation.

Re: UTF-8 vs. UCS-4/UTF-16: You're right, I misremembered. UTF-8 and UTF-16 are identical in terms of the hassle of doing random access indexing, and both are more memory-efficient than UCS-4, so I guess everything I said applies to both.

I mentioned compression because the original poster complained that UCS-4 was wasteful of memory; one of the motivations for using UTF-8 instead is that it gives some effective compression. Obviously for long-term compressed storage there are better solutions, but that's not what we're talking about.

"Unicode"

Posted Nov 3, 2009 4:01 UTC (Tue) by dvdeug (subscriber, #10998) [Link]

If you really want compressed in-core string storage, there's SCSU or BOCU-1. The memory versus code tradeoff is generally not considered worth it, though.

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds