LWN.net Logo

"Unicode"

"Unicode"

Posted Nov 2, 2009 9:46 UTC (Mon) by njs (subscriber, #40338)
In reply to: "Unicode" by nix
Parent article: Proposal: Moratorium on Python language changes

You can store text in UTF-8 and still have O(log n) random access (and insertion!) by storing your string in an charpoint-offset indexed tree. No-one seems to bother implementing this, though, presumably because it's complicated, and the overhead only amortizes out when you need to do random access/insertion on large chunks of text, which turns out to be a fairly rare need in practice. Also, converting from UCS-4 to UTF-8 is a pessimization for CJK.


(Log in to post comments)

"Unicode"

Posted Nov 2, 2009 18:49 UTC (Mon) by spitzak (guest, #4593) [Link]

UTF-8 is self-synchronizing, meaning that you can find character boundaries in O(1) time and thus you can do random access/insertion of large chunks of text, even if the insertion point is somehow generated at random.

What you can't do is define the insertion point as "after N repetitions of this regexp" which is really what is wanted when people say "characters". But no other data structure would dare to require this as the basic iterator. For some reason though text makes otherwise intelligent programmers into morons and they are just blind to the obvious solution.

As for CJK, I think you meant UTF-16, not UCS-4. Converting from UCS-4 to anything will save memory, as it uses 4 bytes per Unicode code point. UTF-16 does use only 2 bytes for 0x800-0xFFFF while UTF-8 uses 3. However UTF-8 uses one byte for 0x00-0x7F and UTF-16 uses 2, so in fact the text will be smaller if there are more of these. In real CJK texts (ie not just single words) there are, because the ASCII range has spaces, newlines, numbers, all english quoted words, and all the XML markup, while the 3-byte ones are generally one-per-word and thus outnumbered. I believe however that some lengthy east-Indian texts, which use a phonetic alphabet that unfortunately requires 3-bytes per letter in UTF-8, is less efficient in UTF-8 than in UTF-16.

In any case talking about compression is silly, as you can use ZIP to turn any large document into far less than 1 byte per Unicode code point.

"Unicode"

Posted Nov 2, 2009 20:31 UTC (Mon) by njs (subscriber, #40338) [Link]

> UTF-8 is self-synchronizing, meaning that you can find character boundaries in O(1) time and thus you can do random access/insertion of large chunks of text, even if the insertion point is somehow generated at random.

Yes, I know. I thought we were talking about random access using character offsets, rather than byte offsets, though -- at least, that's what I was talking about in my comment. My point is that you can still do better than O(n) for arbitrary character access.

I don't really understand what point you're making about regexps -- all the utf-8 apis I know provide character iterators. I am, though, skeptical that the authors are really all morons, and not sure that claiming they are really adds anything to the conversation.

Re: UTF-8 vs. UCS-4/UTF-16: You're right, I misremembered. UTF-8 and UTF-16 are identical in terms of the hassle of doing random access indexing, and both are more memory-efficient than UCS-4, so I guess everything I said applies to both.

I mentioned compression because the original poster complained that UCS-4 was wasteful of memory; one of the motivations for using UTF-8 instead is that it gives some effective compression. Obviously for long-term compressed storage there are better solutions, but that's not what we're talking about.

"Unicode"

Posted Nov 3, 2009 4:01 UTC (Tue) by dvdeug (subscriber, #10998) [Link]

If you really want compressed in-core string storage, there's SCSU or BOCU-1. The memory versus code tradeoff is generally not considered worth it, though.

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds