> UTF-8 is self-synchronizing, meaning that you can find character boundaries in O(1) time and thus you can do random access/insertion of large chunks of text, even if the insertion point is somehow generated at random.
Yes, I know. I thought we were talking about random access using character offsets, rather than byte offsets, though -- at least, that's what I was talking about in my comment. My point is that you can still do better than O(n) for arbitrary character access.
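To make that concrete, here is a rough sketch of one way to get sub-O(n) access by character offset: build a sparse table of byte offsets once, then scan at most one stride's worth of characters per lookup. The names `Utf8Index`, `char_start`, and the `stride` parameter are made up for illustration and don't come from any particular library. It assumes the input is valid UTF-8.

```python
def char_start(data: bytes, pos: int) -> int:
    """Back up from an arbitrary byte offset to the start of its character.

    This is the self-synchronizing property: continuation bytes always
    look like 10xxxxxx, so in valid UTF-8 at most 3 steps are needed.
    """
    while pos > 0 and data[pos] & 0xC0 == 0x80:
        pos -= 1
    return pos


class Utf8Index:
    """Hypothetical sparse index: one checkpoint every `stride` characters."""

    def __init__(self, data: bytes, stride: int = 64):
        self.data = data
        self.stride = stride
        # checkpoints[k] is the byte offset of character number k * stride.
        self.checkpoints = []
        count = 0
        for pos, byte in enumerate(data):
            if byte & 0xC0 != 0x80:  # not a continuation byte, so a new character
                if count % stride == 0:
                    self.checkpoints.append(pos)
                count += 1
        self.length = count  # number of characters (code points)

    def byte_offset(self, char_index: int) -> int:
        """Byte offset of the given character, in O(stride) time."""
        if not 0 <= char_index < self.length:
            raise IndexError(char_index)
        pos = self.checkpoints[char_index // self.stride]
        for _ in range(char_index % self.stride):
            pos += 1
            while self.data[pos] & 0xC0 == 0x80:  # skip continuation bytes
                pos += 1
        return pos

    def char_at(self, char_index: int) -> str:
        start = self.byte_offset(char_index)
        end = start + 1
        while end < len(self.data) and self.data[end] & 0xC0 == 0x80:
            end += 1
        return self.data[start:end].decode("utf-8")
```

Building the table is a single O(n) pass and costs one integer per `stride` characters; after that each lookup is bounded by the stride rather than by the length of the string. Insertions would need the table updated, which this sketch does not attempt.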
I don't really understand what point you're making about regexps -- all the UTF-8 APIs I know provide character iterators. I am, though, skeptical that the authors are really all morons, and I'm not sure that claiming they are adds anything to the conversation.
Re: UTF-8 vs. UCS-4/UTF-16: You're right, I misremembered. UTF-8 and UTF-16 are both variable-width, so they are equally awkward for random-access indexing by character, and both are more memory-efficient than UCS-4, so I guess everything I said applies to both.
I mentioned compression because the original poster complained that UCS-4 was wasteful of memory; one of the motivations for using UTF-8 instead is that it gives some effective compression. Obviously for long-term compressed storage there are better solutions, but that's not what we're talking about.
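A quick way to see the size difference is to encode the same text three ways and count bytes. This is only an illustration: `encoded_sizes` is a made-up helper, and Python's `utf-32-le` codec stands in for UCS-4 (the `-le` variants are used so no byte-order mark is counted).

```python
def encoded_sizes(text: str) -> dict:
    """Byte counts for the same text in three encodings (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "ucs-4": len(text.encode("utf-32-le")),
    }


for sample in ("hello", "日本語"):
    print(sample, encoded_sizes(sample))
```

For ASCII text UTF-8 is a quarter the size of UCS-4; for the CJK sample UTF-16 is the smallest of the three, and both variable-width encodings still beat UCS-4.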
Posted Nov 3, 2009 4:01 UTC (Tue) by dvdeug (subscriber, #10998)
If you really want compressed in-core string storage, there's SCSU or BOCU-1. The memory saved is generally not considered worth the extra code complexity, though.