"Unicode"

Posted Nov 2, 2009 20:31 UTC (Mon) by njs (subscriber, #40338)
In reply to: "Unicode" by spitzak
Parent article: Proposal: Moratorium on Python language changes

> UTF-8 is self-synchronizing, meaning that you can find character boundaries in O(1) time and thus you can do random access/insertion of large chunks of text, even if the insertion point is somehow generated at random.

Yes, I know. I thought we were talking about random access using character offsets, rather than byte offsets, though -- at least, that's what I was talking about in my comment. My point is that you can still do better than O(n) for arbitrary character access.
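To make the distinction concrete, here is a minimal Python sketch of both ideas. It is an illustration only and not necessarily the scheme either poster had in mind. `sync_to_boundary` shows the self-synchronizing property for byte offsets. `IndexedUtf8` is a hypothetical helper that stores the byte offset of every `stride`-th character, so a lookup by character offset scans at most `stride - 1` characters instead of the whole string. The names and the default stride of 64 are assumptions made for this example.

```python
def sync_to_boundary(buf: bytes, pos: int) -> int:
    """Back up from an arbitrary byte offset to the start of its code point.

    Continuation bytes have the form 0b10xxxxxx, and valid UTF-8 never has
    more than three in a row, so this loop is O(1).
    """
    while 0 < pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


class IndexedUtf8:
    """UTF-8 bytes plus a sparse table of character-offset checkpoints.

    Hypothetical helper: checkpoints[k] is the byte offset of character
    number k * stride. Extra memory is O(n / stride) integers.
    """

    def __init__(self, text: str, stride: int = 64):
        self.buf = text.encode("utf-8")
        self.stride = stride
        self.checkpoints = []
        char_index = 0
        for byte_offset, b in enumerate(self.buf):
            if (b & 0xC0) != 0x80:  # not a continuation byte: a new character starts here
                if char_index % stride == 0:
                    self.checkpoints.append(byte_offset)
                char_index += 1
        self.length = char_index

    def byte_offset(self, char_index: int) -> int:
        """Map a character offset to a byte offset in O(stride) time."""
        if not 0 <= char_index < self.length:
            raise IndexError(char_index)
        pos = self.checkpoints[char_index // self.stride]
        # Walk forward over the remaining characters after the checkpoint.
        for _ in range(char_index % self.stride):
            pos += 1
            while (self.buf[pos] & 0xC0) == 0x80:
                pos += 1
        return pos

    def char_at(self, char_index: int) -> str:
        """Return the character at the given character offset."""
        start = self.byte_offset(char_index)
        end = start + 1
        while end < len(self.buf) and (self.buf[end] & 0xC0) == 0x80:
            end += 1
        return self.buf[start:end].decode("utf-8")
```

With a fixed stride, each lookup is bounded by a constant rather than by the string length; a smaller stride trades more index memory for shorter scans.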

I don't really understand what point you're making about regexps -- all the UTF-8 APIs I know provide character iterators. I am, though, skeptical that the authors are really all morons, and I'm not sure that claiming they are adds anything to the conversation.

Re: UTF-8 vs. UCS-4/UTF-16: You're right, I misremembered. UTF-8 and UTF-16 are both variable-width encodings, so they are identical in terms of the hassle of doing random access indexing, and both are more memory-efficient than UCS-4, so I guess everything I said applies to both.
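As a rough illustration of the size comparison, the following Python sketch counts the bytes each encoding needs for a string. `encoded_sizes` is a name made up for this example, and the little-endian codec variants are used only so that no byte-order mark is counted.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded length in bytes of text under three encodings."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # -le: no BOM prepended
        "utf-32": len(text.encode("utf-32-le")),  # fixed 4 bytes per code point
    }


ascii_sizes = encoded_sizes("hello")       # 5 code points, all ASCII
euro_sizes = encoded_sizes("\u20ac")       # U+20AC, inside the BMP
clef_sizes = encoded_sizes("\U0001D11E")   # U+1D11E, outside the BMP
```

The last case is the one that makes UTF-16 variable-width: a code point outside the Basic Multilingual Plane takes a surrogate pair (two 16-bit units), so indexing by code unit and indexing by character diverge just as they do in UTF-8. For such characters all three encodings use four bytes, so the memory advantage over UCS-4 depends on the text.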

I mentioned compression because the original poster complained that UCS-4 was wasteful of memory; one of the motivations for using UTF-8 instead is that it gives some effective compression. Obviously for long-term compressed storage there are better solutions, but that's not what we're talking about.


"Unicode"

Posted Nov 3, 2009 4:01 UTC (Tue) by dvdeug (subscriber, #10998)

If you really want compressed in-core string storage, there's SCSU or BOCU-1. The memory saved is generally not considered worth the extra code complexity, though.

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds