UTF8 vs UTF32

Posted Mar 24, 2010 17:57 UTC (Wed) by foom (subscriber, #14868)
In reply to: UTF8 vs UTF32 by dlang
Parent article: Resetting PHP 6

> what it is good for is if you are doing a lot of string manipulation by position (i.e. you know that
> the field you want starts at position 50)

I've never actually heard of a good reason for anyone to want that. Remember that in Unicode,
what the user thinks of a character might in fact be a composition of many unicode codepoints (e.g.
base character + accent). Accessing the Nth codepoint is an almost entirely useless thing to
optimize your storage for.

And yet, most programming languages have made this mistake, because they treat strings as an
array of codepoints, rather than as a unique data structure. All you really want is forward-and-
backward iterable and random access given an iterator.

UTF8 vs UTF32

Posted Mar 24, 2010 19:49 UTC (Wed) by flewellyn (subscriber, #5047) [Link] (3 responses)

>I've never actually heard of a good reason for anyone to want that.

If you have to translate from an old, COBOL-style fixed-width data format into something modern and/or at least reasonably sane, that's a very good reason to want just that.

UTF8 vs UTF32

Posted Mar 24, 2010 20:44 UTC (Wed) by foom (subscriber, #14868) [Link] (2 responses)

I'm gonna bet the old COBOL stuff is not actually outputting Unicode. :)

UTF8 vs UTF32

Posted Mar 24, 2010 20:45 UTC (Wed) by flewellyn (subscriber, #5047) [Link] (1 responses)

Quite. But that IS a case where you want to do string manipulation by position.

UTF8 vs UTF32

Posted Mar 25, 2010 14:14 UTC (Thu) by dgm (subscriber, #49227) [Link]

Fine then. If you limit yourself to the ASCII subset, UTF-8 offers you that. Is that enough for COBOL?

Iterators vs indices

Posted Mar 25, 2010 23:08 UTC (Thu) by butlerm (subscriber, #13312) [Link] (9 responses)

I've never actually heard of a good reason for anyone to want that.

Most languages (and SQL in particular) work exclusively using string indexes. You cannot use an iterator in a functional context. SQL doesn't have string iterators and never will. Iterators are a procedural thing.

Numeric indices are the only language independent option available for specifying and extracting parts of strings. That is why they are universal. I wouldn't use a language that didn't support them, in large part due to the difficulty of translating idioms from languages that do.

As a consequence, anything that needs to be done to efficiently present a string of any reasonable size as a linear array of _characters_ (or at least code points) is the language's problem, not the programmer's. That is the approach SQL takes, and from an application programmer's point of view it works very well.

That is not to say that an iterator interface shouldn't be provided (in procedural languages) as well, but of the two an index based interface is more fundamental.

Iterators vs indices

Posted Mar 26, 2010 13:49 UTC (Fri) by foom (subscriber, #14868) [Link] (1 responses)

I think you must be using a funny definition of functional. There is absolutely nothing that prevents an iterator-based API from working in a functional language. And of course most functional languages have many such APIs.

Let's take a traditional example: singly-linked-lists are a quite common data-structure in functional (or mostly-functional) languages like Haskell, Scheme, etc. Yet, you don't index them by position (that of course is available if you need it, but it's time O(n), so you don't normally want to use it). Instead, you use an iterator, which in this case is a pointer to the current element.

If anyone suggested that the primary access method for a singly linked list should be by integer position, they'd be rightly told that's insane -- iterating over the list would take O(n^2)!

Now, maybe your real point was simply that existing languages already have a poorly-designed Unicode String API that they have to keep compatiblity with -- and that API doesn't include iterators. So, they therefore have constraints they need to preserve, such as O(1) access by character index, because existing programs require it.

I won't argue with that, but I still assert it's not actually a useful feature for a unicode string API, in absence of the API-compatibility requirement.

Iterators vs indices

Posted Mar 26, 2010 22:09 UTC (Fri) by spitzak (guest, #4593) [Link]

If you really can't get away from the integer index, a solution is to have a string format that stores the most recent index computed and where it was in the string. Then when asked for a new index it will move from that previous position to the new one if the previous position is less that 2x the new one.

For the vast majority of cases where each integer starting from zero is used to get the "character" this would put the implementation back to O(1). And it would allow more complex accessors, such as "what error is here".

Iterators vs indices

Posted Mar 30, 2010 6:57 UTC (Tue) by njs (subscriber, #40338) [Link] (6 responses)

Another option is to store a string as a tree structure, where the leaves are some reasonable-sized chunks of bytes (to amortize storage overhead), and the tree nodes are annotated with the number of characters/bytes/code points/lines/whatever that occur underneath them. This allows random O(log n) access by character/byte/... offset. (You can maintain several different sorts of counts, and get fast access for all of them in the same data structure.) You also get cheap random insertion/deletion, which is an important operation for some tasks (e.g., editor buffers!) but horrendously slow for arrays.

For some reason nobody does this, though.

Iterators vs indices

Posted Mar 30, 2010 7:11 UTC (Tue) by dlang (guest, #313) [Link] (3 responses)

the biggest reason nobody stores strings that way is the overhead. it requires many pointers which end up making UTF-32 look compact by comparison.

besides, as noted earlier in this thread, most uses of strings really don't care how they break apart, they are almost always used as-is (or at most with one step of parsing, usually on whitespace, on input) as such, anything more than the most compact representation ends up costing significantly more in memory size (and therefor cache space) than you gain with any string manipulation that you do

Google Wave actually stores strings the way you are suggesting, or did when I saw the presentation on it last year, but I think that doing so will keep it from being used for anything beyond trivial uses.

Iterators vs indices

Posted Mar 30, 2010 7:46 UTC (Tue) by njs (subscriber, #40338) [Link] (2 responses)

> the biggest reason nobody stores strings that way is the overhead. it requires many pointers which end up making UTF-32 look compact by comparison.

The memory overhead is certainly not as high as UCS-32 (at least for strings where UTF-8 has lower overhead than UCS-32 to start with) -- you need something like 3*log_2(n) words of overhead, but n is the number of "chunks", not bytes, and a reasonable chunk-size is in the hundreds of bytes, at least. Within a chunk you revert to linear behavior, but that's not so bad, IIUC on modern CPUs linear-time is not much worse than constant-time when it comes to accessing short arrays.

Most strings are short, and with proper tuning they'd probably fit into one chunk anyway, so the overhead is nearly nil.

But you're right, there is some overhead -- not that this stops people from using scripting languages -- and a lot of tricky implementation, and simple solutions are often good enough.

I don't understand what you mean about Google Wave, though. A) Isn't it mostly a protocol? Where do string storage APIs come in? B) It's exactly the non-trivial uses -- where you have large, mutable strings -- that arrays and linear-time iteration don't scale to.

Iterators vs indices

Posted Mar 31, 2010 2:05 UTC (Wed) by dlang (guest, #313) [Link] (1 responses)

I understood the poster to mean using pointers for individual characters (how else can you do inserts at any point in the string without having to know how it's structured)

google wave uses the jabber protocol, but in it's documents it doesn't store words, it stores the letters individually, grouped togeather so that they can be changed individually (or so it was explained by the google rep giving the presentation I was at)

Iterators vs indices

Posted Mar 31, 2010 4:30 UTC (Wed) by njs (subscriber, #40338) [Link]

> I understood the poster to mean using pointers for individual characters (how else can you do inserts at any point in the string without having to know how it's structured)

I'm afraid I don't understand at all. I *am* that poster, and the data structure I described can do O(log n) inserts without pointers to individual characters. Perhaps I am just explaining badly?

Iterators vs indices

Posted Mar 30, 2010 8:12 UTC (Tue) by nix (subscriber, #2304) [Link] (1 responses)

Didn't the GNU C++ ext/rope work in exactly this way?

Iterators vs indices

Posted Mar 30, 2010 17:06 UTC (Tue) by njs (subscriber, #40338) [Link]

No, interestingly -- they are more complicated and less like a conventional tree structure than one would think: http://www.sgi.com/tech/stl/ropeimpl.html

The most important difference is that ropes are happy -- indeed, delighted -- to store very long strings inside a single tree node when they have the chance, because their goal is just to amortize mutation operations, not to provide efficient access by semi-arbitrary index rules.