Iterators vs indices

Posted Mar 30, 2010 6:57 UTC (Tue) by njs (subscriber, #40338)
In reply to: Iterators vs indices by butlerm
Parent article: Resetting PHP 6

Another option is to store a string as a tree structure, where the leaves are some reasonable-sized chunks of bytes (to amortize storage overhead), and the tree nodes are annotated with the number of characters/bytes/code points/lines/whatever that occur underneath them. This allows random O(log n) access by character/byte/... offset. (You can maintain several different sorts of counts, and get fast access for all of them in the same data structure.) You also get cheap random insertion/deletion, which is an important operation for some tasks (e.g., editor buffers!) but horrendously slow for arrays.
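A minimal sketch of the idea, under assumptions not in the comment: a binary tree, leaves holding UTF-8 chunks cut on code point boundaries, and every node annotated with two counts (bytes and code points). The names `Leaf`, `Node`, `build`, `char_at` and `byte_offset_of_char` are made up for illustration, and the chunk size is arbitrary.

```python
class Leaf:
    """A chunk of UTF-8 bytes that always holds whole code points."""
    def __init__(self, data: bytes):
        self.data = data
        self.nbytes = len(data)
        self.nchars = len(data.decode("utf-8"))


class Node:
    """Internal node annotated with the totals of everything beneath it."""
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.nbytes = left.nbytes + right.nbytes
        self.nchars = left.nchars + right.nchars


def build(text: str, chunk_chars: int = 64):
    """Cut text into chunks and pair them up level by level into a tree."""
    level = [Leaf(text[i:i + chunk_chars].encode("utf-8"))
             for i in range(0, len(text), chunk_chars)] or [Leaf(b"")]
    while len(level) > 1:
        paired = [Node(level[i], level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd one out moves up unchanged
            paired.append(level[-1])
        level = paired
    return level[0]


def char_at(root, index: int) -> str:
    """Code point at a given index: O(log n) descent, then a linear scan of one chunk."""
    if not 0 <= index < root.nchars:
        raise IndexError(index)
    node = root
    while isinstance(node, Node):
        if index < node.left.nchars:
            node = node.left
        else:
            index -= node.left.nchars
            node = node.right
    return node.data.decode("utf-8")[index]


def byte_offset_of_char(root, index: int) -> int:
    """Byte offset of a code point, using both counts during the same descent."""
    if not 0 <= index < root.nchars:
        raise IndexError(index)
    node, offset = root, 0
    while isinstance(node, Node):
        if index < node.left.nchars:
            node = node.left
        else:
            index -= node.left.nchars
            offset += node.left.nbytes
            node = node.right
    return offset + len(node.data.decode("utf-8")[:index].encode("utf-8"))
```

Adding a third count (lines, say) would mean one more field per node and one more subtraction per step of the descent.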

For some reason nobody does this, though.

Iterators vs indices

Posted Mar 30, 2010 7:11 UTC (Tue) by dlang (guest, #313)

The biggest reason nobody stores strings that way is the overhead. It requires many pointers, which end up making UTF-32 look compact by comparison.

Besides, as noted earlier in this thread, most uses of strings really don't care how they break apart; they are almost always used as-is (or at most with one step of parsing, usually on whitespace, on input). As such, anything more than the most compact representation ends up costing significantly more in memory size (and therefore cache space) than you gain with any string manipulation that you do.

Google Wave actually stores strings the way you are suggesting, or did when I saw the presentation on it last year, but I think that doing so will keep it from being used for anything beyond trivial uses.

Iterators vs indices

Posted Mar 30, 2010 7:46 UTC (Tue) by njs (subscriber, #40338)

> The biggest reason nobody stores strings that way is the overhead. It requires many pointers, which end up making UTF-32 look compact by comparison.

The memory overhead is certainly not as high as UTF-32's (at least for strings where UTF-8 is more compact than UTF-32 to start with) -- you need something like 3*log_2(n) words of overhead, but n is the number of "chunks", not bytes, and a reasonable chunk size is in the hundreds of bytes, at least. Within a chunk you revert to linear behavior, but that's not so bad; IIUC, on modern CPUs linear time is not much worse than constant time when it comes to accessing short arrays.

Most strings are short, and with proper tuning they'd probably fit into one chunk anyway, so the overhead is nearly nil.
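A back-of-the-envelope sketch of that comparison, under assumptions that are not in the comment: a binary tree, 256-byte chunks, four 8-byte words per internal node (two child pointers plus two counts), and per-leaf bookkeeping ignored. It counts every internal node in the tree, which is a more pessimistic accounting than the per-path figure above. The function names are hypothetical.

```python
import math


def tree_overhead_bytes(text_bytes, chunk_bytes=256, words_per_node=4, word_bytes=8):
    """Bytes spent on internal nodes for a binary tree over fixed-size chunks."""
    chunks = max(1, math.ceil(text_bytes / chunk_bytes))
    internal_nodes = chunks - 1      # a binary tree with k leaves has k - 1 internal nodes
    return internal_nodes * words_per_node * word_bytes


def utf32_overhead_bytes(ascii_chars):
    """Extra bytes UTF-32 needs over UTF-8 for pure ASCII text (4 bytes vs 1)."""
    return ascii_chars * 4 - ascii_chars


one_mib = 1 << 20
print(tree_overhead_bytes(one_mib) / one_mib)    # roughly 0.125, i.e. about 12.5% extra
print(utf32_overhead_bytes(one_mib) / one_mib)   # 3.0, i.e. 300% extra
print(tree_overhead_bytes(100))                  # 0: a short string fits in a single chunk
```

Text dominated by code points that need four bytes in UTF-8 would close the gap with UTF-32, which is the caveat in the parenthetical above.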

But you're right, there is some overhead -- not that this stops people from using scripting languages -- and a lot of tricky implementation, and simple solutions are often good enough.

I don't understand what you mean about Google Wave, though. A) Isn't it mostly a protocol? Where do string storage APIs come in? B) It's exactly the non-trivial uses -- where you have large, mutable strings -- that arrays and linear-time iteration don't scale to.

Iterators vs indices

Posted Mar 31, 2010 2:05 UTC (Wed) by dlang (guest, #313)

I understood the poster to mean using pointers for individual characters (how else can you do inserts at any point in the string without having to know how it's structured?).

Google Wave uses the Jabber protocol, but in its documents it doesn't store words; it stores the letters individually, grouped together so that they can be changed individually (or so it was explained by the Google rep giving the presentation I was at).

Iterators vs indices

Posted Mar 31, 2010 4:30 UTC (Wed) by njs (subscriber, #40338)

> I understood the poster to mean using pointers for individual characters (how else can you do inserts at any point in the string without having to know how it's structured?).

I'm afraid I don't understand at all. I *am* that poster, and the data structure I described can do O(log n) inserts without pointers to individual characters. Perhaps I am just explaining badly?
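To make that concrete, here is a rough sketch of an insert at an arbitrary character offset where the only pointers are between tree nodes and chunks. Assumptions for illustration: chunks are plain Python `str` objects, the chunk limit is tiny so that splits are visible, and rebalancing is left out, so the O(log n) bound only holds while the tree stays balanced (a real implementation would use something like a B-tree). `MAX_CHUNK`, `from_text`, `insert` and `to_str` are made-up names.

```python
MAX_CHUNK = 8  # deliberately tiny; a realistic value would be hundreds of bytes


class Leaf:
    def __init__(self, text: str):
        self.text = text
        self.nchars = len(text)


class Node:
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.nchars = left.nchars + right.nchars


def from_text(text: str):
    """Build a balanced subtree by halving until each piece fits in one chunk."""
    if len(text) <= MAX_CHUNK:
        return Leaf(text)
    mid = len(text) // 2
    return Node(from_text(text[:mid]), from_text(text[mid:]))


def insert(node, index: int, s: str):
    """Return a tree with s inserted at character offset index.

    Only the nodes on the path from the root to one leaf are rebuilt;
    every other chunk is shared with the old tree untouched.
    """
    if isinstance(node, Leaf):
        # Splice inside a single small chunk, splitting it if it grew too big.
        return from_text(node.text[:index] + s + node.text[index:])
    if index <= node.left.nchars:
        return Node(insert(node.left, index, s), node.right)
    return Node(node.left, insert(node.right, index - node.left.nchars, s))


def to_str(node) -> str:
    if isinstance(node, Leaf):
        return node.text
    return to_str(node.left) + to_str(node.right)
```

The caller only supplies an offset; the counts stored in the nodes do the work of finding the right chunk, so nothing outside the tree needs to know how the string is split up.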

Iterators vs indices

Posted Mar 30, 2010 8:12 UTC (Tue) by nix (subscriber, #2304)

Didn't the GNU C++ ext/rope work in exactly this way?

Iterators vs indices

Posted Mar 30, 2010 17:06 UTC (Tue) by njs (subscriber, #40338)

No, interestingly -- they are more complicated and less like a conventional tree structure than one would think: http://www.sgi.com/tech/stl/ropeimpl.html

The most important difference is that ropes are happy -- indeed, delighted -- to store very long strings inside a single tree node when they have the chance, because their goal is just to amortize mutation operations, not to provide efficient access by semi-arbitrary index rules.