UTF8 vs UTF32
UTF8 vs UTF32
Posted Mar 24, 2010 17:57 UTC (Wed) by foom (subscriber, #14868)In reply to: UTF8 vs UTF32 by dlang
Parent article: Resetting PHP 6
> the field you want starts at position 50)
I've never actually heard of a good reason for anyone to want that. Remember that in Unicode,
what the user thinks of a character might in fact be a composition of many unicode codepoints (e.g.
base character + accent). Accessing the Nth codepoint is an almost entirely useless thing to
optimize your storage for.
And yet, most programming languages have made this mistake, because they treat strings as an
array of codepoints, rather than as a unique data structure. All you really want is forward-and-
backward iterable and random access given an iterator.
Posted Mar 24, 2010 19:49 UTC (Wed)
by flewellyn (subscriber, #5047)
[Link] (3 responses)
If you have to translate from an old, COBOL-style fixed-width data format into something modern and/or at least reasonably sane, that's a very good reason to want just that.
Posted Mar 24, 2010 20:44 UTC (Wed)
by foom (subscriber, #14868)
[Link] (2 responses)
Posted Mar 24, 2010 20:45 UTC (Wed)
by flewellyn (subscriber, #5047)
[Link] (1 responses)
Posted Mar 25, 2010 14:14 UTC (Thu)
by dgm (subscriber, #49227)
[Link]
Posted Mar 25, 2010 23:08 UTC (Thu)
by butlerm (subscriber, #13312)
[Link] (9 responses)
Posted Mar 26, 2010 13:49 UTC (Fri)
by foom (subscriber, #14868)
[Link] (1 responses)
Let's take a traditional example: singly-linked-lists are a quite common data-structure in
functional (or mostly-functional) languages like Haskell, Scheme, etc. Yet, you don't index them
by position (that of course is available if you need it, but it's time O(n), so you don't normally
want to use it). Instead, you use an iterator, which in this case is a pointer to the current element.
If anyone suggested that the primary access method for a singly linked list should be by integer
position, they'd be rightly told that's insane -- iterating over the list would take O(n^2)!
Now, maybe your real point was simply that existing languages already have a poorly-designed
Unicode String API that they have to keep compatiblity with -- and that API doesn't include
iterators. So, they therefore have constraints they need to preserve, such as O(1) access by
character index, because existing programs require it.
I won't argue with that, but I still assert it's not actually a useful feature for a unicode string API,
in absence of the API-compatibility requirement.
Posted Mar 26, 2010 22:09 UTC (Fri)
by spitzak (guest, #4593)
[Link]
For the vast majority of cases where each integer starting from zero is used to get the "character" this would put the implementation back to O(1). And it would allow more complex accessors, such as "what error is here".
Posted Mar 30, 2010 6:57 UTC (Tue)
by njs (subscriber, #40338)
[Link] (6 responses)
For some reason nobody does this, though.
Posted Mar 30, 2010 7:11 UTC (Tue)
by dlang (guest, #313)
[Link] (3 responses)
besides, as noted earlier in this thread, most uses of strings really don't care how they break apart, they are almost always used as-is (or at most with one step of parsing, usually on whitespace, on input) as such, anything more than the most compact representation ends up costing significantly more in memory size (and therefor cache space) than you gain with any string manipulation that you do
Google Wave actually stores strings the way you are suggesting, or did when I saw the presentation on it last year, but I think that doing so will keep it from being used for anything beyond trivial uses.
Posted Mar 30, 2010 7:46 UTC (Tue)
by njs (subscriber, #40338)
[Link] (2 responses)
The memory overhead is certainly not as high as UCS-32 (at least for strings where UTF-8 has lower overhead than UCS-32 to start with) -- you need something like 3*log_2(n) words of overhead, but n is the number of "chunks", not bytes, and a reasonable chunk-size is in the hundreds of bytes, at least. Within a chunk you revert to linear behavior, but that's not so bad, IIUC on modern CPUs linear-time is not much worse than constant-time when it comes to accessing short arrays.
Most strings are short, and with proper tuning they'd probably fit into one chunk anyway, so the overhead is nearly nil.
But you're right, there is some overhead -- not that this stops people from using scripting languages -- and a lot of tricky implementation, and simple solutions are often good enough.
I don't understand what you mean about Google Wave, though. A) Isn't it mostly a protocol? Where do string storage APIs come in? B) It's exactly the non-trivial uses -- where you have large, mutable strings -- that arrays and linear-time iteration don't scale to.
Posted Mar 31, 2010 2:05 UTC (Wed)
by dlang (guest, #313)
[Link] (1 responses)
google wave uses the jabber protocol, but in it's documents it doesn't store words, it stores the letters individually, grouped togeather so that they can be changed individually (or so it was explained by the google rep giving the presentation I was at)
Posted Mar 31, 2010 4:30 UTC (Wed)
by njs (subscriber, #40338)
[Link]
I'm afraid I don't understand at all. I *am* that poster, and the data structure I described can do O(log n) inserts without pointers to individual characters. Perhaps I am just explaining badly?
Posted Mar 30, 2010 8:12 UTC (Tue)
by nix (subscriber, #2304)
[Link] (1 responses)
Posted Mar 30, 2010 17:06 UTC (Tue)
by njs (subscriber, #40338)
[Link]
The most important difference is that ropes are happy -- indeed, delighted -- to store very long strings inside a single tree node when they have the chance, because their goal is just to amortize mutation operations, not to provide efficient access by semi-arbitrary index rules.
UTF8 vs UTF32
UTF8 vs UTF32
UTF8 vs UTF32
UTF8 vs UTF32
I've never actually heard of a good reason for anyone to want
that.Iterators vs indices
Most languages (and SQL in particular) work exclusively using string
indexes. You cannot use an iterator in a functional context. SQL doesn't
have string iterators and never will. Iterators are a procedural thing.
Numeric indices are the only language independent option available for
specifying and extracting parts of strings. That is why they are universal.
I wouldn't use a language that didn't support them, in large part due to
the difficulty of translating idioms from languages that do.
As a consequence, anything that needs to be done to efficiently present a
string of any reasonable size as a linear array of _characters_ (or at
least code points) is the language's problem, not the programmer's. That is
the approach SQL takes, and from an application programmer's point of view
it works very well.
That is not to say that an iterator interface shouldn't be provided (in
procedural languages) as well, but of the two an index based interface is
more fundamental.
I think you must be using a funny definition of functional. There is absolutely nothing that
prevents an iterator-based API from working in a functional language. And of course most
functional languages have many such APIs.
Iterators vs indices
Iterators vs indices
Iterators vs indices
Iterators vs indices
Iterators vs indices
Iterators vs indices
Iterators vs indices
Iterators vs indices
Iterators vs indices