Resetting PHP 6

Posted Mar 24, 2010 18:35 UTC (Wed) by ikm (guest, #493)
Parent article: Resetting PHP 6

UCS-4 (some call it UTF-32) allows random access to individual code points, but this access isn't really always needed, and the waste is great. UTF-16 has none of the advantages of UTF-8, but all of its disadvantages. It seems logical therefore to operate almost solely on UTF-8. For that, the language should have utf8 string iterators, store the string's logical length, and so on. The problem is, to make sure no programmer errors slip through, one should exclude any support for direct 8-bit string manipulations from it. You should not, e.g., be able to cut such strings at arbitrary 8-bit boundaries, and shouldn't even know their 8-bit sizes. The string would then actually feel like a UCS-4 string -- only without random access. This feels quite limiting, but I think that would still be the right approach. If an 8-bit string is needed, there should be ways to convert/project -- but the distinction must be stark. If, on the other hand, direct random access to UCS-4 data is required, the string could temporarily convert itself to UCS-4 under the hood, and then later shrink back to UTF-8.

This would look like the right approach to me.
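To make the idea concrete, here is a rough sketch in Python of what such a string type might look like. The class name Utf8Text and all of its methods are hypothetical, invented for illustration only; this is not how PHP or any other language actually does it. It assumes the stored bytes are always valid UTF-8, which holds here because they come from str.encode.

```python
class Utf8Text:
    """Hypothetical string type: UTF-8 storage, code-point-only interface."""

    def __init__(self, text: str):
        self._data = text.encode("utf-8")  # internal storage is UTF-8
        self._length = len(text)           # logical length, in code points

    def __len__(self) -> int:
        # Only the logical length is exposed, never the byte size.
        return self._length

    def __iter__(self):
        # Walk the buffer one code point at a time; the lead byte
        # tells us how many bytes the current sequence occupies.
        data, i = self._data, 0
        while i < len(data):
            lead = data[i]
            if lead < 0x80:
                n = 1
            elif lead >> 5 == 0b110:
                n = 2
            elif lead >> 4 == 0b1110:
                n = 3
            else:
                n = 4
            yield data[i:i + n].decode("utf-8")
            i += n

    def to_ucs4(self) -> list[int]:
        # Explicit conversion for callers that need random access.
        return [ord(ch) for ch in self._data.decode("utf-8")]

    def to_bytes(self) -> bytes:
        # Explicit, clearly named projection to an 8-bit string.
        return self._data


greeting = Utf8Text("Привет, мир")
code_points = list(greeting)
```

Note that there is deliberately no slicing by byte offset: the only ways to get at the raw bytes or at random access are the two explicit conversion methods.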

Resetting PHP 6

Posted Mar 24, 2010 19:26 UTC (Wed) by mrshiny (guest, #4266)

Just want to point out that a useful language will always have ways for programmers to screw up character encodings. In Java a char is distinct from a byte, and yet people do someString.getBytes("UTF-8") (to get the bytes when utf-8 encoded) then proceed to treat each byte as if it represents a letter. Since you can't take away the ability to write character data into an arbitrary encoding, you can't take away this particular failure mode. Character encodings should be taught in school as an object lesson in the consequences of data storage decisions.
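The same failure mode is easy to reproduce outside Java. A minimal Python illustration, where the variable names are mine and the sample word is arbitrary:

```python
text = "Привет"                  # 6 letters
encoded = text.encode("utf-8")   # 12 bytes, two per Cyrillic letter

# The mistake: treating every byte as if it were one letter.
wrong = "".join(chr(b) for b in encoded)

# The correct way back: decode the byte sequence as a whole.
right = encoded.decode("utf-8")

print(len(text), len(encoded), len(wrong))
print(right == text, wrong == text)
```

The byte-per-letter version ends up with twice as many "letters" as the original and none of them are Cyrillic.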

Resetting PHP 6

Posted Mar 26, 2010 4:02 UTC (Fri) by spitzak (guest, #4593)

You are seriously overestimating the damage of "cutting a string at an arbitrary byte".

First of all, the primary thing that happens in real programs is that the halves of the string get pasted back together, such as when fixed-sized blocks are copied from one file to another. That does not destroy UTF-8 at all.
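A small Python sketch of that block-copy case. The helper copy_in_blocks and the block size of 5 are made up for the example; the blocks are cut with no regard for character boundaries and then joined again:

```python
def copy_in_blocks(data: bytes, block_size: int) -> bytes:
    """Split data into fixed-size blocks, then paste them back together."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    return b"".join(blocks)


original = "Резюме: naïve café"
data = original.encode("utf-8")

# Blocks of 5 bytes cut through several multi-byte sequences...
copied = copy_in_blocks(data, 5)

# ...but the reassembled byte stream is identical, so it decodes fine.
round_trip_ok = copied.decode("utf-8") == original
```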

Second, why is breaking a "character" really such a disaster? Why are we not worried about breaking "words"? If I split an English word in half I will probably get two non-words. How can I possibly safely use a computer language that allows such things? Why, it seems hard to believe that word processors could be written when the computer would allow this horrible ability! /sarcasm

Worrying about "breaking characters" is actually stupid, and is being used as an excuse to defend the bone-headed decision to use "wide characters".

Resetting PHP 6

Posted Mar 26, 2010 9:51 UTC (Fri) by ikm (guest, #493)

> First of all, the primary thing that happens in real programs is that the halves of the string get pasted back together

No, your example doesn't count -- this isn't string splitting, your resulting strings are intact there. The primary thing that happens in real programs is that they try to shorten the string, e.g. make "A very long string" into something like "A very lo...", to squeeze it in e.g. a fixed space of 12 characters, or do similar transformations. Those transformations can't be done correctly on raw 8-bit utf-8 strings.
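To illustrate the shortening case, a quick Python sketch. The function shorten is a hypothetical helper, and it counts code points, which is enough for this example (it makes no attempt to handle combining sequences):

```python
def shorten(text: str, width: int) -> str:
    """Cut text to at most `width` code points, ending in '...' if cut."""
    if len(text) <= width:
        return text
    return text[:width - 3] + "..."


english = shorten("A very long string", 12)
russian = shorten("Очень длинная строка", 12)

# The same cut done on raw bytes: 9 bytes is only four and a half
# Cyrillic letters, so the last character is chopped in the middle.
raw_cut = "Очень длинная строка".encode("utf-8")[:9]
raw_shown = raw_cut.decode("utf-8", errors="replace") + "..."
```

Here english is "A very lo..." and russian is "Очень дли...", both 12 characters wide. The byte-level cut keeps only "Очен" plus a replacement character, so it is both shorter than intended and damaged.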

> why is breaking a "character" really such a disaster? Why are we not worried about breaking "words"?

Because you're breaking the underlying encoding of the characters, not the characters themselves. The resulting bitstream would be an invalid utf-8 sequence. Parts of English words you split would be rendered intact just fine, but damaged and invalid utf-8 would either result in no display at all, or in program/library barf. You can safely combine valid utf-8 sequences together, but you can't arbitrarily cut them and expect the result to be valid.
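Both halves of that last claim can be checked directly. A small Python sketch, where is_valid_utf8 is a helper written for this example and simply relies on the standard decoder rejecting malformed input:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


yes = "да".encode("utf-8")    # 4 bytes
no = "нет".encode("utf-8")    # 6 bytes

# Combining valid sequences is always safe.
combined_ok = is_valid_utf8(yes + no)

# Cutting at an arbitrary byte offset is not: 3 bytes into "да"
# lands in the middle of the second letter.
whole = yes + no
head, tail = whole[:3], whole[3:]
head_ok = is_valid_utf8(head)   # ends with a truncated sequence
tail_ok = is_valid_utf8(tail)   # starts with a stray continuation byte
```

Neither half is valid on its own, even though head + tail is byte-for-byte the original valid string.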

> Worrying about "breaking characters" is actually stupid, and is being used as an excuse to defend the bone-headed decision to use "wide characters".

As a Russian, I actually know how important this is. I've seen enough non-utf8 aware programs and observed enough of their horrendous problems to understand the importance of wide characters. What makes you so bold in your statements? You seem to know nothing about the topic.