LWN: Comments on "Resetting PHP 6"

Resetting PHP 6

nivas — Wed, 15 Jun 2011 07:16:12 +0000

Hi, when will it be released?

UTF-16

dvdeug — Mon, 27 Dec 2010 02:01:56 +0000

You're writing in a language with one of the most screwed up orthographies in existence. Convince English speakers to use a reasonable orthography, and then you can start complaining about the rest of the world.

Not only that, some of these scripts you're not supporting are wonders. Just because Arabic is always written in cursive and thus needs complex script support doesn't mean that it's not an alphabet that's perfectly suited to its language, one that is in fact easier for children to learn than the English alphabet is for English speakers.

Supporting Chinese or Arabic is like any other feature. You can refuse to support it, but if your program is important, patches or forks are going to float around to fix that. Since Debian and other distributions are committed to supporting those languages, the version of the program that will be in the distributions will be the forked version. If there is no fork, they may just not include it. That's the cost you'll have to pay for ignoring the features they want.

If I had mod points, I'd give you one.

qu1j0t3 — Thu, 15 Apr 2010 10:45:15 +0000

Well said.

McLuhan

qu1j0t3 — Thu, 15 Apr 2010 09:27:07 +0000

Anyone who wants to explore the topic of comparative alphabets further may find McLuhan's works, such as The Gutenberg Galaxy, rewarding.

Resetting PHP 6

spitzak — Wed, 31 Mar 2010 17:49:31 +0000

I strongly agree with Forth's solution. The postscript paper describes exactly how easy it was to use UTF-8 if you stop panicking about "characters" and realize that they are just like words and nobody worries that you can't find the ends of words in O(1) time. The listing of the number of lines changed should be very instructive. I hope everybody saying I am wrong might read the paper.

Forth's solution appears to have an iterator return an object that they call an "xchar", which is a Unicode code point. I believe such an object is easily extended to return "UTF-8 encoding error" as a different value. You can also make different iterators to return composed or decomposed characters, and to automatically convert UTF-8 errors to CP1252 equivalents, which (though unsafe) will remove any need to "identify the character encoding" since this will reliably recognize UTF-8, ISO-8859-1, and CP1252 automatically, even if variations are pasted together.
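
A rough sketch of that idea in Python rather than Forth, for anyone who wants to poke at it. The names iter_xchars and decode_lenient are made up for illustration and are not from the Forth proposal; the fallback to ISO-8859-1 for the five bytes CP1252 leaves undefined is my own assumption.

```python
def iter_xchars(data: bytes):
    """Yield ("char", code_point) or ("error", byte) while walking UTF-8 bytes."""
    i = 0
    while i < len(data):
        lead = data[i]
        # Expected sequence length from the lead byte; 0 marks a byte that
        # cannot start a sequence (stray continuation byte or invalid lead).
        if lead < 0x80:
            length = 1
        elif 0xC2 <= lead <= 0xDF:
            length = 2
        elif 0xE0 <= lead <= 0xEF:
            length = 3
        elif 0xF0 <= lead <= 0xF4:
            length = 4
        else:
            length = 0
        try:
            if length == 0:
                raise ValueError("not a lead byte")
            # Strict decoding rejects truncated, overlong and surrogate forms.
            text = data[i:i + length].decode("utf-8")
        except ValueError:  # UnicodeDecodeError is a subclass of ValueError
            yield ("error", lead)
            i += 1
        else:
            yield ("char", ord(text))
            i += length


def decode_lenient(data: bytes) -> str:
    """Decode UTF-8, mapping each error byte to its CP1252 character."""
    out = []
    for kind, value in iter_xchars(data):
        if kind == "char":
            out.append(chr(value))
            continue
        raw = bytes([value])
        try:
            out.append(raw.decode("cp1252"))
        except UnicodeDecodeError:
            # Assumption: bytes undefined in CP1252 fall back to ISO-8859-1.
            out.append(raw.decode("latin-1"))
    return "".join(out)
```

Feeding it "café" encoded as UTF-8 followed by "café" encoded as CP1252 should give back "cafécafé", which is the "variations pasted together" case. The byte string itself is never modified, so the errors are still there for any caller that wants to see them.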

Resetting PHP 6

anton — Wed, 31 Mar 2010 16:50:51 +0000

Strings should be UTF-8 and string[n] should return the n'th byte in the string. That is the TRUTH and Microsoft and Python and PHP and Java and everybody else is WRONG.

I guess Forth does not belong to "everybody else", then, because we are going in the direction you suggest. The ideas are probably best explained in an early paper, but if you want to know where this went, look at the current (frozen) proposal.

Resetting PHP 6

Darkmere — Wed, 31 Mar 2010 09:51:49 +0000

Indeed, and this makes me quite sad. Because really, it feels as if Perl slipped off the map and into la-la-land. Not of the Duke Nukem Forever-style, but by setting the system up to a situation where you cannot deliver Perl6, because it's some immaterial beast that has yet to be able to exist.

Resetting PHP 6

roerd — Wed, 31 Mar 2010 08:48:15 +0000

> Rakudo is something different, a Perl-like language, perhaps a steppingstone for future Perl technology. But it isn't Perl 6.0 to this member of the audience. It is Rakudo. Not Perl.

By that definition there will never be a Perl 6.0, because Perl 6 is a specification, not an implementation. Though of course you're right that at this time Rakudo can't be an implementation of Perl 6.0, because the specification is still a moving target.

UTF-16

j16sdiz — Wed, 31 Mar 2010 04:35:13 +0000

> > Are you seriously claiming that top-to-bottom is "the clearly preferable" writing mode for modern Chinese speakers because that's what you saw being used in a restaurant menu?

> It may not be "clearly preferable", but it certainly is still widely used at least in Hong Kong, Taiwan and Japan. Just go to any bookstore or newspaper stand in these three places and see for yourself.

As a Chinese living in Hong Kong I can tell you this:
Most of the Chinese characters are in BMP. Some of those outside BMP are used in Hong Kong, but they are not
as important as you think -- most of them can be replaced with something in BMP (and that's how we have been
doing this before the HKSCS standard)

And yes, you can have Confucius in BMP. (Just like how you have KJV bible in latin1 -- replace those long-S
with th, and stuff like that)

Iterators vs indices

njs — Wed, 31 Mar 2010 04:30:52 +0000

> I understood the poster to mean using pointers for individual characters (how else can you do inserts at any point in the string without having to know how it's structured)

I'm afraid I don't understand at all. I *am* that poster, and the data structure I described can do O(log n) inserts without pointers to individual characters. Perhaps I am just explaining badly?

Iterators vs indices

dlang — Wed, 31 Mar 2010 02:05:22 +0000

I understood the poster to mean using pointers for individual characters (how else can you do inserts at any point in the string without having to know how it's structured)

google wave uses the jabber protocol, but in its documents it doesn't store words, it stores the letters individually, grouped together so that they can be changed individually (or so it was explained by the google rep giving the presentation I was at)

Iterators vs indices

njs — Tue, 30 Mar 2010 17:06:07 +0000

No, interestingly -- they are more complicated and less like a conventional tree structure than one would think: http://www.sgi.com/tech/stl/ropeimpl.html

The most important difference is that ropes are happy -- indeed, delighted -- to store very long strings inside a single tree node when they have the chance, because their goal is just to amortize mutation operations, not to provide efficient access by semi-arbitrary index rules.

Iterators vs indices

nix — Tue, 30 Mar 2010 08:12:31 +0000

Didn't the GNU C++ ext/rope work in exactly this way?

Iterators vs indices

njs — Tue, 30 Mar 2010 07:46:45 +0000

> the biggest reason nobody stores strings that way is the overhead. it requires many pointers which end up making UTF-32 look compact by comparison.

The memory overhead is certainly not as high as UTF-32 (at least for strings where UTF-8 has lower overhead than UTF-32 to start with) -- you need something like 3*log_2(n) words of overhead, but n is the number of "chunks", not bytes, and a reasonable chunk-size is in the hundreds of bytes, at least. Within a chunk you revert to linear behavior, but that's not so bad; IIUC on modern CPUs linear-time is not much worse than constant-time when it comes to accessing short arrays.

Most strings are short, and with proper tuning they'd probably fit into one chunk anyway, so the overhead is nearly nil.

But you're right, there is some overhead -- not that this stops people from using scripting languages -- and a lot of tricky implementation, and simple solutions are often good enough.

I don't understand what you mean about Google Wave, though. A) Isn't it mostly a protocol? Where do string storage APIs come in? B) It's exactly the non-trivial uses -- where you have large, mutable strings -- that arrays and linear-time iteration don't scale to.

Iterators vs indices

dlang — Tue, 30 Mar 2010 07:11:52 +0000

the biggest reason nobody stores strings that way is the overhead. it requires many pointers which end up making UTF-32 look compact by comparison.

besides, as noted earlier in this thread, most uses of strings really don't care how they break apart, they are almost always used as-is (or at most with one step of parsing, usually on whitespace, on input); as such, anything more than the most compact representation ends up costing significantly more in memory size (and therefore cache space) than you gain with any string manipulation that you do

Google Wave actually stores strings the way you are suggesting, or did when I saw the presentation on it last year, but I think that doing so will keep it from being used for anything beyond trivial uses.

Iterators vs indices

njs — Tue, 30 Mar 2010 06:57:21 +0000

Another option is to store a string as a tree structure, where the leaves are some reasonable-sized chunks of bytes (to amortize storage overhead), and the tree nodes are annotated with the number of characters/bytes/code points/lines/whatever that occur underneath them. This allows random O(log n) access by character/byte/... offset. (You can maintain several different sorts of counts, and get fast access for all of them in the same data structure.) You also get cheap random insertion/deletion, which is an important operation for some tasks (e.g., editor buffers!) but horrendously slow for arrays.

For some reason nobody does this, though.
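
To make the structure concrete, here is a minimal Python sketch under some simplifying assumptions: the tree is built once and is read-only (insertion, deletion and rebalancing are left out), leaves are cut on code point boundaries, and only two counts are kept (bytes and code points). Leaf, Node, build, char_at and byte_offset are names invented for the example.

```python
class Leaf:
    """A chunk of UTF-8 bytes, annotated with its byte and code point counts."""
    def __init__(self, text: str):
        self.data = text.encode("utf-8")
        self.nbytes = len(self.data)
        self.nchars = len(text)


class Node:
    """An inner node whose counts are the sums over both subtrees."""
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.nbytes = left.nbytes + right.nbytes
        self.nchars = left.nchars + right.nchars


def build(text: str, chunk_chars: int = 256):
    """Cut text into leaves and pair them up level by level into a tree."""
    level = [Leaf(text[i:i + chunk_chars])
             for i in range(0, len(text), chunk_chars)] or [Leaf("")]
    while len(level) > 1:
        paired = [Node(level[i], level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])  # odd leftover moves up unchanged
        level = paired
    return level[0]


def char_at(tree, index: int) -> str:
    """Return the code point at `index`: O(log n) descent, linear in one leaf."""
    if not 0 <= index < tree.nchars:
        raise IndexError(index)
    node = tree
    while isinstance(node, Node):
        if index < node.left.nchars:
            node = node.left
        else:
            index -= node.left.nchars
            node = node.right
    return node.data.decode("utf-8")[index]


def byte_offset(tree, index: int) -> int:
    """Return the byte offset at which code point `index` starts."""
    if not 0 <= index <= tree.nchars:
        raise IndexError(index)
    node, offset = tree, 0
    while isinstance(node, Node):
        if index < node.left.nchars:
            node = node.left
        else:
            index -= node.left.nchars
            offset += node.left.nbytes
            node = node.right
    return offset + len(node.data.decode("utf-8")[:index].encode("utf-8"))
```

The same descent works for any count you annotate the nodes with (lines, grapheme clusters, ...); a real editor buffer would also need split and join operations that keep the counts up to date.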

UTF-16

paulj — Sun, 28 Mar 2010 04:22:54 +0000

Good point. :)

This was a Han Chinese person from north-eastern China, i.e. someone from
the dominant cultural group in China, from the more developed part of China.
I don't know how representative their education was, but I suspect there's
at least some standardisation and uniformity.

Backwards compatibility

man_ls — Sat, 27 Mar 2010 22:53:05 +0000

The big difference between Perl 6, PHP 6, and Python3 is that Python3 is out right now, available, has a bunch of transition tools, code is somewhat backwards compatible to 2.6, and it's had a couple stabilizing releases.

But "somewhat backwards compatible" is not good enough. For any non-trivial applications you still need to test everything again, and probably do some coding + testing + deploying. In business settings it translates to money and pains; in volunteer projects just pains.

Even when backwards compatibility is a requirement, like for Java (where the rare breakages are clearly signaled and known by everyone), testing time for new versions has to be allocated. With Python, migrations are a showstopper for most people unless the new version somehow provides great advantages (which for me it doesn't). For developers of the language itself and the runtime, the supposed benefits of not having to be backwards compatible are probably offset by having to support two or three versions indefinitely.

UTF-16

man_ls — Sat, 27 Mar 2010 22:39:42 +0000

China has 1,325,639,982 inhabitants, according to Google. That is more than the whole of Europe, Russia, US, Canada and Australia combined. Even if there is a central government, we can assume a certain cultural diversity.

Resetting PHP 6

bronson — Sat, 27 Mar 2010 20:06:54 +0000

There's a difference between "available for use as an experiment" and "available for use as Perl." If perl.org doesn't link to Perl6 from its home page, then one would guess that Perl6 isn't available for general use.

And one would be right.

No need to get all insulty with big shiny download buttons.

Resetting PHP 6

HelloWorld — Sat, 27 Mar 2010 12:13:12 +0000

It wasn't. On
http://www.shacknews.com/onearticle.x/61747
it says:
"we've never said that Duke Nukem Forever has ceased development,"

Resetting PHP 6

jra — Sat, 27 Mar 2010 00:52:44 +0000

Hear hear. I merged in the original wide character support for Samba, done by the Japanese. Eventually we moved to a utf8-based solution (coded by tridge, naturally :-) with iterators for manipulating the strings. It's the only thing that makes sense.

Jeremy.

Resetting PHP 6

cmccabe — Sat, 27 Mar 2010 00:37:00 +0000

ln -s /usr/bin/ruby /usr/bin/php6

Problem solved; who's up for lunch?

Iterators vs indices

spitzak — Fri, 26 Mar 2010 22:09:59 +0000

If you really can't get away from the integer index, a solution is to have a string format that stores the most recent index computed and where it was in the string. Then when asked for a new index it will move from that previous position to the new one if the previous position is less than 2x the new one.

For the vast majority of cases where each integer starting from zero is used to get the "character" this would put the implementation back to O(1). And it would allow more complex accessors, such as "what error is here".
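
A small Python sketch of the cached-index idea, assuming the stored bytes are valid UTF-8. The class name CachedIndexString is invented, and the "is the cached position close enough" test is my reading of the rule above: start from the cache whenever it is no farther from the target than the start of the string is.

```python
class CachedIndexString:
    """UTF-8 bytes plus a one-entry cache of (code point index, byte offset)."""

    def __init__(self, text: str):
        self.data = text.encode("utf-8")
        self.cached_char = 0
        self.cached_byte = 0

    def _is_continuation(self, pos: int) -> bool:
        return (self.data[pos] & 0xC0) == 0x80

    def byte_offset(self, char_index: int) -> int:
        """Byte offset of code point `char_index`, walking from the cache."""
        if char_index < 0:
            raise IndexError(char_index)
        if abs(char_index - self.cached_char) <= char_index:
            ci, bi = self.cached_char, self.cached_byte
        else:
            ci, bi = 0, 0  # the start of the string is closer
        while ci < char_index:  # step forward one code point at a time
            if bi >= len(self.data):
                raise IndexError(char_index)
            bi += 1
            while bi < len(self.data) and self._is_continuation(bi):
                bi += 1
            ci += 1
        while ci > char_index:  # step backward one code point at a time
            bi -= 1
            while self._is_continuation(bi):
                bi -= 1
            ci -= 1
        if bi >= len(self.data):
            raise IndexError(char_index)
        self.cached_char, self.cached_byte = ci, bi
        return bi

    def __getitem__(self, char_index: int) -> str:
        start = self.byte_offset(char_index)
        end = start + 1
        while end < len(self.data) and self._is_continuation(end):
            end += 1
        return self.data[start:end].decode("utf-8")
```

A loop such as "for i in range(n): s[i]" then does one short step per access instead of rescanning from the beginning each time.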

UTF-16

paulj — Fri, 26 Mar 2010 21:24:43 +0000

Yes, I gather formal pinyin has accents to differentiate the tones, but on a
computer you just enter the roman chars and the computer gives you an
appropriate list of glyphs to pick (with arrow key or number).

And yes they are. Shame there's much misunderstanding (in both directions)
though. Anyway, OT.. ;)

UTF-16

spacehunt — Fri, 26 Mar 2010 19:31:06 +0000

I'm a native Cantonese speaker in Hong Kong, hopefully my observations would serve as useful reference...

> 25 centuries of linguistic evolution separate us from Confucius. Suppose you can display all the ancient characters properly; how much would that really help a modern Chinese speaker understand the meaning of the text? Does knowing the Latin alphabet help a modern French speaker understand text written in Classical Latin?

A lot of Chinese characters in modern usage are outside of the BMP:
http://www.mail-archive.com/linux-utf8@nl.linux.org/msg00...

> Are you seriously claiming that top-to-bottom is "the clearly preferable" writing mode for modern Chinese speakers because that's what you saw being used in a restaurant menu?

It may not be "clearly preferable", but it certainly is still widely used at least in Hong Kong, Taiwan and Japan. Just go to any bookstore or newspaper stand in these three places and see for yourself.

Resetting PHP 6

chromatic — Fri, 26 Mar 2010 19:05:07 +0000

Aren't you making an ontological argument (Perl 6 doesn't exist, because it hasn't been released, because the text on a website says that Perl 5.10.1 is the current version of Perl) based on a definitional fallacy (you will believe that Rakudo is a Perl 6 implementation when the text on a specific website changes)?

Perl.com didn't mention Perl 5.10.1 for several months. Which has precedence, perl.org or perl.com? Which has precedence with regard to Perl 6, perl.org or perl6.org?

I can understand that you don't want to download or use a Perl 6 implementation such as Rakudo until it meets certain criteria, and I can understand that a big shiny Download Now button is such a criterion for certain classes of users, but I don't understand how an HTML change to add a download button somehow flips the switch from "The software does not exist as its developers claim it does" to "Oh, now it really exists," at least for a project which isn't itself solely a download button.

UTF-16

chuckles — Fri, 26 Mar 2010 15:37:04 +0000

I'm in China right now learning Mandarin so I can comment on this. Children learn pinyin at the same time as the characters. The Pinyin is printed over the characters and is used to help with pronunciation. While dictionaries targeted towards little children and foreigners are indexed by pinyin, normal dictionaries used by adults are not. Dictionaries used by adults are indexed by the radicals.
While pinyin is nice, there are no tone markers. So you have a 1 in 5 chance (4 tones plus neutral) of getting it right.
You are correct that pinyin is the input system on computers, cell phones, everything electronic, in mainland China. Taiwan has its own system. Also, Chinese are very proud people; characters aren't going anywhere for a LONG time.

Resetting PHP 6

marcH — Fri, 26 Mar 2010 14:41:05 +0000

Yes: UTF-8 is a brilliant backward-compatibility hack that allows software developers to offload their homework to someone else later down the road. It's a truly admirable hack.

http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

I guess I'm just not so cavalier....

foom — Fri, 26 Mar 2010 13:51:28 +0000

I've kinda wondered how exaggerated this problem is. I mean, the default on Windows is threaded -- do most modules blow up by default on Windows? That seems like a problem that their authors would want to fix.

Iterators vs indices

foom — Fri, 26 Mar 2010 13:49:52 +0000

I think you must be using a funny definition of functional. There is absolutely nothing that prevents an iterator-based API from working in a functional language. And of course most functional languages have many such APIs.

Let's take a traditional example: singly-linked-lists are a quite common data-structure in functional (or mostly-functional) languages like Haskell, Scheme, etc. Yet, you don't index them by position (that of course is available if you need it, but it's time O(n), so you don't normally want to use it). Instead, you use an iterator, which in this case is a pointer to the current element.

If anyone suggested that the primary access method for a singly linked list should be by integer position, they'd be rightly told that's insane -- iterating over the list would take O(n^2)!

Now, maybe your real point was simply that existing languages already have a poorly-designed Unicode String API that they have to keep compatibility with -- and that API doesn't include iterators. So, they therefore have constraints they need to preserve, such as O(1) access by character index, because existing programs require it.

I won't argue with that, but I still assert it's not actually a useful feature for a unicode string API, in absence of the API-compatibility requirement.

Resetting PHP 6

Darkmere — Fri, 26 Mar 2010 13:20:58 +0000

I'll believe it when I can go to perl.org and see "current version" being something other than 5.xx.x, perhaps even perl 6.0.0.

Until then, Perl is at 5.x.

Rakudo is something different, a Perl-like language, perhaps a steppingstone for future Perl technology. But it isn't Perl 6.0 to this member of the audience. It is Rakudo. Not Perl.

Resetting PHP 6

ringerc — Fri, 26 Mar 2010 12:29:09 +0000

1. Much faster due to no need to translate on input/output

... if the surrounding systems to which I/O is done (the file system, other library APIs, network hosts, etc) are in fact using a utf-8 encoding themselves. Alas, even on many modern systems non-utf-8 encodings are very common.

2. Able to use existing apis to name files and parse text, rather than having to make an all-new api that takes "wide characters".

Not safely. The use of existing APIs with new encodings is a HUGE source of bugs in software. I've wasted vast amounts of time tracking down and fixing cases where software fails to do external->internal encoding conversion on input, fails to do internal->external encoding conversion on output, converts already-converted data (mangling it horribly by re-interpreting it as being in the wrong encoding), etc. Using utf-8 with existing encoding-agnostic APIs is a blight on software engineering. Any API should take either a properly typed argument that's specified to ONLY hold text of a known encoding - possibly single fixed encoding like utf-8, or possibly a bytes+encoding tuple structure. If it takes a raw "byte string" it should take a second argument specifying what encoding that data is in.
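
For illustration, a bytes+encoding tuple of the kind described above might look like this in Python. EncodedBytes and its methods are hypothetical names, not an existing API; the point is only that the encoding travels with the bytes, so nothing has to guess.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncodedBytes:
    """Raw bytes that always carry the name of the encoding they are in."""
    raw: bytes
    encoding: str

    def to_text(self) -> str:
        # Decoding uses the recorded encoding, never a guess or a default.
        return self.raw.decode(self.encoding)

    def transcode(self, target: str) -> "EncodedBytes":
        # Convert exactly once, and record the new encoding with the result.
        return EncodedBytes(self.to_text().encode(target), target)
```

The double-conversion bug falls out directly if you lie about the encoding: UTF-8 bytes for "é" wrapped as EncodedBytes(..., "latin-1") come back from to_text() as the familiar two-character garbage instead of "é".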

The fact that POSIX file systems and APIs don't care about "text" with known encoding, only "strings of bytes", is an incredible PITA. Ever had the fun of backing up a network share used by multiple hosts each of which like to use different text encodings? Ever then had to find and restore a single file within that share without knowing what encoding it was in and thus what the byte sequence of the file name was, only the "text" of the file name? ARGH.

"wide" APIs are painful, but they're more than worth it in the bugs and data corruption they prevent.

That's not to say that UTF-16 is better than UTF-8 or vice versa. Rather, "single known encoding enforced" is better than "it's just some bytes".

UTF-16

mpr22 — Fri, 26 Mar 2010 11:16:47 +0000

I can read Chinese writings from the 1st Century; can you use today's English spellings or words to read English writings from the 13th Century?

13th Century English (i.e. what linguists call "Middle English") should be readable-for-meaning by an educated speaker of Modern English with a few marginal glosses. Reading-for-sound is almost as easy (95% of it is covered by "Don't silence the silent-in-Modern-English consonants. Pronounce the vowels like Latin / Italian / Spanish instead of like Modern English").

My understanding is that the Greek of 2000 years ago is similarly readable to fluent Modern Greek users. (The phonological issues are a bit trickier in that case.)

In both cases - and, I'm sure, in the case of classical Chinese - it would take more than just knowing the words and grammar to receive the full meaning of the text. Metaphors and cultural assumptions are tricky things.

Resetting PHP 6

ikm — Fri, 26 Mar 2010 09:52:01 +0000

It was officially cancelled.

Resetting PHP 6

ikm — Fri, 26 Mar 2010 09:51:18 +0000

> First of all, the primary thing that happens in real programs is that the halves of the string get pasted back together

No, your example doesn't count -- this isn't string splitting, your resulting strings are intact there. The primary thing that happens in real programs is that they try to shorten the string, e.g. make "A very long string" into something like "A very lo...", to squeeze it in e.g. a fixed space of 12 characters, or do similar transformations. Those transformations can't be done correctly on raw 8-bit utf-8 strings.

> why is breaking a "character" really such a disaster? Why are we not worried about breaking "words"?

Because you're breaking the underlying encoding of the characters, not the characters themselves. The resulting bitstream would be an invalid utf-8 sequence. Parts of English words you split would be rendered intact just fine, but damaged and invalid utf-8 would either result in no display at all, or in program/library barf. You can safely combine valid utf-8 sequences together, but you can't arbitrarily cut them and expect the result to be valid.
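
A short Python illustration of the difference, assuming valid UTF-8 input. truncate_utf8 is a made-up helper that backs up to the previous code point boundary before cutting; it only keeps the bytes decodable and knows nothing about combining characters or display width.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut valid UTF-8 to at most max_bytes without splitting a code point."""
    if len(data) <= max_bytes:
        return data
    cut = max_bytes
    # If the byte right after the cut is a continuation byte (10xxxxxx),
    # the cut would land inside a sequence, so move it back to the lead byte.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]
```

With a Russian string every letter is two bytes, so a plain data[:5] ends in half a letter and a strict decode of it raises UnicodeDecodeError, while truncate_utf8(data, 5) returns the first two letters intact.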

> Worrying about "breaking characters" is actually stupid, and is being used as an excuse to defend the bone-headed decision to use "wide characters".

As a Russian, I actually know how important this is. I've seen enough non-utf8 aware programs and observed enough of their horrendous problems to understand the importance of wide characters. What makes you so bold in your statements? You seem to know nothing about the topic.

Resetting PHP 6

spitzak — Fri, 26 Mar 2010 04:11:18 +0000

No, UTF-8 is preferable.

The truly unavoidable technical reason is that only UTF-8 can safely encode UTF-8 errors. Lossless transmission of data is a requirement for safe and bug-free computing.

Other reasons:

1. Much faster due to no need to translate on input/output

2. Able to use existing apis to name files and parse text, rather than having to make an all-new api that takes "wide characters".

3. Often enormously simpler as error detection can be deferred until the string is interpreted.

4. If errors are preserved until display, they can be replaced with more user-friendly substitutes (such as the ISO-8859-1 character for each byte). This is not safe if errors must be replaced as part of data processing.

5. High-speed byte-based search algorithms work. Tables used by these would go up in size by a factor of 256^3 or more if they were rewritten to use 16-bit units.

6. For almost all real text files UTF-8 is shorter than UTF-16. This is not a big deal but some people think it is important.
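
On the byte-based search point, a tiny Python check, assuming both haystack and needle are valid UTF-8 (find_utf8 and is_char_start are names invented here). Because lead bytes and continuation bytes never overlap in value, a plain byte search cannot match in the middle of a character.

```python
def is_char_start(data: bytes, pos: int) -> bool:
    """True if data[pos] is not a UTF-8 continuation byte (10xxxxxx)."""
    return (data[pos] & 0xC0) != 0x80


def find_utf8(haystack: bytes, needle: str) -> int:
    """Byte offset of needle in haystack, or -1; an ordinary byte search."""
    return haystack.find(needle.encode("utf-8"))
```

Searching the UTF-8 bytes of "naïve café, 日本語のテキスト" for "日本語" returns an offset that is_char_start accepts, with no decoding step anywhere.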

Resetting PHP 6

spitzak — Fri, 26 Mar 2010 04:02:54 +0000

You are seriously overestimating the damage of "cutting a string at an arbitrary byte".

First of all, the primary thing that happens in real programs is that the halves of the string get pasted back together, such as when fixed-sized blocks are copied from one file to another. That does not destroy UTF-8 at all.
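
A quick Python demonstration of the block-copy case. split_blocks and decode_stream are names made up for the example; codecs.getincrementaldecoder is the standard-library way to decode a stream that arrives in arbitrary pieces.

```python
import codecs


def split_blocks(data: bytes, block_size: int) -> list:
    """Cut data into fixed-size blocks without looking at its contents."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def decode_stream(blocks) -> str:
    """Decode blocks in order; the decoder buffers any split sequence."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = [decoder.decode(block) for block in blocks]
    parts.append(decoder.decode(b"", final=True))  # flush, erroring if cut short
    return "".join(parts)
```

Individual blocks may well be invalid on their own (a 5-byte block of Cyrillic text ends in half a letter), but b"".join(blocks) is byte-for-byte the original, and decode_stream(blocks) yields the original text.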

Second, why is breaking a "character" really such a disaster? Why are we not worried about breaking "words"? If I split an English word in half I will probably get two non-words. How can I possibly safely use a computer language that allows such things? Why, it seems hard to believe that word processors could be written when the computer allows this horrible ability! /sarcasm

Worrying about "breaking characters" is actually stupid, and is being used as an excuse to defend the bone-headed decision to use "wide characters".

Resetting PHP 6

spitzak — Fri, 26 Mar 2010 03:51:40 +0000

Python 3 is doing the EXACT SAME STUPID MISTAKE. It is going to be a disaster and the developers are too blinded to realize it.

There will be the annoying overhead of converting every bit of data on input and output. But far more important will be the fact that errors in the UTF-8 will either be lost or will cause exceptions to be thrown, producing a whole universe of ugly bugs and DOS attacks. This is going to suck bad!

Strings should be UTF-8 and string[n] should return the n'th byte in the string. That is the TRUTH and Microsoft and Python and PHP and Java and everybody else is WRONG.

But how do I get the N'th character???? You are probably sputtering this nonsense question right now, right? You need to ask yourself: where did "N" come from? I can guarantee you it came from an iterative process that looked at every character between some other point and this new point. The proper interface to look at "characters" is ITERATORS. They can move by one in each direction in O(1) time. And different iterators can return composed or decomposed characters, and if the byte is an error they can clearly return that error and also return suggested replacement values.

Unfortunately Unicode and UTF and perhaps some kind of politically-correct rule that we can only have equality and world peace if some people don't get the "better" shorter encodings, seems to turn quite intelligent programmers into complete morons. Or more like idiot savants: they are dangerously talented enough to write these horrible things and foist them on everybody.

UTF-16

paulj — Fri, 26 Mar 2010 02:49:51 +0000

I have a (mainland Chinese) Chinese dictionary here, intended for kids,
and it is indexed by pinyin. From what I have seen of (mainland) Chinese,
pinyin appears to be their primary way of writing Chinese (i.e. most writing
these days is done electronically, and pinyin is used as the input
encoding).