
Upcoming PHP release will offer Unicode support (Linux.com)

Linux.com takes a look at PHP 6.0. "Andrei Zmievski is one of the leading developers of the PHP programming language. Since March 2005, he has been working with about 20 other developers to add Unicode support to version 6.0 of PHP. Now their efforts are nearing an alpha release."

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 1, 2007 20:07 UTC (Thu) by tekNico (subscriber, #22) [Link]

So *come* *on*!

Why should anybody use the horrible hacks that are PHP and MySQL, when we have Python, Ruby, PostgreSQL and SQLite?!?

I don't really understand this world sometimes...

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 1, 2007 20:25 UTC (Thu) by bronson (subscriber, #4806) [Link]

Worse is better!

Both PHP and MySQL, for trivial workloads, tend to be faster than their competition. And most websites are trivial workloads.

I can write a PHP file and host it just about anywhere. It takes like a minute to install Wordpress under /me/blog. Hosting a Rails or Django site takes some decidedly non-trivial configuration, especially on shared hosting. Ever wrestled with fastcgi? It ain't fun. And good luck trying to host Rails or Django anywhere other than on a virtual domain!

So, while I agree with you (most of my web work is in Rails these days), I do understand why so many people still like the inconsistent insecure disaster known as PHP.

One more reason...

Posted Mar 1, 2007 20:43 UTC (Thu) by pizza (subscriber, #46) [Link]

...legacy code.

There's tons of PHP code already out there, and rewriting it in some other language will take a great deal of time (and money), not to mention most likely introduce boatloads of new bugs in the process... if you ever finish.

What's important (to the end-user) is what the code does, not how it does it.

I maintain a pile of PHP code, and while I often dream of rewriting it in mod_perl, the reality is that I have much better things to do with my time, so I slowly add new features and refactor/rewrite old code as needed.

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 2, 2007 4:56 UTC (Fri) by Holmes1869 (guest, #42043) [Link]

Very much agree with the point about PHP being hosted everywhere. That is very key to its acceptance because people (for some strange reason ;) don't feel like hosting their own sites. PHP has inertia going for it and that will help it stay ahead of others for some time.

As far as the comment about Rails being difficult to configure, I do agree with that. It's easy to use with the built-in web server (props to the Rails folks for including it), but it took me a couple tries before I finally got it up and running with Apache. On the other hand, setting up and using standard .rhtml/mod_ruby/Apache is extremely simple (as simple as PHP) with most modern GNU/Linux distros.

PHP isn't so bad, but I honestly think that languages like Ruby would do extremely well with web developers if there was more hype/marketing, and no I'm not talking about using Rails (tons of hype), but just plain old Ruby. A few of my relatively non-technical friends (business majors) use PHP on a daily basis, but have never even heard of Ruby. PHP seems to almost be synonymous with HTML these days. Better than ASP I guess.

And as far as Unicode support goes, I don't think Ruby has it either (something about the Japanese not liking Unicode? just something I heard). I couldn't really care less about Unicode, but that's 'cause I'm a filthy American. Expanding internationalization is a good thing though, so cheers to the good folks in PHP-land that are making this possible.

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 2, 2007 11:20 UTC (Fri) by gimp (guest, #43821) [Link]

Quoted from Holmes "(something about the Japanese not liking Unicode? just something I heard)"

This is correct. It's because of the way letters are mapped in UTF-8 (Latin letters only take up 1 byte, while Japanese often take 3 or 4 bytes). Today this is less & less of a problem because bandwidth and storage are cheap, but it is still an issue for Unicode becoming a world standard.

In Japan, the Shift-JIS encoding is more popular.
http://en.wikipedia.org/wiki/Shift-JIS

And in China, BIG-5 is more popular.
http://en.wikipedia.org/wiki/Big5

Unicode vs UTF-8

Posted Mar 2, 2007 23:23 UTC (Fri) by ldo (subscriber, #40946) [Link]

> Quoted from Holmes "(something about the Japanese not liking Unicode? just something I heard)"

> This is correct. It's because of the way letters are mapped in UTF-8 (Latin letters only take up 1 byte, while Japanese often take 3 or 4 bytes).

Unicode != UTF-8. There are other, less Roman-biased ways to encode Unicode, such as UTF-16, UCS-2 or UCS-4.
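
To put rough numbers on the size trade-off being argued about here, the following is a minimal Python sketch (Python is used purely for illustration; the sample strings and the `byte_counts` helper are assumptions, not anything from the thread). It counts how many bytes the same text needs in several encodings, with UTF-32 standing in for UCS-4.

```python
# Compare how many bytes the same text needs in different encodings.
# The sample strings below are illustrative assumptions.
SAMPLES = {
    "latin": "hello",
    "japanese": "\u3053\u3093\u306b\u3061\u306f",  # five hiragana characters
}

# The "-le" variants are used so that no byte-order mark is counted.
ENCODINGS = ["utf-8", "utf-16-le", "utf-32-le", "shift_jis"]


def byte_counts(text):
    """Return {encoding name: number of bytes} for the given text."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}


if __name__ == "__main__":
    for name, text in SAMPLES.items():
        print(name, byte_counts(text))
```

For these samples, the Japanese text is larger in UTF-8 than in UTF-16 or Shift-JIS, while the Latin text is smallest in UTF-8 and Shift-JIS; real documents that mix markup and text will fall somewhere in between.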

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 4, 2007 13:15 UTC (Sun) by Cato (subscriber, #7643) [Link]

Quite a lot of Japanese sites now use UTF-8, and it works fine for Japanese, Chinese and so on. See the first few comments on http://twiki.org/cgi-bin/view/Codev/InternationalisationUTF8 for some more on this. Now that web browsers have good UTF-8 support there's little reason not to use Unicode.

If storage on the server is a big issue you can simply use UTF-16 (two bytes for most characters, with only the rare ones outside the Basic Multilingual Plane, such as some esoteric Chinese characters, taking four bytes) - it's trivial to re-encode this to/from UTF-8 for interaction with browsers, or for internal processing, and most browsers support UTF-16 directly anyway.

As a developer of I18N support for a web app (TWiki), I strongly discourage use of Shift-JIS, GB2312 and most other East Asian double-byte character sets - they are not 'ASCII safe', meaning that a program that is searching for a normal ASCII character can find it embedded within a double-byte character, which is a recipe for occasional data corruption that can be hard to find. You will only find this in your software's test cases if you specifically know about this issue (or you are very lucky!).
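
A small Python sketch of the 'ASCII safe' problem, for anyone who has not run into it: the kanji used here and the helper name `contains_ascii_byte` are illustrative choices, not taken from TWiki. In Shift-JIS the second byte of this character has the same value as an ASCII backslash, so a byte-level search finds a backslash that is not really there; in UTF-8 every byte of a multi-byte character is 0x80 or above, so this cannot happen.

```python
# Shift-JIS is not "ASCII safe": the trailing byte of a two-byte character
# can have the same value as an ordinary ASCII character.
TEXT = "\u8868"  # a common kanji (U+8868), chosen as an example

SJIS_BYTES = TEXT.encode("shift_jis")  # two bytes, the second one is 0x5C
UTF8_BYTES = TEXT.encode("utf-8")      # three bytes, all >= 0x80

BACKSLASH = b"\\"  # ASCII 0x5C


def contains_ascii_byte(data, ascii_byte):
    """Naive byte-level search, as an encoding-unaware program would do it."""
    return ascii_byte in data


if __name__ == "__main__":
    print("Shift-JIS bytes:", SJIS_BYTES.hex())
    print("backslash found in Shift-JIS:", contains_ascii_byte(SJIS_BYTES, BACKSLASH))
    print("backslash found in UTF-8:", contains_ascii_byte(UTF8_BYTES, BACKSLASH))
```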

Perl support for Unicode is not bad these days but there are still some performance issues - it helps if you can run web apps in mod_perl or similar, but luckily that's not too hard even for web hosted servers, thanks to virtual private servers using UML, Xen, etc. TWiki has some guidelines at http://twiki.org/cgi-bin/view/Codev/InternationalisationG... on writing I18N-aware code, which are somewhat TWiki specific but might be interesting to web application developers - one important point is to centralise your regexes as people will otherwise write [A-Z] everywhere, which won't work for Unicode or even 8-bit character sets such as ISO-8859-1 and KOI8-R (Cyrillic). There's also some discussion on how to do true Unicode support - anyone who feels like doing some Perl development is very welcome to contribute and maybe learn about Unicode in the process.
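
The `[A-Z]` point applies in any language. As a hedged illustration in Python rather than Perl (the helper names are hypothetical and this is not TWiki code), keeping the "is this an upper-case word?" test in one central helper makes it possible to swap an ASCII-only character class for a check based on Unicode character properties:

```python
import re

# An ASCII-only character class, as is often hard-coded all over a code base.
ASCII_UPPER = re.compile(r"[A-Z]+")


def is_upper_word_ascii(word):
    """True only if every character is one of the 26 ASCII capitals."""
    return ASCII_UPPER.fullmatch(word) is not None


def is_upper_word_unicode(word):
    """True if the word consists of letters and all cased letters are upper case.

    str.isalpha() and str.isupper() consult Unicode character properties,
    so accented Latin and Cyrillic capitals are accepted too.
    """
    return word.isalpha() and word.isupper()


if __name__ == "__main__":
    for word in ["ECOLE", "\u00c9COLE", "\u041c\u041e\u0421\u041a\u0412\u0410", "Moscow"]:
        print(word, is_upper_word_ascii(word), is_upper_word_unicode(word))
```

Because callers only ever use the helper, changing the definition in one place fixes every call site, which is the reason for centralising the regexes in the first place.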

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 5, 2007 11:47 UTC (Mon) by siesel (subscriber, #5021) [Link]

> And in China, BIG-5 is more popular.
GBK is used in China.
BIG-5 is used in Taiwan.

Still, Taiwan is named "Republic of China" but that's another topic. :)

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 2, 2007 6:59 UTC (Fri) by drag (subscriber, #31333) [Link]

Ah. So much MySQL hate in this world...

It's a bit funny that larger and larger entities are adopting MySQL (for a lot more than just websites) while people go on and on about how superior PostgreSQL is.

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 2, 2007 11:32 UTC (Fri) by arcticwolf (guest, #8341) [Link]

Yes... because everything used by "larger and larger entities" has to be good. *coughwindowscough*

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 2, 2007 11:55 UTC (Fri) by drag (subscriber, #31333) [Link]

No, but MySQL definitely isn't Windows.

The reason PHP sucks is because it's principally used in internet-facing applications where security is paramount, and security is a chronic problem for it. Personally I don't give a crap how much of a 'hack' it is or how ugly it is or anything like that as long as it does its job.

MySQL doesn't have that problem.

All I know about PostgreSQL is that every single f-ng time I see MySQL mentioned anywhere in any context, there are a few people who happily proclaim in no uncertain terms how much MySQL blows and how everybody should automatically switch over to PostgreSQL unless they are morons.

Oh that and how MySQL licensing is expensive.

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 2, 2007 16:09 UTC (Fri) by pizza (subscriber, #46) [Link]

> The reason PHP sucks is because it's principally used in internet-facing applications where security is paramount, and security is a chronic problem for it.

That's very true -- the vast majority of vulnerabilities in PHP applications are because of sloppy coding rather than any inherent problems in the language/libraries themselves. A lack of handholding and the ability to do stupid things isn't a bug, but a feature. :)

> Personally I don't give a crap how much of a 'hack' it is or how ugly it is or anything like that as long as it does its job.

Perhaps that's why I love perl so much. :)

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 4, 2007 22:08 UTC (Sun) by dvdeug (subscriber, #10998) [Link]

I fail to see why getting more boxes rooted is a feature. Everyone thinks they're a great programmer, and yet there still seems to be all these bugs that make the Internet a much more dangerous place. You can always write a CGI script in C if you need to do stupid stuff without handholding. A programming language for web programming should, above all, prevent the system from being hacked.

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 9, 2007 17:34 UTC (Fri) by cdmiller (subscriber, #2813) [Link]

A programming language doesn't prevent a system from being cracked, people prevent a system from being cracked...

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 10, 2007 1:08 UTC (Sat) by bronson (subscriber, #4806) [Link]

True. But magic quotes and register globals seem to make it a lot harder for people to prevent the system from getting cracked.

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 14, 2007 19:18 UTC (Wed) by dvdeug (subscriber, #10998) [Link]

It's amazing how fast people who would fight to the death about the better language will deny any responsibility to a language when problems appear.

The Internet Worm, the first one, depended on a flaw in a program written in C. Only in C (or possibly assembly) would a function like "gets" be used, a function that required you to trust the other side not to overflow the buffer. In many languages, the programmer would have to go out of her way to write such code.

Not only that, I've written in several different languages, and I know I write different code in different languages. I write what comes easy and natural to the language; I create different integer based types in Ada because it's easy and natural to do so, but would never go through the hoops to do so in C++. I use pointers and OO in Java where I would use the stack and structs in C++ or Ada. If a language makes it easy and natural to write safe code, programmers will do so; if a language makes it complex and unnatural to write safe code, programmers will write the unsafe code (e.g. gets) that is simple and natural.

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 2, 2007 17:26 UTC (Fri) by k8to (subscriber, #15413) [Link]

The thing that's really hateful about mysql, if you look past the fanboys with various views of what-a-database-must-be which don't necessarily correspond to your use-case, is that it has all kinds of unsafe behaviors by default. It loves to do things like null-out fields or data it doesn't like and commit the modification anyway. I don't see how anyone who has spent significant time developing against it can trust it.

If you only use it one or more steps removed, via a package which uses mysql, then you don't necessarily have to care. But still I won't install it on my systems.

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 3, 2007 11:46 UTC (Sat) by arcticwolf (guest, #8341) [Link]

MySQL does have its share of problems, but that's beside the point... I just wanted to point out that your argument that "MySQL is used by big corps, therefore it must be good" is flawed, no matter whether MySQL actually is good or not.

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 8, 2007 4:38 UTC (Thu) by drag (subscriber, #31333) [Link]

I say 'If MySQL is so bad and unreliable then why does it get chosen more for larger and more important tasks than PostgreSQL?'

You say 'Popularity is not a good gauge of technical superiority; look at Windows vs Linux'.

So what you're saying is that MySQL is more popular for larger stuff because it's more popular?

It's not like there is a huge amount of lock-in associated with PostgreSQL vs MySQL when you're starting off with a new database. I figure a lot of the rationale behind Windows' continued dominance of the desktop isn't going to apply to MySQL vs PostgreSQL.

I am just asking 'Why'. If PostgreSQL is superior to MySQL, then why are the sorts of people that should know the most about this sort of stuff the ones choosing MySQL?

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 8, 2007 4:41 UTC (Thu) by drag (subscriber, #31333) [Link]

Or am I just plain mistaken about the choices adopters of open source software make for the larger/enterprise-ish databases?

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 2, 2007 12:44 UTC (Fri) by epa (subscriber, #39769) [Link]

Isn't SQLite even more horrible than MySQL? MySQL always used to get flamed for accepting 'February 30th' as a valid date... but that is almost reasonable compared to accepting 'hello' as an integer. Foreign keys are silently ignored, indeed it seems almost none of the data integrity you expect a database to provide is there with SQLite. Embedded Firebird sounds like a much better choice.
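
The 'hello' as an integer behaviour is easy to reproduce with Python's standard `sqlite3` module. This is a minimal sketch; the table name `t` and the helper `stored_types` are made up for the illustration. A column declared INTEGER only has a type *affinity*: values that look like integers are converted, and anything else is stored as given.

```python
import sqlite3


def stored_types(values):
    """Insert each value into an INTEGER column and report how SQLite stored it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (n INTEGER)")
    conn.executemany("INSERT INTO t (n) VALUES (?)", [(v,) for v in values])
    # typeof() returns the storage class SQLite actually used for each row.
    rows = conn.execute("SELECT n, typeof(n) FROM t ORDER BY rowid").fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for value, storage in stored_types([42, "42", "hello"]):
        print(repr(value), storage)
```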

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 2, 2007 20:56 UTC (Fri) by khim (subscriber, #9252) [Link]

> Foreign keys are silently ignored, indeed it seems almost none of the data integrity you expect a database to provide is there with SQLite.

Ditto for MySQL. The key words are "you expect a database to provide". If you don't "expect a database to provide" useless bells and whistles in all cases and accept that it's your responsibility to care about data - then MySQL and SQLite are great. I can say this as both developer and user of large-scale systems. If you do expect to find capabilities of "real database" in any SQL server - you will be disappointed.

Not a problem of MySQL and/or SQLite really - just misuse of it.

Sometimes you do need support for foreign keys, for transactions and so on - and then MySQL or SQLite are truly poor choices, but in 90% of cases when I see someone complain that "MySQL is evil since it does not catch my errors" it's just an excuse to write sloppy code, nothing more.

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 3, 2007 19:43 UTC (Sat) by Los__D (subscriber, #15263) [Link]

You of course mean MySQL with MyISAM tables; MySQL with InnoDB tables has full foreign key and transaction support.

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 3, 2007 1:47 UTC (Sat) by nlucas (subscriber, #33793) [Link]

SQLite has no intention of replacing any of the other databases. It is a *light* SQL library, not a real database.

If you use it when you need it and for the right tasks (and not because it's the only thing you have or know) it does an excellent job.

The SQLite author likes to say SQLite is an fopen() replacement, not a MySQL, Postgres or any other *real* database substitute.

SQLite vs. MySQL

Posted Mar 6, 2007 7:37 UTC (Tue) by njs (subscriber, #40338) [Link]

>Isn't SQLite even more horrible than MySQL?

Definitely not. It's true that SQLite doesn't support foreign keys out of the box -- though you can add them manually if you want[1] -- and has an unusual approach to types. (You might be able to add column-based type checks by hand too, I'm not sure.) If I were designing SQLite, it would definitely have static typing.

But, while these design choices are somewhat odd, they're valid and well-documented, and the resulting system is very simple and predictable. The key word is "predictable" -- there is *nothing* more important for a database than predictability, because only predictability gives you a solid foundation to build your actual app on. And in addition, SQLite's author hits all the right points everywhere else -- >95% test coverage including coverage of all edge cases, robust transaction handling, etc. etc. It's not suitable for all uses, but that's because it doesn't try to be, not because it will suddenly blow up in your face at a critical moment -- it's extremely good at what it does.

Compare to MySQL, where you *have* e.g. type checking for dates, but you can't actually rely on it. (Or didn't use to be able to.) And you might have transactions that work, or you might not -- hope you didn't accidentally use the wrong database type! And you have a developer team that has in the past hit all kinds of wrong notes, basically sending the message that they didn't think these problems were important.

I can't personally verify that there are no race conditions in my database's code to do rollback recovery after a power outage -- it's almost impossible to do. I just have to trust that the developers spent a few months bashing their face against the wall thinking of all the edge cases and accounting for them. The biggest problem with MySQL was always that you just couldn't trust the team to be killing themselves to achieve robustness. D. Richard Hipp, OTOH, has consistently earned that trust. It's all the difference in the world.

[1] http://www.sqlite.org/cvstrac/wiki/wiki?p=ForeignKeyTriggers
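
On the question of adding column type checks by hand: one way that appears to work is a CHECK constraint on `typeof()`. The sketch below uses Python's standard `sqlite3` module; the table name and helper names are hypothetical, and this is only one possible approach, not something the SQLite author is quoted as recommending.

```python
import sqlite3


def make_db():
    """Create an in-memory table whose column rejects non-integer values."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (n INTEGER CHECK (typeof(n) = 'integer'))")
    return conn


def try_insert(conn, value):
    """Return True if the insert succeeded, False if the CHECK rejected it."""
    try:
        conn.execute("INSERT INTO t (n) VALUES (?)", (value,))
        return True
    except sqlite3.IntegrityError:
        # A failed CHECK constraint is reported as an integrity error.
        return False


if __name__ == "__main__":
    conn = make_db()
    print("insert 7:", try_insert(conn, 7))
    print("insert 'hello':", try_insert(conn, "hello"))
    conn.close()
```

Written this way, the constraint also rejects NULL, because `typeof(NULL)` is `'null'`; a column that should allow NULL would need a correspondingly wider check.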

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 2, 2007 16:42 UTC (Fri) by frazier (subscriber, #3060) [Link]

I really like PHP. Most of the security problems have been on the program ends (phpBB comes to mind). Sure, the systems by default were asking for people to write insecure code (especially with variable handling) but if you have an idea what you're doing, PHP lets you get it done with lots of advantages:

1. Tons of available code that does what you need or something really close. I find that too much to compare is a common problem. There's that much free code out there. I like that if you find something that does 95% of what you want you can mod it out to taste. Some of it is total crap, but you take a look at it and run away! You at least got to see the code and know you were looking at trash.

2. Built for purpose. Doesn't take a lot of code to get things done. Sending an email is a one liner. Less code is quicker to develop and I find at least in my instance easier to maintain.

3. Lots of hosting options. Many times I don't even need a dedicated server. I like the maintenance and redundant connections/power that you get from a data center. PHP has this behind it. Oh, and it is CHEAP!

4. On a personal level, I have lots of hours in PHP so I have a reasonable grasp of how to get things done with it quickly.

5. It's pretty easy to have a local instance running on Windows if you need it. I only had one computer at my last job and it was Windows (for Photoshop) and I could run PHP on it. (Worth noting, I don't do this at home. I'll have a dedicated server at home.)

...and that's off the top of my head.

PHP lets you get things done quickly and inexpensively with lots of flexibility. It's not so bad.

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 2, 2007 18:31 UTC (Fri) by job (subscriber, #670) [Link]

While I'm sure you like PHP for a multitude of reasons, it's not really fair to blame the security problems solely on the program end. PHP has had an important security update in PHP itself a couple of times per year, for many years now. That's about as bad as sendmail ever was.

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 2, 2007 20:42 UTC (Fri) by frazier (subscriber, #3060) [Link]

I'm not seeing the volume of security bulletins over the wire like sendmail had some years back, but yes, there's been a number of them. Using managed servers lessens the inconvenience to me since I don't have to worry about performing those updates myself.

Maybe PHP will turn a corner like sendmail did. Beats me.

I do know that if I need a web site done in a hurry (and that's pretty much a given) PHP is what I grab for.

-Brock

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 4, 2007 9:02 UTC (Sun) by job (subscriber, #670) [Link]

I don't know where you look, but look at the last release, 5.2.1. There are several fixes for unrelated security issues. Running 5.2.0 obviously left your machine wide open. Then look at the previous release, 5.2.0. Security fixes there as well. And they are not always marked as security fixes in the changelog, which angers me as well, that's not serious behavior from a vendor.

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 2, 2007 20:47 UTC (Fri) by bronson (subscriber, #4806) [Link]

And Register Globals and Magic Quotes never should have seen the light of day. The number of web sites defaced by these misfeatures is just staggering. Yes, every time this happens it's technically the programmer's fault. But PHP really shouldn't make it so easy for the programmer to screw up!

Thankfully these will eventually be banished in a new major release. PHP is slowly getting saner as it evolves.

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 2, 2007 22:40 UTC (Fri) by tjc (subscriber, #137) [Link]

> Why should anybody use the horrible hacks that are PHP and MySQL, when we have Python, Ruby, PostgreSQL and SQLite?!?
I guess it's time to grep for troll-o-meter.txt... oh here it is!
 0   2   4   6   8   10
                  /
                 /
                /
               /
              /
             /
            /
     TROLL-O-METER

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 1, 2007 22:00 UTC (Thu) by ajross (subscriber, #4563) [Link]

Am I to understand that all these years, PHP has *not* supported Unicode? What are the problems?

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 1, 2007 22:34 UTC (Thu) by pizza (subscriber, #46) [Link]

Basically, multi-byte (UTF-16, for example) or even variable-byte characters (UTF-8) greatly complicate things. All code which deals with text manipulation or generation needs to be aware of these encodings.

There is some Unicode support in current PHP releases, but it's kinda haphazard rather than being handled as part of the core language/interpreter specification.

(Perl went through a similar sort of pain to support Unicode natively...)

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 1, 2007 22:38 UTC (Thu) by ajross (subscriber, #4563) [Link]

I know how it works, actually. But this is 2007. Major Linux distros have been shipping UTF-8 as the default locale for what, 4 years now? The time for this was long, long ago.

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 2, 2007 13:52 UTC (Fri) by gravious (guest, #7662) [Link]

Hey, you know, if you care all that much I'm sure they're accepting patches; if you don't care so much then ease up on the PHP devs, huh?

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 3, 2007 11:57 UTC (Sat) by arcticwolf (guest, #8341) [Link]

That's a stupid argument, really - you could apply the same to just about any criticism of just about anything.

Unhappy with your car's gas mileage? Build a better car! Didn't like Star Wars episode 1? Make a better movie! Unhappy with the fact that the surgeon who was supposed to fix your broken leg made a mistake? Just do it yourself next time! Don't like your country's foreign policy? Declare independence and form your own country!

I have no idea where the notion that you're not allowed to criticise things *at all* (and that, rather, you must fix them yourself) comes from, but it's really ridiculous. Of course sometimes there's obnoxious people on mailing lists who will rant and rave and whine about the fact that the developers aren't implementing the obscure feature they requested that's useful only to them, and of course those deserve to be ignored (or bitch-slapped, verbally), but expressing dissatisfaction with the fact that a major programming language has no in-core support for Unicode in 2007 is something else entirely.

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 11, 2007 21:51 UTC (Sun) by ekj (subscriber, #1524) [Link]

So, you're saying: It's inappropriate to ever criticize any Open Source program, EVER

At least that's the logic of your statement; either you contribute a patch (and refrain from critique), or you obviously don't care (and are thus barred from critique)

I guess that wasn't how you meant it, but it sure was how it sounded. Better try again.

Not supporting Unicode in the year 2007 is a *perfectly* legitimate complaint against a programming language.

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 3, 2007 18:03 UTC (Sat) by tjc (subscriber, #137) [Link]

> The time for this was long, long ago.
And yet there are still millions of computer users who can get by nicely with just the ISO-8859-1 character set.

Upcoming PHP release will offer Unicode support (Linux.com)

Posted Mar 2, 2007 11:15 UTC (Fri) by gimp (guest, #43821) [Link]

PHP has supported Unicode and many other encodings for a long time through the official extension php_mbstring.
Documentation of mbstring is here: http://php.net/manual/en/ref.mbstring.php

The big difference with PHP 6, as far as I understand, is that this extension is removed and the features are moved into the core.

This means that all string functions within PHP will be unicode aware.
Existing scripts should work as before (unless they believe that 1 character = 1 byte in some situations).
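
To illustrate what goes wrong when code assumes 1 character = 1 byte, here is a language-neutral sketch in Python (not PHP; the helper names and the sample word are assumptions for the example). Truncating at a byte offset can cut a multi-byte character in half, while truncating at a character offset cannot.

```python
def truncate_bytes_naive(text, limit):
    """Cut the UTF-8 bytes at `limit`, as byte-oriented string code would."""
    return text.encode("utf-8")[:limit]


def truncate_chars(text, limit):
    """Cut at a character boundary, as Unicode-aware string code would."""
    return text[:limit]


def is_valid_utf8(data):
    """Report whether the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    word = "na\u00efve"  # five characters, six bytes in UTF-8
    print(len(word), len(word.encode("utf-8")))
    print(is_valid_utf8(truncate_bytes_naive(word, 3)))  # cut lands mid-character
    print(truncate_chars(word, 3))
```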

This is good for various reasons, for example if you now develop a PHP 6-compatible application, you know that the host will support these encodings, without any special configuration needed.

Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds