PHP and P++
Posted Aug 15, 2019 17:20 UTC (Thu) by NYKevin (subscriber, #129325)
In reply to: PHP and P++ by burki99
Parent article: PHP and P++
The real problem was Unicode support. It's basically impossible to determine by static analysis how to transform a string-manipulation program written in Python 2 into Python 3, because you don't know the language-level types of anything, and you also don't know whether any given 8-bit string (Python 2 str) is semantically text, bytes, or a Unix filesystem path (which is neither text nor bytes but an unholy amalgamation of both).
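As a rough illustration of the "neither text nor bytes" point, here is a minimal Python 3 sketch. The file name and the helper names `to_text` and `to_bytes` are made up for the example. It uses the `surrogateescape` error handler, which is what `os.fsdecode()` and `os.fsencode()` apply on POSIX systems, to carry undecodable bytes through a `str` and back:

```python
# Hypothetical file name: Latin-1 encoded, so it is not valid UTF-8.
raw_name = b"caf\xe9.txt"


def to_text(name: bytes) -> str:
    """Decode a path-like byte string, smuggling bad bytes as lone surrogates."""
    return name.decode("utf-8", errors="surrogateescape")


def to_bytes(name: str) -> bytes:
    """Reverse of to_text(): turn the smuggled surrogates back into raw bytes."""
    return name.encode("utf-8", errors="surrogateescape")


as_text = to_text(raw_name)         # a str that is not really text
round_tripped = to_bytes(as_text)   # identical to the original bytes

# A strict decode refuses the same input, which is why a Python 2 str
# holding a path cannot be mechanically retyped as text.
try:
    raw_name.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False
```

The resulting `str` round-trips, but it cannot be encoded as strict UTF-8, so it behaves like text only until something tries to print or transmit it.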
Posted Aug 15, 2019 20:41 UTC (Thu)
by juliank (guest, #45896)
[Link] (4 responses)
Posted Aug 16, 2019 10:39 UTC (Fri)
by h2g2bob (subscriber, #130451)
[Link] (3 responses)
enable_foo = b'true'
Obviously enable_foo is from one or more read() or recv() in a different module. Or from ctypes. Or from users of your library code.
Posted Aug 16, 2019 11:03 UTC (Fri)
by juliank (guest, #45896)
[Link] (1 responses)
Posted Aug 18, 2019 3:00 UTC (Sun)
by k8to (guest, #15413)
[Link]
That sounds probably useful for most code I write, and it would cause huge explosions in most code I have to work on that other people write. Probably a good idea all around.
Posted Aug 18, 2019 23:25 UTC (Sun)
by mjblenner (subscriber, #53463)
[Link]
python3 -bb
>>> b'true' == 'true'
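For anyone who wants to try this outside the REPL, a small sketch follows. The variable names are invented for the example, and the child interpreter is simply whatever `sys.executable` points at. It contrasts the default behaviour, where the comparison is silently false, with a child process started with `-bb`:

```python
import subprocess
import sys

# Default behaviour: bytes never compare equal to str, and nothing complains.
config_value = b"true"
silent_result = (config_value == "true")

# Run the same comparison in a child interpreter with -bb, which promotes
# BytesWarning to an error.
snippet = "a = b'true'; b = 'true'; print(a == b)"
proc = subprocess.run(
    [sys.executable, "-bb", "-c", snippet],
    capture_output=True,
    text=True,
)
strict_failed = proc.returncode != 0
strict_message = proc.stderr
```

With a single `-b` the same comparison only prints a warning and carries on.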
Posted Aug 15, 2019 20:56 UTC (Thu)
by rweikusat2 (subscriber, #117920)
[Link] (12 responses)
A UNIX filesystem name is a sequence of bytes whose values are neither 0 nor 47. A UNIX filesystem path is a sequence of UNIX filesystem names separated by non-empty sequences of bytes with value 47 ('/'). The unholy idea that there's one character set to rule them all (which - coincidentally - makes everyone bend over backwards to get support for the characters his language is written in except people from the USA) and that The Character Set Encoding is as dictated to users as The Character Set by some entity selling operating systems is decades newer than this.
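The definition above is small enough to write down directly. This is only a sketch of the rule as stated here; the function names are made up, it assumes a name must be non-empty, and it ignores length limits and the special meaning of "." and "..":

```python
NUL = 0     # byte value that terminates C strings
SLASH = 47  # byte value of '/'


def is_valid_name(name: bytes) -> bool:
    """True if name is a non-empty byte sequence containing neither 0 nor 47."""
    return len(name) > 0 and NUL not in name and SLASH not in name


def split_path(path: bytes) -> list:
    """Split a path into names; runs of '/' count as a single separator."""
    return [part for part in path.split(b"/") if part]


# No encoding is involved anywhere: a Latin-1 name is as valid as a UTF-8 one.
latin1_name = "café".encode("latin-1")
utf8_name = "café".encode("utf-8")
```

Both encoded names pass `is_valid_name()`, although they are different byte sequences and so name different files.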
Posted Aug 15, 2019 21:02 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (9 responses)
At this point mandating UTF-8 for file names is pretty much the only sane way.
Posted Aug 16, 2019 1:01 UTC (Fri)
by flussence (guest, #85566)
[Link] (8 responses)
Posted Aug 16, 2019 1:14 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Aug 16, 2019 11:14 UTC (Fri)
by ale2018 (guest, #128727)
[Link]
Posted Aug 16, 2019 16:11 UTC (Fri)
by Deleted user 129183 (guest, #129183)
[Link] (4 responses)
Since in Unicode, precomposed characters exist only for compatibility with pre-Unicode encodings, NFD should probably be the way to go.
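To make the precomposed versus decomposed distinction concrete, here is a short sketch using Python's standard `unicodedata` module. The sample name is just an illustration. It shows that the two normalization forms of the same visible string are different code point sequences, and therefore different UTF-8 byte sequences, which a byte-oriented filesystem would treat as two distinct names:

```python
import unicodedata

# Written with an escape so the source form of the literal is unambiguous:
# U+00F5 is the precomposed "o with tilde".
name = "T\u00f5nisson"

nfc = unicodedata.normalize("NFC", name)  # precomposed: one code point for õ
nfd = unicodedata.normalize("NFD", name)  # decomposed: 'o' followed by U+0303

nfc_bytes = nfc.encode("utf-8")
nfd_bytes = nfd.encode("utf-8")

# Same rendered text, different bytes on disk.
same_bytes = (nfc_bytes == nfd_bytes)
```

Whichever form is chosen, names only compare reliably if every producer normalizes to the same one.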
Posted Aug 16, 2019 16:17 UTC (Fri)
by rweikusat2 (subscriber, #117920)
[Link] (3 responses)
Posted Aug 16, 2019 20:14 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
But what does this have to do with the mess that is file names?
Posted Aug 18, 2019 16:22 UTC (Sun)
by rweikusat2 (subscriber, #117920)
[Link] (1 responses)
Posted Aug 18, 2019 17:13 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Aug 16, 2019 19:17 UTC (Fri)
by mpr22 (subscriber, #60784)
[Link]
Posted Aug 15, 2019 22:09 UTC (Thu)
by roc (subscriber, #30627)
[Link]
If you care about those problems then you need to define the encoding of path names, and decide how to handle path names that aren't valid in the encoding.
Posted Aug 17, 2019 12:44 UTC (Sat)
by dvdeug (guest, #10998)
[Link]
> which - coincidentally - makes everyone bend over backwards to get support for the characters his language is written in except people from the USA
To the extent that's true, it's less true than any of the systems that preceded it, and one character set to rule them all seems to be the best way to reduce that problem. UNIX basically assumes that whatever character set is being used, it's a superset of ASCII, and that can hardly be the fault of Unicode, which was created 20 years later. Heck, in 1998, simply supporting 8-bit characters was a release goal for Debian Hamm, because many Un*x utilities didn't out of the box. That is, you could have any character set you want, as long as it's ASCII.
On any of the pre-Unicode European solutions, an Estonian named Tõnisson would be out of luck in adding his correct name to a document that French and Germans had already added their names to; one byte worked for Western Europe, and who wanted to waste more space for Estonians with names like Tõnisson? If you were lucky enough to be using something that supported ISO-2022 (i.e. someone from East Asia was probably involved), the Estonian could type his name, but not actually search safely for names, as Päts could be encoded various ways, depending on whether a German or an Estonian entered the name.
And - coincidentally - Unicode was the first and usually only character set for hundreds of languages around the world. Speakers of small, less powerful, languages like Lakota or Greenlandic or Xhosa had to resort to font hacks to get any support for the language at all, whereas now it comes free with a decent-sized Unicode font.
if enable_foo == u'true':
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
BytesWarning: Comparison between bytes and string
How does UTF-8 make everybody bend over backwards?