PHP and P++

Posted Aug 15, 2019 17:20 UTC (Thu) by NYKevin (subscriber, #129325)
In reply to: PHP and P++ by burki99
Parent article: PHP and P++

The brackets around print() were the *least* of Python 3's problems. If that had been the entire change, then both 2to3 and 3to2 would have been completely trivial programs, everyone would have transformed their code once, and then it would have been over and done with. Once every now and then, some ancient code would spit out a "SyntaxError: missing parentheses in call to print," you'd Google it, and StackOverflow would tell you "run it through 2to3," and that would be it.

The real problem was Unicode support. It's basically impossible to determine by static analysis how to transform a string-manipulation program written in Python 2 into Python 3, because you don't know the language-level types of anything, and you also don't know whether any given 8-bit string (Python 2 str) is semantically text, bytes, or a Unix filesystem path (which is neither text nor bytes but an unholy amalgamation of both).
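To make the "amalgamation" concrete, here is a minimal Python 3 sketch of how a path that is not valid UTF-8 can still round-trip through a text string. It assumes UTF-8 as the filesystem encoding and uses the `surrogateescape` error handler from PEP 383; the helper names `decode_path` and `encode_path` are hypothetical, chosen only for illustration.

```python
def decode_path(raw):
    """Turn raw path bytes into str, smuggling undecodable bytes as lone surrogates."""
    return raw.decode("utf-8", "surrogateescape")


def encode_path(name):
    """Reverse decode_path, restoring the original bytes exactly."""
    return name.encode("utf-8", "surrogateescape")


# A Latin-1 encoded name: the byte 0xE9 is not valid UTF-8 in this position.
raw = b"caf\xe9.txt"
name = decode_path(raw)

# The str round-trips to the same bytes, but it is not ordinary text:
# a strict encode of the same string raises UnicodeEncodeError.
round_tripped = encode_path(name)
```

The resulting string behaves like text until something tries to encode it strictly, which is the kind of case a static 2to3-style translation cannot see.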

PHP and P++

Posted Aug 15, 2019 20:41 UTC (Thu) by juliank (guest, #45896) [Link] (4 responses)

Yet in practice, translating was fairly trivial, and a lot of people were simply too lazy and did not bother.

PHP and P++

Posted Aug 16, 2019 10:39 UTC (Fri) by h2g2bob (subscriber, #130451) [Link] (3 responses)

As NYKevin said, the problem is that in Python 3, comparing bytes and unicode silently returns False instead of raising an error. So you'll find code like this the hard way:

    enable_foo = b'true'
    if enable_foo == u'true':
        ...

Obviously enable_foo is from one or more read() or recv() in a different module. Or from ctypes. Or from users of your library code.
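One common way to avoid the silent mismatch above is to decode at the boundary where the bytes arrive, so that only str is ever compared with str. A minimal sketch, assuming the incoming value is UTF-8; the function name `is_enabled` is hypothetical:

```python
def is_enabled(value):
    """Compare a config value with 'true', accepting either bytes or str."""
    if isinstance(value, bytes):
        # Assumption: data from read()/recv() is UTF-8 encoded text.
        value = value.decode("utf-8")
    return value == "true"


flag_from_socket = b"true"

# In Python 3 this comparison is simply False, with no error by default.
naive_result = flag_from_socket == "true"

# Decoding first gives the answer the author presumably wanted.
decoded_result = is_enabled(flag_from_socket)
```

This only helps where you control the boundary; it does nothing for bytes that leak in from other modules, which is the point of the comment.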

PHP and P++

Posted Aug 16, 2019 11:03 UTC (Fri) by juliank (guest, #45896) [Link] (1 responses)

Yeah, it would be easier if comparison were strictly typed.

PHP and P++

Posted Aug 18, 2019 3:00 UTC (Sun) by k8to (guest, #15413) [Link]

Do you mean an exception on type mismatch?

That would probably be useful for most code I write, and it would cause huge explosions in most code I have to work on that other people write. Probably a good idea all around.

PHP and P++

Posted Aug 18, 2019 23:25 UTC (Sun) by mjblenner (subscriber, #53463) [Link]

You could try the -b (or -bb) switch, e.g.:

    $ python3 -bb
    >>> b'true' == 'true'
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    BytesWarning: Comparison between bytes and string
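The same check can be scripted, for instance in a test suite. This is a rough sketch that starts a child interpreter with `-bb` and inspects the result; the helper name `compare_under_bb` is hypothetical, and it assumes `sys.executable` points at a Python 3.7+ interpreter.

```python
import subprocess
import sys


def compare_under_bb():
    """Run a bytes/str comparison in a child interpreter started with -bb."""
    code = "x = b'true'; print(x == 'true')"
    return subprocess.run(
        [sys.executable, "-bb", "-c", code],
        capture_output=True,
        text=True,
    )


result = compare_under_bb()
# With -bb the comparison raises BytesWarning as an error, so the child
# exits with a non-zero status and the traceback lands on stderr.
```

With a single `-b` the comparison only emits a warning and still evaluates to False.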

PHP and P++

Posted Aug 15, 2019 20:56 UTC (Thu) by rweikusat2 (subscriber, #117920) [Link] (12 responses)

> or a Unix filesystem path (which is neither text nor bytes but an unholy amalgamation of both)

A UNIX filesystem name is a sequence of bytes whose values are neither 0 nor 47. A UNIX filesystem path is a sequence of UNIX filesystem names separated by non-empty sequences of bytes with value 47 ('/'). The unholy idea that there's one character set to rule them all (which - coincidentally - makes everyone bend over backwards to get support for the characters his language is written in except people from the USA) and that The Character Set Encoding is as dictated to users as The Character Set by some entity selling operating systems is decades newer than this.
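The definition above translates almost directly into code. A minimal sketch working purely on bytes; the function names are hypothetical, and requiring a name to be non-empty is an added assumption, not part of the definition as stated:

```python
def is_valid_name(name):
    """A name is a sequence of bytes containing neither 0 nor 47 ('/')."""
    # Assumption: an empty name is rejected.
    return len(name) > 0 and 0 not in name and 47 not in name


def split_path(path):
    """Split a path into names, treating any run of '/' bytes as one separator."""
    return [part for part in path.split(b"/") if part]


# No encoding is involved anywhere: 0xE9 is just another byte.
names = split_path(b"/usr//lib/caf\xe9")
```

Nothing here knows or cares whether `caf\xe9` is Latin-1, broken UTF-8, or something else entirely.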

PHP and P++

Posted Aug 15, 2019 21:02 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (9 responses)

> which - coincidentally - makes everyone bend over backwards to get support for the characters his language is written in except people from the USA

How does UTF-8 make everybody bend over backwards?

At this point mandating UTF-8 for file names is pretty much the only sane way.

PHP and P++

Posted Aug 16, 2019 1:01 UTC (Fri) by flussence (guest, #85566) [Link] (8 responses)

NFC, NFD or broken UTF-8?

PHP and P++

Posted Aug 16, 2019 1:14 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Any of them would be better than the status quo.

PHP and P++

Posted Aug 16, 2019 11:14 UTC (Fri) by ale2018 (guest, #128727) [Link]

That is kind of irrelevant. A system choice. Even with ASCII it has always been possible to create files whose names begin with a minus (-), or contain backspaces (\x08), spaces ( ), or other characters that may confuse human and machine interpreters alike. To paraphrase the POTUS, it's not the gun that shoots you in the foot.

PHP and P++

Posted Aug 16, 2019 16:11 UTC (Fri) by Deleted user 129183 (guest, #129183) [Link] (4 responses)

> NFC, NFD or broken UTF-8?

Since in Unicode, precomposed characters exist only for compatibility with pre-Unicode encodings, NFD should probably be the way to go.

PHP and P++

Posted Aug 16, 2019 16:17 UTC (Fri) by rweikusat2 (subscriber, #117920) [Link] (3 responses)

Please wake me when the Unicode Consortium starts to consider S + combining vertical bar aka $ a precomposed character ...

PHP and P++

Posted Aug 16, 2019 20:14 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

The "$" sign is not an "S-with-a-bar". It can be written as "S" with two smaller bars on top and bottom (like in the font I'm using right now).

But what does this have to do with the mess that is file names?

PHP and P++

Posted Aug 18, 2019 16:22 UTC (Sun) by rweikusat2 (subscriber, #117920) [Link] (1 responses)

Why don't you just repeat the original statement without using a pointless aside sharing a couple of characters with a text of mine to pseudo-connect the repetition to this text?

PHP and P++

Posted Aug 18, 2019 17:13 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link]

What? I have no idea what you're saying.

PHP and P++

Posted Aug 16, 2019 19:17 UTC (Fri) by mpr22 (subscriber, #60784) [Link]

For me, NFKC is the obviously-right way to normalize the names of filesystem entities.
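For readers who have not run into the normalization forms discussed in this subthread, here is a small illustrative sketch using Python's standard `unicodedata` module. The helper `same_name` is hypothetical; it only shows why two visually identical file names can be different byte sequences.

```python
import unicodedata

composed = "\u00e9"      # 'é' as one precomposed code point (NFC form)
decomposed = "e\u0301"   # 'e' followed by a combining acute accent (NFD form)


def same_name(a, b, form="NFC"):
    """Compare two names after normalizing both to the given form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# The two spellings encode to different UTF-8 bytes, so a byte-oriented
# filesystem treats them as two distinct names.
composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

# NFKC additionally folds compatibility characters such as the 'fi' ligature.
ligature_nfc = unicodedata.normalize("NFC", "\ufb01")
ligature_nfkc = unicodedata.normalize("NFKC", "\ufb01")
```

Whichever form is picked, the comparison only works if every producer of file names applies the same one.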

PHP and P++

Posted Aug 15, 2019 22:09 UTC (Thu) by roc (subscriber, #30627) [Link]

Treating everything as bytes is fine for filesystem APIs, but a big problem arises when you want to print path names; if you don't know the encoding, and the path name is not ASCII, you can't print them correctly. A slightly lesser problem is the reverse: when you receive a path name that happens to be in Unicode (because it comes from user input in Unicode, for example), and is non-ASCII.

If you care about those problems then you need to define the encoding of path names, and decide how to handle path names that aren't valid in the encoding.
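One possible policy for the printing problem, sketched in Python: assume UTF-8 and render any byte that does not decode as a visible escape, so that nothing is lost or silently mangled. This is only an illustration of "decide how to handle invalid names"; the helper name `printable` and the choice of UTF-8 are assumptions.

```python
def printable(raw):
    """Render path bytes for display, escaping anything that is not valid UTF-8."""
    return raw.decode("utf-8", "backslashreplace")


valid = printable(b"caf\xc3\xa9.txt")    # well-formed UTF-8
invalid = printable(b"caf\xe9.txt")      # Latin-1 bytes, not valid UTF-8

# The reverse direction: a name typed by the user as Unicode has to be
# encoded with the same assumed encoding before it reaches the filesystem API.
from_user = "caf\u00e9.txt".encode("utf-8")
```

The escaped form is unambiguous for a human reader, but it is a display string only and should not be fed back to the filesystem.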

PHP and P++

Posted Aug 17, 2019 12:44 UTC (Sat) by dvdeug (guest, #10998) [Link]

Is it 47, or is it '/'? The latter smacks of "one character set to rule them all", because 47 is 'å' in certain dialects of EBCDIC and can be part of a multibyte character in SJIS.
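The difference between "byte 47" and "the character '/'" can be checked with Python's bundled codecs. A small sketch, using `cp037` as one example EBCDIC code page (the choice of code page and the helper name `slash_byte` are assumptions for illustration):

```python
def slash_byte(codec):
    """Return the byte value that encodes '/' in the given single-byte codec."""
    return "/".encode(codec)[0]


ascii_slash = slash_byte("ascii")
utf8_slash = slash_byte("utf-8")
ebcdic_slash = slash_byte("cp037")
```

In ASCII and UTF-8 the two statements coincide; in this EBCDIC code page they do not, so a rule phrased in terms of byte values quietly assumes an ASCII-compatible encoding.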

> which - coincidentally - makes everyone bend over backwards to get support for the characters his language is written in except people from the USA

To the extent that's true, it's less true than of any of the systems that preceded it, and one character set to rule them all seems to be the best way to reduce that problem. UNIX basically assumes that whatever character set is being used, it's a superset of ASCII, which can hardly be the fault of Unicode, which was created 20 years later. Heck, in 1998, simply supporting 8-bit characters was a release goal for Debian Hamm, because many Un*x utilities didn't out of the box. That is, you could have any character set you want, as long as it's ASCII.

On any of the pre-Unicode European solutions, an Estonian named Tõnisson would be out of luck in adding his correct name to a document that French and Germans had already added their names to; one byte worked for Western Europe, and who wanted to waste more space for Estonians with names like Tõnisson? If you were lucky enough to be using something that supported ISO-2022 (i.e. someone from East Asia was probably involved), the Estonian could type his name, but not actually search safely for names, as Päts could be encoded various ways, depending on whether a German or an Estonian entered the name.

And - coincidentally - Unicode was the first and usually only character set for hundreds of languages around the world. Speakers of small, less powerful, languages like Lakota or Greenlandic or Xhosa had to resort to font hacks to get any support for the language at all, whereas now it comes free with a decent-sized Unicode font.