Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 15:29 UTC (Tue) by martin.langhoff (subscriber, #61417)
In reply to: Watson: Launchpad now runs on Python 3 by mb
Parent article: Watson: Launchpad now runs on Python 3

> Well, so your data is broken.

That's crazy talk. _I don't control the inputs_. There's spaces where you "own" the data and its formatting. There's spaces where you _don't_.

For example, version control systems will try to "parse" text in limited ways -- newlines for diff/patching -- while being agnostic about individual characters. And Unix allows any old crud in a directory entry, _it's a bag of bytes_. The Mercurial folks had a similarly hard time with Py3.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 16:18 UTC (Tue) by mb (subscriber, #50428) [Link] (11 responses)

>I don't control the inputs

But you do control how you process that data.
If your input data might be broken, then of course you can't just decode it with the default parameters, because that will raise exceptions and abort if not caught. (and that's a sane default).
You need to tell the program/libraries/language what you want to do.

>newlines for diff/patching -- while being agnostic about individual characters.

That's just waiting to blow up, if you just scan for values at byte boundaries, which could be in the middle of any Unicode character.
But if you really want to do such things, just do _not_ decode it and work with bytes (b'\n', etc...).

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 16:50 UTC (Tue) by comex (subscriber, #71521) [Link] (2 responses)

UTF-8 is intentionally designed so that the encoding of one character will never appear within the encoding of another character. As part of that, non-ASCII characters are encoded using a series of bytes that all have the highest bit set. So as long as you only care about UTF-8 rather than legacy multibyte character sets, you don’t have to worry about finding newline bytes in the middle of other characters.
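
A minimal sketch of that property, for anyone who wants to check it rather than take it on faith. The sample text and the helper name `non_ascii_bytes_have_high_bit` are made up for illustration; this only demonstrates the claim for valid UTF-8 and says nothing about legacy multibyte encodings.

```python
# In UTF-8, every byte of a non-ASCII character has its high bit set,
# so an ASCII byte such as b"\n" can only ever stand for itself.
text = "na\u00efve\n\u65e5\u672c\u8a9e\n\u03ba\u03cc\u03c3\u03bc\u03bf\u03c2"
data = text.encode("utf-8")

# Splitting the raw bytes on b"\n" yields the same lines as splitting the text.
byte_lines = data.split(b"\n")
text_lines = text.split("\n")
lines_match = [b.decode("utf-8") for b in byte_lines] == text_lines


def non_ascii_bytes_have_high_bit(s):
    """Return True if every byte of every non-ASCII character is >= 0x80."""
    return all(
        byte >= 0x80
        for ch in s if ord(ch) > 0x7F
        for byte in ch.encode("utf-8")
    )
```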

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 17:26 UTC (Tue) by mb (subscriber, #50428) [Link] (1 responses)

>UTF-8 is intentionally designed so that the encoding of one character will never appear within the encoding of another character.

But that doesn't help, if the input data encoding is not known or if it's not even known to be correctly encoded.
That's where the error handling options of Python's codecs come into play.
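
To make the error handling options concrete, here is a small sketch. The input bytes are invented for the example (one Latin-1 line, one UTF-8 line); `errors="strict"`, `"replace"` and `"surrogateescape"` are standard handler names in Python's codecs.

```python
# First line is Latin-1 encoded, second line is UTF-8 encoded.
raw = b"caf\xe9\nna\xc3\xafve\n"

# Default handler ("strict"): decoding raises on the first bad byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "replace": bad bytes become U+FFFD; the original byte value is lost.
replaced = raw.decode("utf-8", errors="replace")

# "surrogateescape": bad bytes become lone surrogates (U+DC80..U+DCFF),
# which can be turned back into the original bytes when encoding.
escaped = raw.decode("utf-8", errors="surrogateescape")
round_trip = escaped.encode("utf-8", errors="surrogateescape")
```

Only the last variant is lossless, which is why it is the one to reach for when the data has to be handed on unchanged.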

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 23:41 UTC (Tue) by lsl (subscriber, #86508) [Link]

Most of the time I work with protocols or interfaces where certain byte values (e.g. '/', '\n' or '\0') are assigned some meaning but I don't have to care at all what meaning is encoded into other byte values, I just need to be able to reproduce them again at some later point.

Python 3 made that very inconvenient, especially in initial versions where there weren't any byte string versions of common functionality (e.g. getenv).

Even today, many Python libraries and programs just explode because they don't use the byte string versions of standard library functions even when it would be appropriate (common examples are Unix filesystems or environment variables as well as many network protocols). The str variants tend to be more prominently documented and more convenient so that's what many authors use.

Reliability suffers as a consequence.
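
For reference, a rough sketch of what using the byte string variants looks like today. It assumes a Linux-like system where the filesystem accepts arbitrary bytes in names (some filesystems, e.g. on macOS, will refuse the name below); the helper `list_names_as_bytes` and the file name are made up for the example.

```python
import os
import tempfile


def list_names_as_bytes(directory):
    """Return directory entries as raw bytes, without any decoding."""
    # Passing a bytes path makes os.listdir() return bytes names.
    return sorted(os.listdir(os.fsencode(directory)))


with tempfile.TemporaryDirectory() as tmp:
    odd_name = b"caf\xe9.txt"  # Latin-1 bytes, not valid UTF-8
    path = os.path.join(os.fsencode(tmp), odd_name)
    with open(path, "wb") as f:  # open() accepts bytes paths too
        f.write(b"hello\n")
    names = list_names_as_bytes(tmp)

# Environment variables as bytes; os.getenvb only exists where the
# platform supports a bytes environment (os.supports_bytes_environ).
path_bytes = os.getenvb(b"PATH", b"") if os.supports_bytes_environ else b""
```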

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 17:07 UTC (Tue) by khim (subscriber, #9252) [Link] (7 responses)

> You need to tell the program/libraries/language what you want to do.

Except Python3 made it, initially, impossible. Because the only sane option for a lot of data out there is “that's some sequence of bytes with some strings injected into it”.

Python3 had binary strings which were possible to use for processing such data but they were limited (on purpose).

That was a huge regression compared to Python 2 (not even sure they fixed everything since I have never switched from Python 2 to Python 3).

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 17:23 UTC (Tue) by mb (subscriber, #50428) [Link] (6 responses)

>Except Python3 made it, initially, impossible.

Do you have an example?

>Because the only sane option for a lot of data out there is “that's some sequence of bytes with some strings injected into it”.

If your input data has strings embedded in non-Unicode bytes, then extract the strings before decoding the Unicode.
That's how it should have been done in the Python 2 script, too.

>That was a huge regression compared to Python 2

What exactly is impossible to do in Python 3, that had worked fine in Python 2?
A short example helps the discussion.

>not even sure they fixed everything since I have never switched from Python 2 to Python 3

Ok, so this is just FUD?

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 17:50 UTC (Tue) by khim (subscriber, #9252) [Link] (4 responses)

> Do you have an example?

Just read an announcement. One example is proudly shown right there: Note: the 2.6 description mentions the format() method for both 8-bit and Unicode strings. In 3.0, only the str type (text strings with Unicode support) supports this method; the bytes type does not. The plan is to eventually make this the only API for string formatting, and to start deprecating the % operator in Python 3.1.

How, pray tell, should I format anything if I'm told not to use % and format doesn't work with real data and % is supposed to be removed?

> If your input data has strings embedded in non-Unicode bytes, then extract the strings before decoding the Unicode.

How do you propose to do that if we are dealing with XML which includes random binary strings? Yes, I know it's not valid XML, but you know, customers don't care. If the old version of the program works and the new one doesn't, then they will just say it's broken and ask you to fix it.

> What exactly is impossible to do in Python 3, that had worked fine in Python 2?

Python 2.6 and Python 3.0: the ability to get the filename and, you know, open the file, e.g.

Yes, I know, they fixed that in Python 3.4. By no longer pretending that filenames are strings.

And maybe, by now, they have even made it possible to combine them with raw sequences of binaries.

But I'm not really interested in that now: I left the Python camp when they broke strings in Python 3. That was a better choice than waiting for them to make it kinda-sorta usable again.

> Ok, so this is just FUD?

If a press release by the language makers is FUD to you then yeah, that's FUD, I guess.

Python 2.6 was much better than Python 3.0 and Python 2.7 was much better than Python 3.3. I later found out that the issues with filenames were fixed in Python 3.4 (kinda-sorta: you still can't work with filenames as well as you can in 2.7, but at least you can support them now… what an achievement), but I suspect even the latest versions of Python 3 are still worse than Python 2.7 (although, true, I don't use it, thus I wouldn't know).

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 23:54 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (3 responses)

> Just read an announcement. One example is proudly shown right there: Note: the 2.6 description mentions the format() method for both 8-bit and Unicode strings. In 3.0, only the str type (text strings with Unicode support) supports this method; the bytes type does not. The plan is to eventually make this the only API for string formatting, and to start deprecating the % operator in Python 3.1.
>
> How, pray tell, should I format anything if I'm told not to use % and format doesn't work with real data and % is supposed to be removed?

This problem was fixed in 2014 with PEP 461. Complaining about something that was fixed many years ago is indeed FUD.
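
For anyone who hasn't looked since, this is roughly what bytes formatting looks like after PEP 461 (shipped in Python 3.5). The function name and the request line are just an illustration, not taken from any real library.

```python
def build_request_line(method, path):
    """Format a request line entirely in bytes; nothing is decoded."""
    # %b inserts a bytes object as-is; %d formats an int as ASCII digits.
    return b"%b %b HTTP/1.%d\r\n" % (method, path, 1)


# The path contains a byte that is not valid UTF-8; it passes through untouched.
request = build_request_line(b"GET", b"/caf\xe9")

# Note that bytes still has no .format() method; only % was added.
bytes_has_format = hasattr(bytes, "format")
```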

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 6:28 UTC (Wed) by khim (subscriber, #9252) [Link] (2 responses)

> This problem was fixed in 2014 with PEP 461. Complaining about something that was fixed many years ago is indeed FUD.

Not if we are discussing history. The switch to Python 3 was a barely mitigated disaster: it replaced a working string model with a broken one and added enough incompatibilities that it took 10 years.

Granted, all three “big P” languages (Perl, PHP and Python) tried to do that and only Python managed to break the developers' expectations yet hold the mindshare, but I wonder what would have happened in an alternate history where at least one camp had tried to evolve the language without huge breakages.

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 19:08 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (1 responses)

> Not if we are discussing history. The switch to Python 3 was a barely mitigated disaster: it replaced a working string model with a broken one and added enough incompatibilities that it took 10 years.

Meh. So what? Python 2 lost support well over a year ago. As far as I'm concerned, this is all ancient history at this point. But every time anyone mentions Python, in any capacity, on LWN (or Hacker News, for that matter), the comments *always* turn into a ridiculous flame war over it.

Look. I get it. The flag day was painful, arguably unnecessary, etc. But what's done is done, and it's clearly not going to happen again. So maybe we can all just take a breath and find something more productive to talk about?

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 19:39 UTC (Wed) by khim (subscriber, #9252) [Link]

> Python 2 lost support well over a year ago.

Yup. Means now it's time to think what to do about that mess. Look here, for example

Distributions are starting to remove Python 2, which means that the simple solution which worked for years (ignore Python 3's existence and use Python 2) no longer works. Enterprise guys are starting to switch. Some of them. But that saga is still far from being over. When will RHEL 8 go out of support? 2029? Well, I guess by 2030 or so we may declare Python 2 dead and buried then.

> As far as I'm concerned, this is all ancient history at this point.

Sorry, but if that's all “ancient history at this point” then why do you even leave comments under an article which proves that it's most definitely not ancient history?

Look at its title, for chrissake! You may try to go away and pretend that it doesn't exist, but it's stupid to pretend that all that pain happened years ago when commenting on something that shows the story is still ongoing.

> Look. I get it. The flag day was painful, arguably unnecessary, etc. But what's done is done, and it's clearly not going to happen again. So maybe we can all just take a breath and find something more productive to talk about?

Sure. If that had been what mb said, then I would have stopped the discussion. But that's not what happened.

It's a bit like the pain of switching from Windows Classic to Windows XP or from Mac Classic to MacOS X. You may say “what's done is done” (and that would be true!) but that still doesn't change the fact that it was painful and unnecessary.

The computer industry is now a mature industry. Decades-long deprecation cycles are typical. You can't avoid it.

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 18:53 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

> Do you have an example?

For example, it was initially impossible in Python 3.0 to use byte strings in "%" formatting. It was fixed only in Python 3.5: https://www.python.org/dev/peps/pep-0461/

> If your input data has strings embedded in non-Unicode bytes, then extract the strings before decoding the Unicode. That's how it should have been done in the Python 2 script, too.

In many cases what I want is to read the data, parse it a bit and then give it verbatim to the next layer that might make more sense about it. Even if the data can't be properly represented as valid UTF-8.
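
A sketch of what I mean by "parse it a bit and pass it on", assuming a made-up `key: value` line format; `parse_headers` and `serialize_headers` are hypothetical helpers, not from any library. Only the keys are interpreted; the values are never decoded.

```python
def parse_headers(blob):
    """Split b'key: value' lines; keys are ASCII text, values stay raw bytes."""
    headers = {}
    for line in blob.split(b"\n"):
        key, sep, value = line.partition(b": ")
        if not sep:
            continue  # empty or malformed line; ignored in this sketch
        headers[key.decode("ascii")] = value  # the value is kept verbatim
    return headers


def serialize_headers(headers):
    """Write the headers back out byte for byte."""
    return b"".join(
        key.encode("ascii") + b": " + value + b"\n"
        for key, value in headers.items()
    )


# Values that are not valid UTF-8 survive the round trip unchanged.
blob = b"Subject: caf\xe9\nX-Blob: \xff\xfe\x00\x01\n"
parsed = parse_headers(blob)
passed_on = serialize_headers(parsed)
```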

Filesystems were a great example. Up until Py3.6 the built-in Python filesystem module simply SKIPPED undecodable file names in directory listings. How's that for reliability?

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 5:49 UTC (Wed) by dvdeug (guest, #10998) [Link] (7 responses)

> That's crazy talk. _I don't control the inputs_. There's spaces where you "own" the data and its formatting. There's spaces where you _don't_.

There's no space where you can't reject data, because there's simply data you can't make heads or tails of. And you can always delegate massaging data into a sane format to a separate function.

> And Unix allows any old crud in a directory entry, _it's a bag of bytes_.

That's broken, because it's not. What's stored in 62696e? If we weren't having this discussion, would you even start to think that was a standard Unix directory with a three-letter name? https://refspecs.linuxfoundation.org/LSB_5.0.0/LSB-Common... says that certain libraries must be installed with runtime names like libcrypt.so.1. At no point does it mention that that's a transliteration into ASCII. (It could be κιβγρψοτ.σξ.1, with ELOT 927; that's an entirely acceptable reading of a bag of bytes.) If it were a bag of bytes, then the standard would be careful not to be implicit about it. In reality, filenames are strings, and anyone taking advantage of its bag of bytes nature is being careless or willfully malicious.
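
To spell out the 62696e example: the same three bytes read differently depending on the encoding you assume. Python ships no ELOT 927 codec that I know of, so this sketch uses `cp037` (an EBCDIC code page) purely as a stand-in for "some non-ASCII reading".

```python
# The three bytes from the example above.
name = bytes.fromhex("62696e")

# Read as ASCII, this is the familiar directory name.
as_ascii = name.decode("ascii")

# Read as EBCDIC (cp037), the very same bytes are something else entirely.
as_ebcdic = name.decode("cp037")
```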

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 7:43 UTC (Wed) by mpr22 (subscriber, #60784) [Link] (1 responses)

I wasn't aware that "bgn" was a standard Unix directory.

Live and learn, I guess.

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 7:44 UTC (Wed) by mpr22 (subscriber, #60784) [Link]

... Derp. I should *finish* my coffee, that's "bin".

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 11:11 UTC (Wed) by HelloWorld (guest, #56129) [Link] (4 responses)

> In reality, filenames are strings, and anyone taking advantage of its bag of bytes nature is being careless or willfully malicious.

People have been using file names in encodings other than UTF-8 (such as ISO-8859-1) for decades, and that can easily result in arbitrary byte sequences. There's nothing in the APIs that prevents you from creating non-UTF-8 file names, and it's easy to create multiple files with names that represent the same sequence of graphemes when interpreted as UTF-8. The simple fact of the matter is: the only sane way to make sense of Unix file names is to treat them as a bag of bytes.
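
A quick illustration of the "same graphemes, different files" point, using Unicode normalization forms. The file name is invented; `unicodedata.normalize` is the standard library call.

```python
import unicodedata

# The same visible name, once with a precomposed e-acute (NFC) and once
# with a plain "e" followed by a combining acute accent (NFD).
composed = unicodedata.normalize("NFC", "caf\u00e9.txt")
decomposed = unicodedata.normalize("NFD", "caf\u00e9.txt")

nfc_bytes = composed.encode("utf-8")    # b'caf\xc3\xa9.txt'
nfd_bytes = decomposed.encode("utf-8")  # b'cafe\xcc\x81.txt'

# Both render identically, but a filesystem that compares names byte by
# byte sees two distinct names and will happily store both files.
distinct_on_disk = nfc_bytes != nfd_bytes
```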

> At no point does it mention that that's a transliteration into ASCII.

Because most real-world encodings (UTF-8, ISO-8859-*) are ASCII compatible, so it doesn't matter. And besides: LSB? Nobody cares.

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 12:27 UTC (Wed) by Wol (subscriber, #4433) [Link] (2 responses)

And is there anything in the spec for Unix that forbids EBCDIC? or KOI-8?

Aiui, the spec states that NULL terminates the string (so that can't be part of a file name), and "/" separates elements within a path. Apart from those two, any individual file or directory *identifier* (let's call it that) can be any random byte pattern. (I've had systems where control characters were deliberately inserted into file identifiers to protect the files as much as possible from accidental damage.)

While I think utf-8 can contain pretty much the entire sequence of valid bytes, there are order limits so it's not a *random* byte pattern.

Cheers,
Wol

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 12:40 UTC (Wed) by HelloWorld (guest, #56129) [Link] (1 responses)

> And is there anything in the spec for Unix that forbids EBCDIC? or KOI-8?

No, because everybody knows that when file names are specified as strings in such specifications, there's an implicit assumption that the encoding is ASCII. And that works because such specifications generally don't specify file names with non-ASCII characters in them.

> Aiui, the spec states that NULL terminates the string (so that can't be part of a file name), and "/" separates elements within a path. Apart from those two, any individual file or directory *identifier* (let's call it that) can be any random byte pattern.

Exactly: a file name is a bag of bytes except 0 and 2F. Would another format be better? Perhaps so, but it's too late to enforce that at this point.
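
Written down as code, the whole rule fits in one line. This is a sketch of just the rule stated above; `is_valid_component` is a made-up name, and it deliberately ignores length limits and the special meaning of "." and "..".

```python
def is_valid_component(name):
    """Accept any non-empty bytes that contain neither 0x00 nor 0x2F ('/')."""
    return len(name) > 0 and b"\x00" not in name and b"/" not in name


# Control characters, newlines and invalid UTF-8 are all allowed.
examples = {
    b"bin": is_valid_component(b"bin"),
    b"caf\xe9": is_valid_component(b"caf\xe9"),
    b"\x01\xff\n": is_valid_component(b"\x01\xff\n"),
    b"a/b": is_valid_component(b"a/b"),
    b"a\x00b": is_valid_component(b"a\x00b"),
}
```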

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 21:20 UTC (Wed) by dvdeug (guest, #10998) [Link]

> Would another format be better? Perhaps so, but it's too late to enforce that at this point.

It's been discussed; see https://dwheeler.com/essays/fixing-unix-linux-filenames.html . The POSIX requirement is for incredibly limited names only, and it's easy to restrict it greatly at zero cost to most users.

Watson: Launchpad now runs on Python 3

Posted Aug 4, 2021 20:58 UTC (Wed) by dvdeug (guest, #10998) [Link]

> the only sane way to make sense of Unix file names is to treat them as a bag of bytes.

It's not a bag. I skipped this the first time around, but a bag is generally a name for a multiset, or at least some other data structure that has no order. A Unix file name is a sequence of bytes.

Again, nobody treats it as a sequence of bytes. Coreutils will make sure not to dump random noise to the screen, but there are no standard C functions to do that that I recall, and I don't know if GNU Libc has something there. It's not sane if everyone just treats them as strings, and anyone wanting to handle them as a sequence of bytes has to write special code and tiptoe around everything.