Watson: Launchpad now runs on Python 3
Posted Aug 3, 2021 15:29 UTC (Tue) by martin.langhoff (subscriber, #61417)In reply to: Watson: Launchpad now runs on Python 3 by mb
Parent article: Watson: Launchpad now runs on Python 3
That's crazy talk. _I don't control the inputs_. There are spaces where you "own" the data and its formatting. There are spaces where you _don't_.
For example, version control systems will try to "parse" text in limited ways -- newlines for diff/patching -- while being agnostic about individual characters. And Unix allows any old crud in a directory entry, _it's a bag of bytes_. The Mercurial folks had a similarly hard time with Py3.
Posted Aug 3, 2021 16:18 UTC (Tue)
by mb (subscriber, #50428)
[Link] (11 responses)
But you do control how you process that data.
>newlines for diff/patching -- while being agnostic about individual characters.
That's just waiting to blow up if you just scan for values at byte boundaries, which could be in the middle of any Unicode character.
Posted Aug 3, 2021 16:50 UTC (Tue)
by comex (subscriber, #71521)
[Link] (2 responses)
Posted Aug 3, 2021 17:26 UTC (Tue)
by mb (subscriber, #50428)
[Link] (1 responses)
But that doesn't help, if the input data encoding is not known or if it's not even known to be correctly encoded.
Posted Aug 3, 2021 23:41 UTC (Tue)
by lsl (subscriber, #86508)
[Link]
Python 3 made that very inconvenient, especially in initial versions where there weren't any byte string versions of common functionality (e.g. getenv).
Even today, many Python libraries and programs just explode because they don't use the byte string versions of standard library functions even when it would be appropriate (common examples are Unix filesystems or environment variables as well as many network protocols). The str variants tend to be more prominently documented and more convenient so that's what many authors use.
Reliability suffers as a consequence.
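As an illustration of the "byte string versions" mentioned above, here is a minimal sketch assuming a POSIX system and a current Python 3. The helper name `list_names_as_bytes` is made up for this example; `os.listdir`, `os.fsencode` and `os.fsdecode` are standard library calls.

```python
import os
import tempfile


def list_names_as_bytes(directory):
    """Return directory entries as raw bytes by passing a bytes path."""
    # os.listdir returns bytes names when it is given a bytes path.
    return sorted(os.listdir(os.fsencode(directory)))


with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "plain.txt"), "w").close()
    names = list_names_as_bytes(tmp)

# A name that is not valid UTF-8 still round-trips through str on POSIX,
# because os.fsdecode/os.fsencode use the surrogateescape error handler there.
raw_name = b"caf\xe9.txt"
as_text = os.fsdecode(raw_name)
round_tripped = os.fsencode(as_text)
```

On POSIX, `os.environb` offers the same kind of bytes view for environment variables.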
Posted Aug 3, 2021 17:07 UTC (Tue)
by khim (subscriber, #9252)
[Link] (7 responses)
Except Python 3 made it, initially, impossible. Because the only sane option for a lot of data out there is “that's some sequence of bytes with some strings injected into it”. Python 3 had binary strings which could be used for processing such data, but they were limited (on purpose). That was a huge regression compared to Python 2 (I'm not even sure they fixed everything, since I never switched from Python 2 to Python 3).
Posted Aug 3, 2021 17:23 UTC (Tue)
by mb (subscriber, #50428)
[Link] (6 responses)
Do you have an example?
>Because the only sane option for a lot of data out there is “that's some sequence of bytes with some strings injected into it”.
If your input data has strings embedded in non-Unicode bytes, then extract the strings before decoding the Unicode.
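A rough sketch of "extract the strings before decoding", under an invented framing: four opaque bytes, a big-endian 16-bit length, the UTF-8 text, then an opaque trailer. The layout and the name `extract_text` are assumptions for illustration only, not anything described in the thread.

```python
import struct

HEADER_SIZE = 4       # hypothetical: four opaque bytes we never decode
LENGTH_FORMAT = ">H"  # hypothetical: big-endian unsigned 16-bit length


def extract_text(frame: bytes) -> str:
    """Slice the text field out of the frame, then decode only that slice."""
    (length,) = struct.unpack_from(LENGTH_FORMAT, frame, HEADER_SIZE)
    start = HEADER_SIZE + struct.calcsize(LENGTH_FORMAT)
    return frame[start:start + length].decode("utf-8")


payload = "na\u00efve".encode("utf-8")
# The header and trailer contain bytes that are not valid UTF-8.
frame = (b"\xff\xfe\x00\x80"
         + struct.pack(LENGTH_FORMAT, len(payload))
         + payload
         + b"\xc3\x28")
text = extract_text(frame)
```

This only works when the framing says where the text starts and ends. It does not help when the text field itself is mis-encoded.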
>That was huge regression compared to Python 2
What exactly is impossible to do in Python 3, that had worked fine in Python 2?
>not even sure they fixed everything since I have never switched from Python 2 to Python 3
Ok, so this is just FUD?
Posted Aug 3, 2021 17:50 UTC (Tue)
by khim (subscriber, #9252)
[Link] (4 responses)
Just read the announcement. One example is proudly shown right there:
> Note: the 2.6 description mentions the format() method for both 8-bit and Unicode strings. In 3.0, only the str type (text strings with Unicode support) supports this method; the bytes type does not. The plan is to eventually make this the only API for string formatting, and to start deprecating the % operator in Python 3.1.
How, pray tell, should I format anything if I'm told not to use % and format doesn't work with real data and % is supposed to be removed?
How do you propose to do that if we are dealing with XML which includes random binary strings? Yes, I know it's not valid XML, but you know, customers don't care. If the old version of a program works and the new one doesn't, then they will just say it's broken and ask for a fix.
Python 2.6 and Python 3.0: the ability to get the filename and, you know, open the file, e.g. Yes, I know, they fixed that in Python 3.4. By no longer pretending that filenames are strings. And maybe, by now, they have even made it possible to combine them with raw byte sequences. But I'm not really interested in that now: I left the Python camp when they broke strings in Python 3. That was a better choice than waiting until they made it kinda-sorta usable again.
If a press release by the language makers is FUD to you then yeah, that's FUD, I guess. Python 2.6 was much better than Python 3.0 and Python 2.7 was much better than Python 3.3. I later found out that the issues with filenames were fixed in Python 3.4 (kinda-sorta: you still can't work with filenames as well as you can in 2.7, but at least you can support them now… what an achievement), but I suspect even the latest versions of Python 3 are still worse than Python 2.7 (although, true, I don't use it, thus I wouldn't know).
Posted Aug 3, 2021 23:54 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link] (3 responses)
This problem was fixed in 2014 with PEP 461. Complaining about something that was fixed many years ago is indeed FUD.
Posted Aug 4, 2021 6:28 UTC (Wed)
by khim (subscriber, #9252)
[Link] (2 responses)
Not if we are discussing history. The switch to Python 3 was a barely mitigated disaster: it replaced a working string model with a broken one and added enough incompatibilities that it took 10 years. Granted, all three “big P” languages (Perl, PHP and Python) tried to do that and only Python managed to break developers' expectations yet hold the mindshare, but I wonder what would have happened in an alternate history where at least one camp had tried to evolve the language without huge breakages.
Posted Aug 4, 2021 19:08 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
Meh. So what? Python 2 lost support well over a year ago. As far as I'm concerned, this is all ancient history at this point. But every time anyone mentions Python, in any capacity, on LWN (or Hacker News, for that matter), the comments *always* turn into a ridiculous flame war over it.
Look. I get it. The flag day was painful, arguably unnecessary, etc. But what's done is done, and it's clearly not going to happen again. So maybe we can all just take a breath and find something more productive to talk about?
Posted Aug 4, 2021 19:39 UTC (Wed)
by khim (subscriber, #9252)
[Link]
Yup. Which means now it's time to think about what to do with that mess. Look here, for example. Distributions are starting to remove Python 2, which means that the simple solution which worked for years (ignore Python 3's existence and use Python 2) no longer works. Enterprise guys are starting to switch. Some of them. But that saga is still far from over. When will RHEL 8 go out of support? 2029? Well, I guess by 2030 or so we may declare Python 2 dead and buried then.
Sorry, but if that's all “ancient history at this point” then why do you even leave comments under an article which proves that it's most definitely not ancient history? Look at its title, for chrissake! You may try to go away and pretend that it doesn't exist, but it's stupid to pretend that all that pain happened years ago when commenting on something that shows the story is still ongoing.
Sure. If that had been what mb said then I would have stopped the discussion. But that's not what happened.
It's a bit like the pain of switching from Windows Classic to Windows XP or from Mac Classic to MacOS X. You may say that what's “done is done” (and that would be true!) but that still doesn't change the fact that it was painful and unnecessary.
The computer industry is now a mature industry. Decades-long deprecation cycles are typical. You can't avoid it.
Posted Aug 3, 2021 18:53 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link]
For example, it was initially impossible in Python 3.0 to use byte strings in "%" formatting. It was fixed only in Python 3.5: https://www.python.org/dev/peps/pep-0461/
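For reference, a minimal sketch of what PEP 461 restored, assuming Python 3.5 or later. The header text is only an illustrative value.

```python
# %-interpolation on bytes works again since Python 3.5 (PEP 461).
header = b"Content-Length: %d\r\n" % 42
request_line = b"%s %s HTTP/1.1\r\n" % (b"GET", b"/caf\xe9")

# bytes still has no .format() method; only % is available for byte strings.
bytes_has_format = hasattr(bytes, "format")
```

Note that `%s` on bytes accepts bytes-like operands, not `str`, so text still has to be encoded explicitly before it is interpolated.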
> If your input data has strings embedded in non-Unicode bytes, then extract the strings before decoding the Unicode. That's how it should have been done in the Python 2 script, too.
In many cases what I want is to read the data, parse it a bit and then give it verbatim to the next layer that might make more sense about it. Even if the data can't be properly represented as valid UTF-8.
Filesystems were a great example. Up until Py3.6 the built-in Python filesystem module simply SKIPPED undecodable file names in directory listings. How's that for reliability?
Posted Aug 4, 2021 5:49 UTC (Wed)
by dvdeug (guest, #10998)
[Link] (7 responses)
There's no space where you can't reject data, because there's simply data you can't make heads or tails of. And you can always delegate massaging data into a sane format to a separate function.
> And Unix allows any old crud in a directory entry, _it's a bag of bytes_.
That's broken, because it's not. What's stored in 62696e? If we weren't having this discussion, would you even start to think that was a standard Unix directory with a three-letter name? https://refspecs.linuxfoundation.org/LSB_5.0.0/LSB-Common... says that certain libraries must be installed with runtime names like libcrypt.so.1. At no point does it mention that that's a transliteration into ASCII. (It could be κιβγρψοτ.σξ.1, with ELOT 927; that's an entirely acceptable reading of a bag of bytes.) If it were a bag of bytes, then the standard would be careful not to be implicit about it. In reality, filenames are strings, and anyone taking advantage of its bag of bytes nature is being careless or willfully malicious.
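The "62696e" puzzle can be worked through in a few lines. Python ships no ELOT 927 codec as far as I know, so this sketch swaps in EBCDIC (`cp500`) purely to show that the same bytes read differently under a non-ASCII-compatible encoding. That substitution is mine, not the commenter's.

```python
raw = bytes.fromhex("62696e")

# Read as ASCII, the three bytes spell a familiar directory name.
as_ascii = raw.decode("ascii")

# Read as EBCDIC (cp500, standing in for ELOT 927), the same bytes
# come out as entirely different characters.
as_ebcdic = raw.decode("cp500")
```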
Posted Aug 4, 2021 7:43 UTC (Wed)
by mpr22 (subscriber, #60784)
[Link] (1 responses)
Live and learn, I guess.
Posted Aug 4, 2021 7:44 UTC (Wed)
by mpr22 (subscriber, #60784)
[Link]
Posted Aug 4, 2021 11:11 UTC (Wed)
by HelloWorld (guest, #56129)
[Link] (4 responses)
> At no point does it mention that that's a transliteration into ASCII.
Posted Aug 4, 2021 12:27 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (2 responses)
Aiui, the spec states that NULL terminates the string (so that can't be part of a file name), and "/" separates elements within a path. Apart from those two, any individual file or directory *identifier* (let's call it that) can be any random byte pattern. (I've had systems where control characters were deliberately inserted into file identifiers to protect the files as much as possible from accidental damage.)
While I think utf-8 can contain pretty much the entire sequence of valid bytes, there are order limits so it's not a *random* byte pattern.
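The ordering constraint can be checked directly. A small sketch, where `is_valid_utf8` is a helper name made up for this example:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A lead byte followed by its continuation byte encodes U+00E9.
lead_then_continuation = b"\xc3\xa9"
# The same two bytes in the opposite order are not valid UTF-8.
continuation_then_lead = b"\xa9\xc3"
```

As far as I know, a handful of byte values (0xC0, 0xC1 and 0xF5 through 0xFF) never occur in well-formed UTF-8 at all, so "pretty much" is the right hedge.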
Cheers,
Posted Aug 4, 2021 12:40 UTC (Wed)
by HelloWorld (guest, #56129)
[Link] (1 responses)
> Aiui, the spec states that NULL terminates the string (so that can't be part of a file name), and "/" separates elements within a path. Apart from those two, any individual file or directory *identifier* (let's call it that) can be any random byte pattern.
Posted Aug 4, 2021 21:20 UTC (Wed)
by dvdeug (guest, #10998)
[Link]
It's been discussed; see https://dwheeler.com/essays/fixing-unix-linux-filenames.html . The POSIX requirement is for incredibly limited names only, and it's easy to restrict it greatly at the cost of zero to most users.
Posted Aug 4, 2021 20:58 UTC (Wed)
by dvdeug (guest, #10998)
[Link]
It's not a bag. I skipped this first time around, but a bag is generally a name for a multiset, or at least some other data structure that has no order. A Unix file name is a sequence of bytes.
Again, nobody treats it as a sequence of bytes. Coreutils will make sure not to dump random noise to the screen, but there's no standard C functions to do that that I recall, and I don't know if GNU Libc has something there. It's not sane if everyone just treats them as strings, and anyone wanting to handle them as a sequence of bytes has to write special code and tiptoe around everything.
If your input data might be broken, then of course you can't just decode it with the default parameters, because that will raise exceptions and abort if not caught (and that's a sane default).
You need to tell the program/libraries/language what you want to do.
But if you really want to do such things, just do _not_ decode it and work with bytes (b'\n', etc...).
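A minimal sketch of the bytes-only approach for the diff/patch case, assuming newline-delimited input. `changed_lines` is a hypothetical helper, not a real diff algorithm: it only compares lines at the same position.

```python
def changed_lines(old: bytes, new: bytes):
    """Compare two blobs line by line without ever decoding them."""
    old_lines = old.split(b"\n")
    new_lines = new.split(b"\n")
    return [
        (index, before, after)
        for index, (before, after) in enumerate(zip(old_lines, new_lines))
        if before != after
    ]


# The first line is ISO-8859-1, i.e. not valid UTF-8; it passes through as-is.
old = b"caf\xe9\nline two\n"
new = b"caf\xe9\nline 2\n"
diff = changed_lines(old, new)

# In UTF-8, byte 0x0A only ever means a newline: every byte of a
# multi-byte sequence has its high bit set.
utf8_pieces = "\u00e9\n".encode("utf-8").split(b"\n")
```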
That's where the error handling options of Python's codecs come into play.
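A short sketch of three of the standard error handlers applied to the same undecodable input. The sample file name is made up.

```python
raw = b"report-\xff.txt"  # 0xFF never appears in valid UTF-8

# Default ("strict"): decoding raises.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "replace": lossy, the bad byte becomes U+FFFD.
lossy = raw.decode("utf-8", errors="replace")

# "surrogateescape": the bad byte is smuggled through as a lone
# surrogate and can be turned back into the original byte later.
escaped = raw.decode("utf-8", errors="surrogateescape")
restored = escaped.encode("utf-8", errors="surrogateescape")
```

Only `surrogateescape` keeps the data intact, but the resulting `str` cannot be encoded with the default strict handler, so the choice has to be carried through to wherever the data is written out.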
> You need to tell the program/libraries/language what you want to do.
That's how it should have been done in the Python 2 script, too.
A short example helps the discussion.
> Do you have an example?
> How, pray tell, should I format anything if I'm told not to use % and format doesn't work with real data and % is supposed to be removed?
> This problem was fixed in 2014 with PEP 461. Complaining about something that was fixed many years ago is indeed FUD.
> Python 2 lost support well over a year ago.
People have been using file names in encodings other than UTF-8 (such as ISO-8859-1) for decades, and that can easily result in arbitrary byte sequences. There's nothing in the APIs that prevents you from creating non-UTF-8 file names, and it's easy to create multiple files with names that represent the same sequence of graphemes when interpreted as UTF-8. The simple fact of the matter is: the only sane way to make sense of Unix file names is to treat them as a bag of bytes.
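Both claims can be illustrated without touching a filesystem. A sketch using the standard `unicodedata` module, with an invented file name:

```python
import unicodedata

name = "caf\u00e9.txt"

# Two UTF-8 byte sequences that render as the same graphemes:
# precomposed (NFC) versus base letter plus combining accent (NFD).
nfc_bytes = unicodedata.normalize("NFC", name).encode("utf-8")
nfd_bytes = unicodedata.normalize("NFD", name).encode("utf-8")

# The same name as an ISO-8859-1 user would have stored it.
latin1_bytes = name.encode("iso-8859-1")

try:
    latin1_bytes.decode("utf-8")
    latin1_is_utf8 = True
except UnicodeDecodeError:
    latin1_is_utf8 = False
```

A filesystem that compares names byte for byte would treat `nfc_bytes` and `nfd_bytes` as two different files.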
Because most real-world encodings (UTF-8, ISO-8859-*) are ASCII compatible, so it doesn't matter. And besides: LSB? Nobody cares.
Wol
No, because everybody knows that when file names are specified as strings in such specifications, there's an implicit assumption that the encoding is ASCII. And that works because such specifications generally don't specify file names with non-ASCII characters in them.
Exactly: a file name is a bag of bytes except 0 and 2F. Would another format be better? Perhaps so, but it's too late to enforce that at this point.
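The rule as stated fits in one expression. A sketch, where `is_valid_component` is a made-up name; it deliberately ignores length limits and the special entries "." and "..", which real systems also constrain.

```python
def is_valid_component(name: bytes) -> bool:
    """Apply the 'any bytes except 0x00 and 0x2F' rule to one path component."""
    return len(name) > 0 and b"\x00" not in name and b"/" not in name


# Control characters and non-UTF-8 bytes are all acceptable.
odd_but_valid = is_valid_component(b"caf\xe9\x07\n")
has_separator = is_valid_component(b"a/b")
has_nul = is_valid_component(b"a\x00b")
```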