Python 2->3 transition was horrifically bad
Posted Jan 23, 2021 2:28 UTC (Sat) by david.a.wheeler (subscriber, #72896)
In reply to: This is 2021: what’s coming in free/libre software (Libre Arts) by pgdx
Parent article: This is 2021: what’s coming in free/libre software (Libre Arts)
> I've done my fair share of porting stuff from Python 2 to Python 3, and it's certainly work, but in the end it was always worth it.
I like Python (both versions). And I've ported 2->3.
But no, it was NOT worth it. There's nothing in Python3 that justified the horrific transition process. There are no new capabilities in Python3 that justified their $100billion+ tax on the industry, and there was no reason the transition had to be that bad. Python2 was perfectly capable of supporting Unicode, so "Python3 was necessary because it supports Unicode" is not a valid argument.
> Python 3 was released more than a decade ago and it was always clear that (a) you needed to upgrade, (b) the longer you pretend like you're not, the harder it will be, and (c) you can write code that works reasonably well on Py 2 and 3.
> Can we stop the whining now?
No, because there are too many people still actively denying that "the Python 2->3 transition was badly botched because it made upgrading far more difficult than necessary".
What *should* have happened was a set of import __future__ statements so that individual files and libraries could be *gradually* transitioned from Python2 to Python3. That happened with previous versions of Python. A similar kind of thing happens in almost all other programming languages: it is *vital* to allow incremental transitions.
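As a point of reference, the per-file opt-in mechanism that already exists can be inspected from Python itself. The sketch below only lists the four Python 3 behaviors that a 2.x file could enable one at a time; the `feature_releases` helper is written for this example, and it makes no claim about what a fuller set of such switches would have looked like.

```python
import __future__

# The four 3.x behaviours that a single 2.x file could opt into.
FEATURES = ("absolute_import", "division", "print_function", "unicode_literals")


def feature_releases(names=FEATURES):
    """Map each feature name to (first optional release, mandatory release)."""
    table = {}
    for name in names:
        feature = getattr(__future__, name)
        optional = feature.getOptionalRelease()[:2]
        mandatory = feature.getMandatoryRelease()[:2]
        table[name] = (optional, mandatory)
    return table


for name, (optional, mandatory) in feature_releases().items():
    print("%-17s optional since %d.%d, mandatory in %d.%d"
          % ((name,) + optional + mandatory))
```

Each of these is enabled per source file with a `from __future__ import ...` line at the top of that file, which is what makes a file-by-file migration possible for the behaviors they cover.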
Instead, the developers of Python confused "Python the implementation" with "Python the language", so they failed to provide an implementation supporting both. This made switching from 2->3 an all-or-nothing situation (ALL your transitive dependencies had to simultaneously convert AND all your files had to convert as well). Normally transitioning involves small incremental costs that can be spread over years. In this case, transition became far more costly & risky than it needed to be. What's especially surprising is that the developers of Python3 are smart & experienced; this is the kind of fundamental mistake that you wouldn't expect such a seasoned team to make.
The only other example I can think of in history is "Visual Fred". Microsoft's Visual Basic was once one of the most popular programming languages, but "Visual Basic .NET" (aka "Visual Fred") was grossly incompatible with the "normal" Visual Basic. Many years later about 1/3 switched to the new Visual Basic, 1/3 kept to the old one, and 1/3 had abandoned Visual Basic entirely (if you had to rewrite your software anyway, why use the language from the people you just got burned by?). Visual Basic lost its standing, never to be regained.
Python3 is very lucky that it could ride the machine learning wave. It's unlikely to be so lucky if it makes transitions so hard again.
Everyone makes mistakes, even very knowledgeable people like the Python3 developers. But we need to *learn* from mistakes, not *deny* them.
Posted Jan 23, 2021 21:10 UTC (Sat)
by khim (subscriber, #9252)
[Link] (12 responses)
Actually it's back in the top 10 now. But if you look at the fine print… you'll see that it went basically to zero… and then slowly regained mindshare.
It would be interesting to see how Python would fare.
P.S. Why am I not considering the Python transition “done”? Because the hard part has only just started: now that the people who ignored the question of “should I switch to Python3… or to some other language entirely” for as long as possible are pushed into a corner… the “fun part” begins.
Posted Jan 24, 2021 21:20 UTC (Sun)
by linuxrocks123 (subscriber, #34648)
[Link] (11 responses)
You think you can make me? Why would you think that? What, you're not maintaining the code anymore? So? Does anyone maintain ed? Does it matter?
I'll be able to pull in a Python 2.7 package for any remotely popular Linux distro for the indefinite future, either from the distro itself or a third party repo. And if I can't for some reason, I can compile it from source.
I'm not following the Python core devs down the road to Stupidville. Ever. And you. Can't. Make. Me.
Posted Jan 24, 2021 22:43 UTC (Sun)
by pizza (subscriber, #46)
[Link] (8 responses)
"pip 21.0, in January 2021, will remove Python support"
I hope you're maintaining your own private mirror of the python package archive.
> I'm not following the Python core devs down the road to Stupidville. Ever. And you. Can't. Make. Me.
Well, good for you. After all, there's still folks making a living selling riding crops and buggy whips.
Posted Jan 24, 2021 23:36 UTC (Sun)
by pizza (subscriber, #46)
[Link] (7 responses)
Gah. That should be:
"Note: pip 21.0, in January 2021, will remove Python 2 support, per pip’s Python 2 support policy. Please migrate to Python 3."
Posted Jan 25, 2021 0:32 UTC (Mon)
by linuxrocks123 (subscriber, #34648)
[Link] (6 responses)
So? I can still use the old version of pip forever, just like I can use Python 2.7 forever.
> I hope you're maintaining your own private mirror of the python package archive.
Do you have reason to believe they're about to delete all the Python 2.7 packages? Serious question, because, if so, you're right: I _DO_ want to do that now before that happens. Not just for me; I'd make something like the Classic Addons Archive, which happened when Mozilla decided to be f*wits and delete all the good API Firefox addons.
Posted Jan 25, 2021 2:47 UTC (Mon)
by pizza (subscriber, #46)
[Link] (5 responses)
Upstream python no longer gives a hoot about python2, and this extends to maintaining the tooling and infrastructure that supports python2.
> So? I can still use the old version of pip forever, just like I can use Python 2.7 forever.
I think you'll find pip rather useless without a compatible package archive to query and download from.
Bottom line: it's only a matter of time before python2-compatible/specific stuff bitrots.
Posted Jan 25, 2021 21:47 UTC (Mon)
by linuxrocks123 (subscriber, #34648)
[Link] (4 responses)
Seriously? It's pretty trivial to mirror a website, man. The only problem would be if pypi.org act like little shits and just delete all the Python 2 packages without warning, and no one knew they had to mirror it. I haven't gotten that vibe from the Python core devs. They don't like the continued relevance of Python 2, but they're also not unethical asshats. I think.
So, again: do you have reason to think that's coming? 'cuz I'll bulk download the Python 2 packages from there this weekend if I have to.
Using a small script. Written in Python 2 :P
But I'd rather not waste my time and their bandwidth if they're not about to pull the rug out like that.
Posted Jan 25, 2021 22:10 UTC (Mon)
by linuxrocks123 (subscriber, #34648)
[Link] (3 responses)
1. Some dudes have "script to mirror PyPi" covered already: https://github.com/pypa/bandersnatch/
Hopefully this means someone else has their data backed up already, and I don't need to do anything.
2. SWEET JEBUS. WHAT ARE THEY HOSTING ON THIS SITE? https://pypi.org/stats/
I might need to buy another HD if I need to slurp 8TB off of them. But: that's obviously not all source code. What are they _DOING_?
Posted Jan 25, 2021 23:19 UTC (Mon)
by linuxrocks123 (subscriber, #34648)
[Link]
Posted Feb 3, 2021 21:11 UTC (Wed)
by nix (subscriber, #2304)
[Link] (1 responses)
Posted Feb 3, 2021 21:41 UTC (Wed)
by excors (subscriber, #95769)
[Link]
In theory, "gzip --rsyncable" might help with that - identical chunks of uncompressed input should give identical chunks of compressed output (at a cost of ~1% increase in compressed file size).
Posted Jan 24, 2021 22:58 UTC (Sun)
by sfeam (subscriber, #2841)
[Link]
Posted Jan 25, 2021 7:55 UTC (Mon)
by HelloWorld (guest, #56129)
[Link]
And why do you think anybody would make you if they could? So you really think anybody cares if you use a bitrotten stack?
Posted Jan 24, 2021 8:40 UTC (Sun)
by NYKevin (subscriber, #129325)
[Link] (21 responses)
1. absolute_import literally takes longer to understand than to fix (mostly because PEP 328 is far too rationale-heavy and doesn't have a quick "here's how the new syntax works" section, so you actually have to read the damn thing). Once you understand what it does, fixing it is a breeze.
So instead of using __future__, everyone ended up using six to write disgusting polyglot code, and doing lots of painful regression testing against both 2.x and 3.x in the process. Yuck.
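To give a flavour of the polyglot style, here is a hand-rolled sketch of the kind of shim that six packages up. The names `text_type`, `binary_type` and `ensure_text` are modelled on six, but the code below does not import it and is only an illustration.

```python
import sys

PY2 = sys.version_info[0] == 2

if PY2:
    text_type = unicode  # noqa: F821 - this name only exists on Python 2
    binary_type = str
else:
    text_type = str
    binary_type = bytes


def ensure_text(value, encoding="utf-8"):
    """Return value as text on both 2.x and 3.x (illustrative helper)."""
    if isinstance(value, binary_type):
        return value.decode(encoding)
    if isinstance(value, text_type):
        return value
    raise TypeError("expected text or bytes, got %r" % type(value))


print(ensure_text(b"abc") == ensure_text(u"abc"))
```

Every boundary where text and bytes meet needs this kind of wrapper, and every wrapper has to be tested under both interpreters.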
Unfortunately, I'm not sure that they had a whole lot of alternatives for how unicode_literals could have worked. You can't have unicode objects be strict in one file and lenient in another, as that's a runtime behavior, not a parsing or lexing difference. There are some really exotic ideas you might try, such as being strict-by-default, but walking the stack and automatically suppressing the exception if the caller doesn't have an active __future__ statement allowing strictness. But that's hard to reason about and hard to understand, much less explain, so I'm not sure it would have been a Good Idea. You could also break back-compat in 2.x's Unicode support, or go lenient in 3.x, but the former would have probably caused even more problems for various people, and the latter seems like it would defeat the whole point of the entire 2->3 exercise.
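The runtime difference can be seen from the 3.x side. This sketch shows the strict behaviour, plus a hypothetical `lenient_concat` helper that imitates what 2.x did implicitly (decode the byte string as ASCII and carry on). The helper is an illustration, not code from either interpreter.

```python
def strict_concat(text, raw):
    """Python 3 semantics: mixing str and bytes is always a TypeError."""
    return text + raw


def lenient_concat(text, raw):
    """Rough imitation of Python 2's implicit ASCII decoding."""
    return text + raw.decode("ascii")


try:
    strict_concat("abc", b"def")
except TypeError as exc:
    print("strict:", exc)

print("lenient:", lenient_concat("abc", b"def"))

try:
    lenient_concat("abc", b"\xc3\xa9")
except UnicodeDecodeError as exc:
    print("lenient, non-ASCII input:", exc.reason)
```

The last case is the data-dependent failure: the lenient version works until non-ASCII input arrives, whereas the strict version fails on every call regardless of the data.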
Posted Jan 24, 2021 23:50 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
Posted Jan 25, 2021 19:39 UTC (Mon)
by NYKevin (subscriber, #129325)
[Link]
If you just blindly apply the __future__ statement without checking, then things will break and you really should not be surprised by that.
Posted Jan 24, 2021 23:51 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
[Link] (18 responses)
Boom. You're done. No need for the false "bytes/unicode" dichotomy.
Look at Go as an example.
Posted Jan 25, 2021 9:30 UTC (Mon)
by jafd (subscriber, #129642)
[Link] (3 responses)
Posted Jan 25, 2021 10:13 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
> Newly defined header fields SHOULD limit their field values to US-ASCII octets. A recipient SHOULD treat other octets in field content (obs-text) as opaque data.
Second, the whole post is misleading. Go makes sure to canonicalize the header keys (via textproto.MIMEHeader) and takes care NOT to touch header values in any way. You can round-trip the values just fine, without any corruption.
Any upper/lower case shenanigans would be done by the HTTP consumers.
And a conformant implementation would also allow the use of MIME encoding (RFC 2047) to pass header values with "dangerous" symbols. So consumers still need to be on the lookout for that.
Posted Jan 25, 2021 20:52 UTC (Mon)
by foom (subscriber, #14868)
[Link] (1 responses)
Of course, accepting a strange transfer-encoding shouldn't be a security issue, because Transfer-Encoding is a hop-by-hop header, so there should be no way to convince a proxy to produce a weird Transfer-Encoding on its outgoing connections, regardless. But, oops, a lot of proxies are buggy and incorrectly pass what the user provided straight through to the outgoing connection.
Posted Jan 25, 2021 21:00 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Jan 25, 2021 20:02 UTC (Mon)
by NYKevin (subscriber, #129325)
[Link] (13 responses)
You can't do that if the language assumes UTF-8 everywhere, because the standard encoding on Windows is UTF-16LE. So now you have a number of options, all of them bad:
1. Provide a shim layer that magically converts UTF-16LE into UTF-8, using the same surrogateescape trick that everyone currently loves to hate.
#1 is bad because it makes it harder to see the actual bytes that the OS is dealing with, which is frustrating for anyone who actually wants to interface directly with the Windows API. Also, everyone and their dog has already complained about surrogateescape, and they would complain about this even louder. #2 is bad because it makes programs that work on Linux not-work on Windows. #3 is bad because it's basically identical to the actual Python 3 solution, except without any type checking, so you can blindly concatenate UTF-8 and UTF-16LE, and get garbage.
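A small sketch of why raw UTF-16LE and untyped byte strings bite, using only the standard codecs. The sample file name is arbitrary.

```python
name = "data.txt"

utf8 = name.encode("utf-8")
utf16 = name.encode("utf-16-le")

# ASCII text in UTF-16LE has a NUL after every character, which
# NUL-terminated byte-string APIs treat as the end of the string.
print(utf16)
print(utf16.split(b"\x00", 1)[0])

# Untyped concatenation of the two encodings raises no error, but the
# result does not decode to the intended text under either encoding.
mixed = utf8 + utf16
print(repr(mixed.decode("utf-8")))
print(ascii(mixed.decode("utf-16-le")))
```

With separate `str` and `bytes` types the concatenation step is where the mistake would be caught, which is the type checking referred to above.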
Posted Jan 25, 2021 20:10 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (8 responses)
In practice, Windows can be supported just fine by translating the Windows encoding into utf-8 at the boundary. The major problem is that Windows allows ANY sequence of 16-bit characters in file names and other APIs, including incorrect surrogate pairs. So these incorrect strings still need to get escaped (via surrogateescape or an equivalent) if you want to have well-formed Unicode strings.
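The escaping mentioned here can be sketched with the `surrogateescape` error handler, shown on a POSIX-style byte file name since that is easy to reproduce on any platform. The byte string is an arbitrary invalid UTF-8 sample.

```python
raw = b"report-\xff.txt"          # 0xFF can never appear in valid UTF-8

text = raw.decode("utf-8", "surrogateescape")
print(ascii(text))                 # the bad byte becomes a lone surrogate

# The lone surrogate encodes back to the original byte, so nothing is lost.
assert text.encode("utf-8", "surrogateescape") == raw

# A strict encode refuses, because the string is not well-formed Unicode.
try:
    text.encode("utf-8")
except UnicodeEncodeError as exc:
    print("strict encode failed:", exc.reason)
```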
This is basically what Python does anyway. And if you're doing escaping then why not just use utf-8 explicitly?
Posted Jan 26, 2021 3:32 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link] (7 responses)
Of course, now that we know that using UTF-8 for all your strings is workable and results in a language (Go) that basically makes sense, they could break back-compat *again*... but I don't think anyone would be OK with that.
[1]: https://googleblog.blogspot.com/2012/02/unicode-over-60-p...
Posted Jan 26, 2021 4:51 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
Posted Jan 26, 2021 19:18 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
>>> 'a' == 'a'[0] == 'a'[0][0] == 'a'[0][0][0]
Which of those two APIs do you propose to break?
Posted Jan 26, 2021 20:05 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Jan 29, 2021 15:29 UTC (Fri)
by mokki (subscriber, #33200)
[Link] (2 responses)
The Java interfaces have survived the backend OS transitions from different charsets to different Unicodes.
Posted Jan 30, 2021 12:20 UTC (Sat)
by HelloWorld (guest, #56129)
[Link] (1 responses)
Posted Jan 30, 2021 15:36 UTC (Sat)
by foom (subscriber, #14868)
[Link]
But java is certainly far from a great example for python to follow. For example, you also cannot access all filenames on unix using the java file api, because it just gives up when an existing filename can't be decoded to unicode (just like Python3.0). Java only recently fixed this with the introduction of new apis in java.nio2.
I think this bug seemed less of an issue in Java programs since they mostly live off in their own world -- and it wasn't a _regression_ in the language either -- the few java programs that needed the ability to interact with the filesystem had already written their own JNI file code.
All these platforms _are_ however evidence of there being not much use in providing a fixed width "unicode codepoint string" type at all, though -- since none of them actually provide it. And given the options of variable width codepoint strings, UTF-8 is clearly a better choice than UTF-16.
Posted Jan 31, 2021 11:47 UTC (Sun)
by anton (subscriber, #25547)
[Link]
As far as programming languages are concerned, Python is the only language that I am aware of that took this particular approach. Pretty much everybody who started before Unicode stuck with 8-bit code units and variable-width code points. Those unlucky souls who defined their languages and APIs in the early 1990s (Java, Javascript, Windows NT) decided on 16-bit code units (which were at that time assumed to be full characters), and then stuck with them instead of inflating them to 32-bits to support Unicode 2.0 code points (which are not necessarily characters); so they, too, recognized that it's easier to deal with variable-width code points than to transition to wider code units. And after Unicode 2.0 (1996) pretty much everybody recognized that, except Python.
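The distinction between code points and code units is easy to make concrete. The following sketch counts both for a few sample strings; the `code_units` helper and the samples are chosen for this example.

```python
def code_units(text):
    """Count code points and code units for one string."""
    return {
        "code points": len(text),
        "utf-8 units": len(text.encode("utf-8")),
        "utf-16 units": len(text.encode("utf-16-le")) // 2,
        "utf-32 units": len(text.encode("utf-32-le")) // 4,
    }


samples = {
    "ASCII letter": "a",
    "e + combining acute": "e\u0301",
    "emoji outside the BMP": "\U0001F600",
}

for label, text in samples.items():
    print(label, code_units(text))
```

The second sample is one perceived character made of two code points, so even a fixed-width code point string does not give one index per character.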
As an example where I experienced the history closely, Forth originally had 8-bit characters. The 1994 Forth standard provided abstractions for various unit widths rather than fixing them like earlier standards; this included character size. There was one proof-of-concept system for Windows that used 16-bit characters, but all other systems (including Windows-only systems) stayed with 8-bit characters, and without explicit support for Unicode.
However, the magic of UTF-8 means that most of the code works unchanged with UTF-8. It took us a while to realize that, but eventually we did. We also thought that we should provide Forth words for the other cases, and wrote a paper about that in 2005. These Forth words were included (after some modifications coming from the standardization discussion) in the next Gforth release in 2008. These words were also proposed for standardization and approved in 2010.
And it's not like Forth was a particular forerunner. C standardized multi-byte functions in 1989 (admittedly for the common Asian encodings of the time; UTF-8 was only invented in 1992; C89 also standardized wchar_t for the upcoming Unicode). The low popularity of wchar_t in Unix programs, and the use of wchar_t in Windows for code units instead of code points should have given the hint to the Python developers that code points are not as important as the Python 3 string representation makes them.
I am sure that people familiar with other languages can contribute how their language deals with Unicode.
Posted Jan 25, 2021 22:13 UTC (Mon)
by foom (subscriber, #14868)
[Link]
That is, python is effectively doing #1 on windows already, except that it's converting to a unicode-codepoint vector. There wouldn't really be much difference in Windows support if the output was an 8-bit utf-8 vector containing lone surrogates, instead of a 32bit unicode codepoint vector containing lone surrogates.
Yes, back in Python2, there was indeed a real problem with Unix/Windows portability. There were two interfaces that programs had to choose between -- you could pass either a 'unicode' string or a 'str' string to all the filesystem APIs, and get the same back out. Almost all software passed 'str' strings, which was perfect on Unix. But on windows, Python called different Win32 APIs (e.g. CreateFileA vs CreateFileW) depending on the type of string you passed. And the "*A" calls used the legacy single-byte windows system codepage, which couldn't access filenames with "weird" characters in them. This was, of course, a big pain for cross-platform python code.
But, this could've been simply fixed within the Python 2.X series, just by having Python always call the wide-string win32 APIs -- and then encoding/decoding to utf8 itself, when given 'str' arguments. Magically, all the code written for Unix would've started working perfectly on windows as well.
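To illustrate both halves of this: the first part shows why a single-byte code page cannot name every file, and the second sketches the proposed fix. `to_wide_api` is a hypothetical name, cp1252 stands in for "the legacy code page", and treating byte strings as UTF-8 is the proposal described above, not what Python 2 actually did.

```python
filename = "\u65e5\u672c\u8a9e.txt"   # Japanese text plus an ASCII extension

# A single-byte code page has no representation for these characters.
try:
    filename.encode("cp1252")
except UnicodeEncodeError:
    print("not representable in cp1252")


def to_wide_api(path):
    """Hypothetical shim: always produce UTF-16LE for the wide-string API.

    Byte-string paths are assumed to be UTF-8, as on a typical Unix system.
    """
    if isinstance(path, bytes):
        path = path.decode("utf-8")
    return path.encode("utf-16-le")


# Text and UTF-8 byte strings end up as the same wide string.
assert to_wide_api(filename) == to_wide_api(filename.encode("utf-8"))
```

A real implementation would also need a policy for byte strings that are not valid UTF-8, which this sketch leaves out.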
Posted Jan 26, 2021 14:16 UTC (Tue)
by mathstuf (subscriber, #69389)
[Link] (2 responses)
Hahahaha. Python 3.8 introduced `os.add_dll_directory()` and started requiring it for DLL loading of Python modules to work because they now only look in:
- the directory of the loaded module
for dependent DLLs (via flags to `LoadLibraryEx`). This completely breaks with Unix-style installation trees because the modules are under `Lib/site-packages` and the DLLs are under `bin`. Right now our build trees are broken because we have to figure out where the external dependency DLLs are and add them to the set of directories (the install tree is just going to have to be a unified one or the package manager (vcpkg or conda) will have to patch it up for us because we want to remain relocatable).
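A minimal sketch of the workaround, assuming a Unix-style tree in which dependency DLLs live in a `bin` directory. The `register_dll_dirs` helper and the example path are made up for illustration; `os.add_dll_directory()` itself exists only on Windows with Python 3.8 or later, so the helper does nothing elsewhere.

```python
import os


def register_dll_dirs(candidates):
    """Add each existing directory to the DLL search path, where supported."""
    handles = []
    add_dir = getattr(os, "add_dll_directory", None)
    if add_dir is None:
        # Not Windows, or a Python older than 3.8: nothing to do.
        return handles
    for path in candidates:
        path = os.path.abspath(path)   # the call requires an absolute path
        if os.path.isdir(path):
            handles.append(add_dir(path))
    return handles


# Hypothetical install layout: <prefix>/bin holds the dependency DLLs.
handles = register_dll_dirs([os.path.join("install-prefix", "bin")])
print(len(handles), "directories registered")
```

Each returned handle has a `close()` method that removes the directory from the search path again. The sketch does not solve the relocatability problem: something still has to know where `bin` is relative to the module.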
Of course, PE could just implement RPATHs, but that seems like something that would have happened already if it was ever going to.
Posted Jan 26, 2021 15:24 UTC (Tue)
by halla (subscriber, #14185)
[Link] (1 responses)
Posted Jan 26, 2021 21:59 UTC (Tue)
by HelloWorld (guest, #56129)
[Link]
Posted Jan 25, 2021 20:34 UTC (Mon)
by foom (subscriber, #14868)
[Link]
But, by now, the "Python 3 transition was bad" discussion is rather boring and repetitive. :) FWIW, Guido has indeed publicly stated that the Python3 transition was a mistake, and he would not do it again. See e.g. https://www.youtube.com/watch?v=Oiw23yfqQy8 [GvR presenting at a PyCascades conference in 2018]
> Visual Basic lost its standing, never to be regained.
2. division is a five-minute fix in the vast majority of reasonable situations. If you were doing something crazy, like writing code that wanted to treat 17 and 17.0 as different values, then you were in trouble and probably had to use something like functools.singledispatch to fix it. So a fifteen-minute fix.
3. print_function is also a five-minute fix, and who uses print in real code to begin with?
4. unicode_literals was *supposed* to be "the big __future__ statement" that you would turn on right before you were ready to transition to 3.x... but it had the severe misfortune of not actually producing identical behavior to 3.x's strings (because 2.x's unicode objects were lenient and 3.x's strings were strict). Consequently, nobody used it, and the rounding error of people who did use it derived very little benefit from it.
5. To add insult to injury, early versions of 3.x forbade the use of the u'' prefix. This was an obviously incorrect decision, and they reversed it in (IIRC) 3.3-ish.
I've once spent several days fixing a problem caused by division. It poisoned a value, making it a float that caused problems way downstream.
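A sketch of how that kind of poisoning can happen. The function names and numbers are invented for illustration.

```python
def midpoint_true_div(lo, hi):
    """Ported naively: '/' is true division in 3.x and returns a float."""
    return (lo + hi) / 2


def midpoint_floor_div(lo, hi):
    """Ported deliberately: '//' keeps the 2.x integer behaviour."""
    return (lo + hi) // 2


items = list(range(10))

mid = midpoint_floor_div(0, 9)
print(items[mid])

bad = midpoint_true_div(0, 9)    # a float that looks harmless here...
try:
    items[bad]                   # ...and only fails when used as an index
except TypeError as exc:
    print("downstream failure:", exc)
```

When the float is passed through several layers before it reaches something that insists on an int, the failure shows up far from the division that caused it.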
Mandate all strings to be UTF-8 by default. Add UTF-8 operations to the core language.
golang itself is doing unicode lowercasing on Transfer-Encoding in net/http/transfer.go. That is unexpected and weird, and it probably should be using ascii-only lowercasing, like canonicalMIMEHeaderKey does.
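The difference between the two kinds of lowercasing is easy to show in Python. `ascii_lower` is a helper written for this example, and the Kelvin sign (U+212A) is used as a well-known character whose Unicode lowercase form is the ASCII letter k. The sketch illustrates the general hazard only and makes no claim about Go's internals.

```python
_ASCII_UPPER_TO_LOWER = {c: c + 32 for c in range(ord("A"), ord("Z") + 1)}


def ascii_lower(value):
    """Lowercase A-Z only; leave every other code point untouched."""
    return value.translate(_ASCII_UPPER_TO_LOWER)


header = "chun\u212aed"          # KELVIN SIGN in place of the letter k

print(header.lower() == "chunked")         # Unicode-aware lowercasing
print(ascii_lower(header) == "chunked")    # ASCII-only lowercasing
print(ascii_lower("Chunked") == "chunked")
```

With Unicode-aware lowercasing the non-ASCII value compares equal to `chunked`, so two components that lowercase differently can disagree about what the header says.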
2. Hand raw UTF-16LE back and let the program barf on the embedded nulls.
3. Provide two interfaces, and let the program choose between them.
Well, it was failing in that even a couple of years ago.
True
>>> b'a' == b'a'[0]
False
>>> b'a'[0][0]
<stdin>:1: SyntaxWarning: 'int' object is not subscriptable; perhaps you missed a comma?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'int' object is not subscriptable
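For code that has to handle both types, slicing is the operation that behaves the same way on `str` and `bytes`, while indexing is not. A short sketch under Python 3:

```python
text = "a"
raw = b"a"

# Indexing: str yields a length-1 str, bytes yields an int.
assert text[0] == "a"
assert raw[0] == 97

# Slicing: both yield an object of the original type.
assert text[0:1] == "a"
assert raw[0:1] == b"a"

# Iterating over bytes therefore produces ints, which bytes() turns
# back into length-1 byte strings.
singles = [bytes([b]) for b in b"abc"]
assert singles == [b"a", b"b", b"c"]
```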
- convert console in/out and filenames to Unicode on the OS layer, with sane guess of OS defaults (and possibility to override)
- make all byte/string conversions take encoding, with again sane guess on good default. Add checkers to warn if using defaults instead of explicit charset
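In Python terms, the second point amounts to always naming the encoding at the bytes/text boundary. A sketch, with arbitrary sample text:

```python
import io

original = "na\u00efve caf\u00e9"
payload = original.encode("utf-8")

# Explicit and correct: the text round-trips.
assert payload.decode("utf-8") == original

# Explicit but wrong: no exception, just mojibake, because every byte
# is a valid Latin-1 character.
garbled = payload.decode("latin-1")
print(ascii(garbled))

# The same rule for streams: name the encoding instead of relying on
# whatever the platform default happens to be.
reader = io.TextIOWrapper(io.BytesIO(payload), encoding="utf-8")
assert reader.read() == original
```

The silent mojibake case is the argument for a checker: a wrong or defaulted encoding often produces no error at all.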
The other other problem is that Python 3.0 came out in 2008, and according to the graph in [1], at that time UTF-8 was only *just* starting to win the format war (in the limited context of web content).
Your own reference shows that UTF-8 was the only Unicode encoding in the web in 2008 (or at any time). UTF-16 or UTF-32 are not even mentioned there. The other formats that had as much popularity in 2008 as UTF-8 were ASCII (i.e., a subset of UTF-8), and the byte-wide Latin encodings.
The default for sys.getfilesystemencodeerrors() on windows is now 'surrogatepass' which means "Oh, that invalid UTF16LE input with unpaired surrogates you got as input? Just pass them through as invalid unicode half-surrogate codepoints on output, it's fine!"
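What that handler does can be reproduced on any platform with the codecs directly. The bytes below are an arbitrary sample: the UTF-16LE form of `[`, an unpaired surrogate (U+DC80), and `]`.

```python
raw = b"[\x00\x80\xdc]\x00"

# Strict decoding rejects the unpaired surrogate...
try:
    raw.decode("utf-16-le")
except UnicodeDecodeError as exc:
    print("strict decode failed:", exc.reason)

# ...while surrogatepass lets it through as a lone surrogate code point.
text = raw.decode("utf-16-le", "surrogatepass")
print(ascii(text))

# The result round-trips, but it is still not encodable as strict UTF-8.
assert text.encode("utf-16-le", "surrogatepass") == raw
try:
    text.encode("utf-8")
except UnicodeEncodeError as exc:
    print("strict encode failed:", exc.reason)
```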
- the directory of the interpreter
- paths added by this new call