Linux distributions and Python 2

Posted Jun 11, 2018 19:55 UTC (Mon) by Otus (subscriber, #67685)
In reply to: Linux distributions and Python 2 by Cyberax
Parent article: Linux distributions and Python 2

And yet this year's gcc releases see fit to support -std=c90 et al.

I will be completely unsurprised if 2.7 continues to see use ten years from now. I'd prefer to use it if I knew it would be supported. As is, I'm grudgingly making sure anything new also works under some 3.x.

Linux distributions and Python 2

Posted Jun 13, 2018 7:54 UTC (Wed) by codewiz (subscriber, #63050) [Link] (6 responses)

I'd be surprised if anything shipped as part of current Linux distros still required to be built in C90 mode. It's there just in case, and it's very easy to support for gcc because, as other have noted, the C and C++ languages have never broken compatibility to the level that Python did.

If C98 suddenly switched the meaning of 'char' from being an 8-bit integer to a Unicode codepoint, you can bet that we'd still be using C90 today, officially supported or not.

Linux distributions and Python 2

Posted Jun 13, 2018 8:53 UTC (Wed) by gevaerts (subscriber, #21521) [Link] (5 responses)

As far as I know, a C90 compiler can can have char be a Unicode codepoint just as well as a C98 one.

Linux distributions and Python 2

Posted Jun 13, 2018 10:13 UTC (Wed) by codewiz (subscriber, #63050) [Link] (4 responses)

There's wchar_t, but it's a distinct type from char: http://en.cppreference.com/w/cpp/language/types#Character...

While the standard doesn't even bother saying whether char is signed or unsiged, no sane C compiler would dare switching overnight the representation of C strings. Incrementally switching a large codebase from 8bit charater strings to Unicode would be nearly impossible. Pure insanity! Which is exactly why the transition to Python3 has been going on and on for the past 10 years in spite of the considerable effort the community put into it :-)

Linux distributions and Python 2

Posted Jun 13, 2018 10:55 UTC (Wed) by excors (subscriber, #95769) [Link] (3 responses)

wchar_t is a bit awkward since it can't actually hold Unicode codepoints on Windows.

Win32 did go through an ANSI->Unicode(ish) switch (with Windows NT, I presume), and tried to make it easy for applications to migrate incrementally. If you start with char* and "strings" and strcmp() and MessageBoxA(), you can gradually replace it with TCHAR* and _T("strings") and _tcscmp and MessageBox, and it will compile exactly the same as before. Then you #define UNICODE and suddenly it all gets macroed into WCHAR*/L"strings"/wcscmp/MessageBoxW etc, and you probably get a load of build errors, and you fix some before getting bored and going back to ANSI mode. After enough iterations you might get something that compiles properly with the Unicode API. Then you probably keep the TCHAR/_T/etc macros - it seems standard practice is to use those instead of the Unicode-only versions, even if you know you're never going back to ANSI.

I guess the difference with Python is that Microsoft saw the transition as an unending process, and they knew they would have to support the ANSI APIs forever, so they accepted the ugly macros as the cost of supporting both. Python expected the transition to finish quickly so they could kill off Python 2, so it seems they focused more on designing a nice end state (with arguable success) and less on easing the transition between the two states.

Linux distributions and Python 2

Posted Jun 13, 2018 11:56 UTC (Wed) by codewiz (subscriber, #63050) [Link] (2 responses)

And let's not forget _MBCS, hehe :-)

I guess Microsoft can be excused for the messy Unicode transition. They were among the first to attempt it, and at the time they were still saddled with support for a myriad of encodings from the DOS era. Oouch! :-)

But in 2008 there was no excuse whatsoever for not using the awesome UTF-8 encoding as the internal representation of strings. It was already obvious which way the wind was blowing, and pretty much every dynamic language transitioned to UTF-8, except Python. Actually, PHP 6 also tried moving to UTF-16, but after struggling with performance regressions and compatibility issues for several years, they finally realized it was a very poor decision and ditched the entire release. Then PHP 7 came out with UTF-8 strings.

Later programming languages like Go and Rust could have picked either UTF-8 or UTF-32 without compatibility concerns, and they still went with UTF-8 for simplicity and performance. You lose the O(1) random-access to Unicode codepoints, sure, but how often do you really do that? Whereas having to re-encode all text on the I/O boundary is a major inconvenience and causes all sort of round-trip issues.

Linux distributions and Python 2

Posted Jun 16, 2018 18:39 UTC (Sat) by epa (subscriber, #39769) [Link] (1 responses)

Does it really matter how the Python 3 implementation represents Unicode strings internally? It could use UTF-8, or UTF-16, or UCS-4, or even some wacky encoding of its own, and in principle you wouldn't notice any difference. I think what you are saying is that there should be an easier implicit conversion between byte strings and Unicode strings, with UTF-8-ish encoding and decoding rules used by default.

Linux distributions and Python 2

Posted Jun 24, 2018 14:40 UTC (Sun) by codewiz (subscriber, #63050) [Link]

Well, it does matter for both performance and memory usage, which are observable properties of a program as much as its output.

Earlier versions of the C++ standard intentionally left some things unspecified to give implementations a certain degree of freedom in when choosing the internal representations of data structures, etc. Turned out it was a terrible idea in practice, because some implementations, including GCC, chose to make std::string copy-on-write with reference counting, some had a small-string optimization which would save an allocation, and others would append the \0 on the fly only when someone invoked c_str(), occasionally causing the string to be reallocated.

This caused portability issues where a valid and idiomatic C++ program which would normally execute in 1 second on Linux could take hours and run out of memory on a different standard-compliant run-time which would copy all the strings. And there was no reasonable way to fix the performance bug without causing unpredictable performance regressions in other valid programs.

What I'm getting to here is that the performance characteristics of strings, dictionaries and lists is part of the contract. Even if left unspecified, a lot of code in the wild will start depending on it. And once you specify the exact behavior and complexity of all the basic operations on a container, this leaves very little room for changing the internal representation in a significant way.

Over time, Python 3 grew some clever string optimizations to mitigate the overhead of converting strings on the I/O boundary. But these optimizations are inherently data dependent. Let's say I want to read a 100MB html file into a string and then send it over a socket. This will take roughly 100MB of RAM in Python 2, while a Python 3 program will jump from 100MB to 400MB after someone inserted a single emoji in the middle of the file.

While dynamic, garbage-collected languages are expected to have less predictable runtime behavior, the idea that user-controlled input can undo an important optimization that doubles or even quadruples memory usage is terrifying.This could even be used as a DoS vector against Python servers running in containers with hard memory limits. For similar reasons, dicts typically employ a cryptographic hash function to prevent DoS attacks based on causing too many collisions, which would trigger worst-case O(n) lookups for every key.

And this is just one of the several issues that a clever internal representation will bring over a simple one. Another tricky one that comes to mind is round-trip of utf-8 and other encodings, including input containing invalid codepoints. I've seen Python backup software which failed to restore an archive due to a filename containing an invalid utf-8 character (that file came from AmigaOS which didn't have utf-8, and I never noticed because all cli tools simply handled the filename without trying to read too much into it :-)

These are the kind of concerns you wish a good language runtime would hide from you, so you don't need to audit your codebase to make sure nobody accidentally used the built-in string class to read arbitrary user input. Go and Rust took the simple and elegant approach of declaring that strings are internally represented the same way they're represented externally. Python 2 was essentially already the same way, so isn't it weird that it was perceived as a defect so serious to justify a 10-year long migration to get it fixed?

Linux distributions and Python 2

Posted Jun 14, 2018 8:44 UTC (Thu) by kooky (subscriber, #92468) [Link]

I wouldn't say you get nothing.

I've moved several large Flask web apps from python2 to python3. It has been very satisfying.

We were getting quite a few `unicode` trouble tickets. Fix usually just to run as python3. Changes to code usually very minor and done in a few hours.