Linux distributions and Python 2

Posted Jun 16, 2018 18:39 UTC (Sat) by epa (subscriber, #39769)
In reply to: Linux distributions and Python 2 by codewiz
Parent article: Linux distributions and Python 2

Does it really matter how the Python 3 implementation represents Unicode strings internally? It could use UTF-8, or UTF-16, or UCS-4, or even some wacky encoding of its own, and in principle you wouldn't notice any difference. I think what you are saying is that there should be an easier implicit conversion between byte strings and Unicode strings, with UTF-8-ish encoding and decoding rules used by default.

Linux distributions and Python 2

Posted Jun 24, 2018 14:40 UTC (Sun) by codewiz (subscriber, #63050) [Link]

Well, it does matter for both performance and memory usage, which are observable properties of a program as much as its output.

Earlier versions of the C++ standard intentionally left some things unspecified to give implementations a certain degree of freedom in when choosing the internal representations of data structures, etc. Turned out it was a terrible idea in practice, because some implementations, including GCC, chose to make std::string copy-on-write with reference counting, some had a small-string optimization which would save an allocation, and others would append the \0 on the fly only when someone invoked c_str(), occasionally causing the string to be reallocated.

This caused portability issues where a valid and idiomatic C++ program which would normally execute in 1 second on Linux could take hours and run out of memory on a different standard-compliant run-time which would copy all the strings. And there was no reasonable way to fix the performance bug without causing unpredictable performance regressions in other valid programs.

What I'm getting to here is that the performance characteristics of strings, dictionaries and lists is part of the contract. Even if left unspecified, a lot of code in the wild will start depending on it. And once you specify the exact behavior and complexity of all the basic operations on a container, this leaves very little room for changing the internal representation in a significant way.

Over time, Python 3 grew some clever string optimizations to mitigate the overhead of converting strings on the I/O boundary. But these optimizations are inherently data dependent. Let's say I want to read a 100MB html file into a string and then send it over a socket. This will take roughly 100MB of RAM in Python 2, while a Python 3 program will jump from 100MB to 400MB after someone inserted a single emoji in the middle of the file.

While dynamic, garbage-collected languages are expected to have less predictable runtime behavior, the idea that user-controlled input can undo an important optimization that doubles or even quadruples memory usage is terrifying.This could even be used as a DoS vector against Python servers running in containers with hard memory limits. For similar reasons, dicts typically employ a cryptographic hash function to prevent DoS attacks based on causing too many collisions, which would trigger worst-case O(n) lookups for every key.

And this is just one of the several issues that a clever internal representation will bring over a simple one. Another tricky one that comes to mind is round-trip of utf-8 and other encodings, including input containing invalid codepoints. I've seen Python backup software which failed to restore an archive due to a filename containing an invalid utf-8 character (that file came from AmigaOS which didn't have utf-8, and I never noticed because all cli tools simply handled the filename without trying to read too much into it :-)

These are the kind of concerns you wish a good language runtime would hide from you, so you don't need to audit your codebase to make sure nobody accidentally used the built-in string class to read arbitrary user input. Go and Rust took the simple and elegant approach of declaring that strings are internally represented the same way they're represented externally. Python 2 was essentially already the same way, so isn't it weird that it was perceived as a defect so serious to justify a 10-year long migration to get it fixed?