Reworking StringIO concatenation in Python
Python string objects are immutable, so changing the value of a string requires that a new string object be created with the new value. That is fairly well-understood within the community, but there are some "anti-patterns" that arise; it is pretty common for new users to build up a longer string by repeatedly concatenating to the end of the "same" string. The performance penalty for doing that could be avoided by switching to a type that is geared toward incremental updates, but Python 3 has already optimized the penalty away for regular strings. A recent thread on the python-ideas mailing list explored this topic some.
Paul Sokolovsky posted his lengthy description of a fairly simple idea on March 29. The common anti-pattern of building up a string might look something like:

    buf = ""
    for i in range(50000):
        buf += "foo"
    print(buf)

The StringIO-based form of the same loop would instead be:

    import io

    buf = io.StringIO()
    for i in range(50000):
        buf.write("foo")
    print(buf.getvalue())

To make it easier for existing programs to be switched from one form to the other, he suggested adding a "+=" operator for StringIO as an alias for the write() method.
Adding an __iadd__() method for the StringIO class would allow the write() call to be removed in favor of using +=. The buffer initialization and getvalue() call would still be needed, but those are each typically done in only one place, while the concatenation may be done in multiple places. So a code base could fairly easily be switched from the anti-pattern to more proper Python just by creating the buffer instead of a string and getting its value where needed with getvalue(); the rest of the code can stay the same: "it will leave the rest of code intact, and not obfuscate the original content construction algorithm".
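To make the proposal concrete, here is a minimal sketch of what += on a string buffer could look like when emulated in pure Python. The class name AppendableStringIO and the build() helper are hypothetical and not part of the standard library; the proposal in the thread was to add the operator to io.StringIO itself.

```python
import io


class AppendableStringIO(io.StringIO):
    """Hypothetical subclass that makes += an alias for write()."""

    def __iadd__(self, text):
        self.write(text)
        # Return self so the name stays bound to the same buffer object.
        return self


def build(pieces):
    buf = AppendableStringIO()   # was: buf = ""
    for piece in pieces:
        buf += piece             # unchanged from the str-based version
    return buf.getvalue()        # was: return buf


print(len(build(["foo"] * 50000)))  # 150000
```

Only the first and last lines of build() differ from the string-based version, which is the refactoring property the proposal is after.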
As Sokolovsky noted, however, his performance benchmarking shows that CPython 3 has already optimized for the anti-pattern. So even though it is still considered to be a bad practice, there is no real penalty for writing code of that sort in CPython 3, but that is true only for that implementation of the language.
The optimization, which is described by Paul Ganssle in a blog post, effectively allows CPython to treat the string as mutable in the case where there are no other references to it. In a loop like the one in the example, there is no other reference to the string object being used, so instead of creating a new object and freeing the old, it simply changes the existing object in place. CPython can detect that case because it uses reference counts on its objects for garbage collection; PyPy is not reference-counted, so it cannot use the same trick.
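One rough way to observe the effect of that reference-count check is to time the same loop with and without a second reference to the string. This is only a sketch: the function names are made up for illustration, the timings depend on the interpreter and machine, and other implementations may show no difference at all.

```python
import timeit


def concat_unshared(n):
    buf = ""
    for _ in range(n):
        buf += "foo"   # buf holds the only reference to the string
    return buf


def concat_shared(n):
    buf = ""
    for _ in range(n):
        alias = buf    # a second reference to the same string object
        buf += "foo"   # resizing in place would be visible through alias
    return buf


# Kept small so the slower variant still finishes quickly.
for fn in (concat_unshared, concat_shared):
    seconds = timeit.timeit(lambda: fn(20000), number=3)
    print(f"{fn.__name__}: {seconds:.3f}s")
```

Both functions return the same string; on CPython the second is expected to be slower because each += has to copy the accumulated contents, though the size of the gap is not something this sketch guarantees.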
But Sokolovsky is trying to target the bad practice regardless of the (lack of a) performance impact. The practice is widespread; the optimization added to CPython is evidence that it needs addressing, he said. He suggested that other implementations can either follow the lead of CPython (if possible) or try to promote better practices: "This would require improving ergonomics of existing string buffer object, to make its usage less painful for both writing new code and refactoring existing." And, of course, he was advocating the latter.
He also noted that, since the performance problem does not really exist for CPython, it might be seen as an argument that there is nothing to fix. "This is related to a bigger [question] 'whether a life outside CPython exists', or put more formally, where's the border between Python-the-language and CPython-the-implementation." Beyond that, one could "fix" the problem by creating a new class derived from StringIO that has an __iadd__(), but that suffers from worse performance as well, which argues that the problem should be addressed in C in StringIO itself.
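The performance concern about a derived class can be checked with a small timing harness along the following lines. It is a sketch under assumptions: AppendableStringIO is a hypothetical pure-Python subclass, the loop sizes are arbitrary, and the numbers will vary between interpreters and machines.

```python
import io
import timeit


class AppendableStringIO(io.StringIO):
    """Hypothetical pure-Python subclass, not part of the standard library."""

    def __iadd__(self, text):
        self.write(text)
        return self


def with_write(n):
    buf = io.StringIO()
    for _ in range(n):
        buf.write("foo")   # calls the built-in write() directly
    return buf.getvalue()


def with_iadd(n):
    buf = AppendableStringIO()
    for _ in range(n):
        buf += "foo"       # goes through a Python-level __iadd__ first
    return buf.getvalue()


for fn in (with_write, with_iadd):
    seconds = timeit.timeit(lambda: fn(50000), number=10)
    print(f"{fn.__name__}: {seconds:.3f}s")
```

The subclass adds a Python-level method call to every append, which is the overhead that an implementation inside StringIO itself would avoid.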
The overall reception to the idea was chilly, at best, perhaps partly fueled by Sokolovsky's somewhat aggressive tone in his original note and some of the followups. Andrew Barnert replied that the join() mechanism is really the better alternative.
Barnert said that StringIO is meant to be a file object that resides in memory, so it is appropriate that its API does not support +=. He concluded with a third option for alternative Python implementations beyond the two that Sokolovsky presented.
The problem with the join() mechanism is that it is somewhat non-intuitive, especially for those coming to Python from another language. As Barnert noted, though, it can use more memory as well. Sokolovsky attempted to measure the difference in memory use, but the technique he used was not entirely convincing. His focus would appear to be on embedded Python, such as his Pycopy Python implementation. Pycopy is descended from MicroPython, which he also worked on. For the embedded use case, StringIO may well be the better choice for building strings, at least from a memory perspective; is that enough of a reason to turn a file-like object (StringIO) into a string-like object, but only for concatenation (+=)? The consensus answer would seem to be "no".
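For readers who have not seen it, the join() idiom collects the pieces in a list and concatenates them once at the end. The sketch below shows it next to the StringIO form, with a rough peak-allocation comparison using the standard tracemalloc module. The helper names are made up for illustration, and tracemalloc is just one possible measuring technique (not necessarily the one used in the thread); it only sees allocations made through Python's memory allocators, so the figures are indicative at best.

```python
import io
import tracemalloc


def build_with_join(n):
    parts = []
    for i in range(n):
        parts.append(str(i))   # every piece stays alive in the list
    return "".join(parts)      # single concatenation at the end


def build_with_stringio(n):
    buf = io.StringIO()
    for i in range(n):
        buf.write(str(i))      # pieces are copied into the buffer
    return buf.getvalue()


def peak_bytes(fn, n):
    """Return (result, peak traced allocation in bytes) for fn(n)."""
    tracemalloc.start()
    try:
        result = fn(n)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


for fn in (build_with_join, build_with_stringio):
    result, peak = peak_bytes(fn, 50000)
    print(f"{fn.__name__}: {len(result)} chars, peak {peak} bytes")
```

Both functions produce the same string; how the peak figures compare will depend on the interpreter, its version, and the size of the pieces.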
There was some discussion of having a generalized mutable string type, though that was not at all what Sokolovsky was after; there are some good reasons why that idea has never really taken off for Python, as Christopher Barker described. "So I'd say it hasn't been done because (1) it's a lot of work and (2) it would be a bit of a pain to use, and not gain much at all."
The objections to the original idea are basically that += can be trivially implemented for a derived class of StringIO; if the performance of that is not sufficient, switching to join() would fix that problem. The existing "join() on a list of strings" idiom works well for most people and nearly all use cases; it is the preferred way to solve this problem in Python, so making another idiom more usable is muddying the water to a certain degree. As The Zen of Python puts it: "There should be one-- and preferably only one --obvious way to do it."
On the other hand, CPython is the dominant player in the ecosystem, as Steven D'Aprano pointed out; that means applications can be written to take advantage of CPython quirks. Conversely, even if all of the other Python implementations agreed on a change, it will not really be used unless CPython follows suit. In his words: "But unless CPython does so too, it won't do them much good, because hardly anyone will take advantage of it. When one platform dominates 90% of the ecosystem, one can sensibly write code that depends on that platform's specific optimizations, but going the other way, not so much."
That is something for the CPython community to keep in mind. The existence of the other implementations of the language may provide opportunities to make some changes that are meant to be CPython-only (or at least not mandated for Python the language). But those changes can still get baked into the language via the back door—because most Python code runs on CPython.
In the final analysis, it is a pretty minuscule change being sought. The existence of the string concatenation optimization indicates that there is interest in helping "badly written" code to some extent, but perhaps adding += to StringIO is a bridge too far. There definitely does not seem to be any kind of groundswell of support for the idea and there are costs, beyond just the (minimal) code maintenance required, including in documentation and user education. The benefits, which some find to be dubious to begin with, are seemingly not enough to outweigh them.
