Development

Python and to_file()

By Jake Edge
April 6, 2016

Python is used in a wide variety of circumstances, by people coming from different backgrounds and with various needs. A recent thread on the python-ideas mailing list started with a thought about a quick and easy way to write a string to a file, much as is done in some other, specialized languages (e.g. R, MATLAB). It soon expanded in several directions, partly into a philosophical consideration of the role of the language and of how best to accommodate those coming to Python from those other languages.

Nick Eubank kicked off the discussion by noting that he is a social scientist trying to help his colleagues move away from the specialized languages to Python. "One of the behaviors I've found unnecessarily difficult to explain is the 'file.open()/file.close()' idiom (or, alternatively, context managers)." In other settings, saving to a file is a one-step operation, he said, so there is a conceptual hurdle for those new to Python:

I understand there are situations where an open file handle is useful, but it seems a simple `to_file` method on strings (essentially wrapping a context-manager) would be really nice, as it would save users from learning this idiom.
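As a rough sketch of what such a convenience might wrap, the helpers below hide the context manager behind a single call. They are written as free functions rather than string methods, and the names to_file() and from_file() as used here, along with the UTF-8 default, are assumptions made for illustration rather than anything agreed in the thread:

```python
def to_file(text, path, encoding="utf-8"):
    """Write text to path in one call; the file is closed when the with block exits."""
    with open(path, "w", encoding=encoding) as f:
        f.write(text)


def from_file(path, encoding="utf-8"):
    """Read the entire contents of path back as a string."""
    with open(path, encoding=encoding) as f:
        return f.read()
```

A caller would then write to_file(report, "results.txt") and never see the open file object; the cost is that anything beyond "replace the whole file with this string" still requires learning the underlying idiom.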

While there were objections that a single line "write string to named file" operation was an unneeded niche function that would lead to bad habits and slower code, there was also some support for the idea. Andrew Barnert said that he often uses the feature in other languages but, given Python's feature set, he would rarely need it in Python: "But for users migrating to Python from another language, or using Python occasionally while primarily using another language, I can see it being a lot more attractive." On the other hand, Chris Barker was confused by the need for it at all:

and what's wrong with:

    open(a_path, 'w').write(the_string)

short, simple one-liner.

OK, non cPython garbage collection may leave that file open and dangling, but we're talking the quicky scripting data analysis type user -- the script will terminate soon enough.

Barker went on to note that NumPy offers ways to directly read and write arrays to and from files. There is also a one-liner for reading a string back in:

    string = open(path).read()

It does suffer from the same problem Barker mentioned for Python implementations that do not use reference-counted garbage collection, however.

Nick Coghlan chimed in with the idea that Python is used for both scripting and application development, but its tutorials and such typically focus on the application side, where "relying on the GC for external resource cleanup isn't a great habit to get into". Thus that introductory material will show the "deterministic cleanup form" for writing to a file. That means that the user has to expand their mental model:

User model: "I want to save this data to disk at a particular location and be able to read it back later"

By contrast, unpacking the steps in the one-liner:

- open the nominated location for writing (with implied text encoding & error handling)
- write the data to that location

It's that switch from a 1-step process to a 2-step process that breaks flow, rather than the specifics of the wording in the code (Python 3 at least improves the hidden third step in the process by having the implied text encoding typically be UTF-8 rather than ASCII).

Eubank strongly agreed with Coghlan's formulation and added:

Python is beautiful in part because it takes all sorts of idioms that are complicated in other languages and wraps them up in something simple and intuitive ("in" has to be my favorite builtin of all time). This feels like one of those few cases where Python still feels like C, and it's not at all clear to me why that needs to be the case.

In his post, Coghlan also suggested a "radical notion" that would create some kind of save/load function that would automatically use UTF-8 and JSON. That would mean it could be used for objects more complicated than strings and would create files in a well-defined format. He noted that there would be some benefits to that approach:

There'd also be a potential long term security benefit here, as folks are often prone to reaching for pickle to save data structures to disk, which creates an arbitrary code execution security risk when loading them again later. Explicitly favouring the less dangerous JSON as the preferred serialisation format can help nudge people towards safer practices without forcing them to consciously think through the security implications.
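A minimal sketch of what such a save/load pair could look like is shown here. The function names save() and load() are placeholders invented for illustration, not names taken from the thread; only the standard json module and the built-in open() are used:

```python
import json


def save(obj, path):
    """Serialise obj as JSON and write it to path as UTF-8 text."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII characters readable in the file.
        json.dump(obj, f, ensure_ascii=False)


def load(path):
    """Read UTF-8 JSON from path and return the decoded Python object."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Unlike pickle, loading such a file only ever produces dictionaries, lists, strings, numbers, booleans, and None. The trade-off is that the round trip is lossy for some types: tuples come back as lists, for example, and non-string dictionary keys come back as strings.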

Barker suggested that perhaps the language definition could expand to require immediate garbage collection when it is known that the created object is no longer used (as in his one-liner save above). Since the result is never assigned to a variable, the file object created goes immediately out of scope, thus it could be reclaimed. The CPython reference implementation already does this, but as Brett Cannon pointed out, mandating that is not something the core developers are likely to want to do:

It's not desirable to dictate what various Python implementations must make sure that their garbage collector supports direct collection like this for this specific case rather than staying general. It would very much be a shift in the definition of the Python project where we have purposefully avoided dictating how objects need to behave in terms of garbage collection.

There was also discussion of where such a convenience function should live. Eubank's original idea of adding a to_file() method to the string type was not particularly popular. Eubank had noted that the pathlib module has functionality that is similar to what he was asking for, but "can't imagine anyone looking in the Path library" if they just want to write a string to a file. Koos Zevenhoven pointed out that the needed methods (write_text() and read_text()) had only been added in Python 3.5, which was released in September 2015, "so there has not really been time for their adoption". Others noted that there are issues with converting strings to Paths and vice versa, which would open up another can of worms for users who were just looking for a simple way to write their string.
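For reference, the pathlib route looks roughly like the following on Python 3.5 or later. The file name and the use of a temporary directory are assumptions made to keep the example self-contained:

```python
import tempfile
from pathlib import Path

# Hypothetical location for the output file.
target = Path(tempfile.mkdtemp()) / "results.txt"

# Each call opens the file, transfers the text, and closes the file again.
target.write_text("some results\n", encoding="utf-8")
text = target.read_text(encoding="utf-8")

# Converting between the two representations is explicit in both directions.
as_string = str(target)
back_again = Path(as_string)
```

The explicit str() and Path() conversions at the end hint at the friction mentioned above: a user who starts with a plain string file name has to wrap it in a Path before these methods are available.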

Coghlan suggested that the io module might be the right place, but there are still some fundamental issues. For one thing, as Victor Stinner noted, the simple one-liner "solution" does raise a warning (ResourceWarning when the -Wd flag is used) because the state of the file on disk is unknown; it may or may not have been flushed. There is a longstanding bug regarding a portable way to ensure an atomic write to a file, which Stinner also referred to. While a simple wrapper, using a with context manager, could ensure the file gets closed, it cannot ensure the data reaches the disk, which makes it something of an unsafe operation.
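To illustrate the gap between "closed" and "on disk", here is one commonly used pattern, sketched under stated assumptions: write to a temporary file in the same directory, flush and fsync it, then rename it over the target. The helper name write_text_durably() is invented for this example, and this is not presented as the portable solution that the bug report is looking for:

```python
import os
import tempfile


def write_text_durably(text, path, encoding="utf-8"):
    """Write text to a temporary file, sync it, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives next to the target so the rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding=encoding) as f:
            f.write(text)
            f.flush()             # move Python's buffered data to the OS
            os.fsync(f.fileno())  # ask the OS to write its buffers to the device
        os.replace(tmp_path, path)  # swap the new file into place
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise
```

Even this version leaves questions open: it does not sync the containing directory, the new file gets the restrictive permissions that mkstemp() applies, and the guarantees of os.replace() differ between platforms. That is some indication of why a one-line convenience function has trouble promising much about durability.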

So, what seemed a simple idea at the start ballooned into a much more complex question. Part of the problem is that specialized languages can make lots of assumptions about how they will be used, which allows them to "hide more of the complexities from [their] users", as Coghlan put it. But a language like Python can't do that, except in domain-specific modules such as NumPy or Pandas. There is a question of how to choose the right defaults:

However, for a general purpose language, providing convenient "I don't care about the details, just do something sensible" defaults gets trickier, as not only are suitable defaults often domain specific, you have a greater responsibility to help folks figure out when they have ventured into territory where those defaults are no longer appropriate. If you can't figure out a reasonable set of default behaviours, or can't figure out how to nudge people towards alternatives when they start hitting the limits of the default behaviour, then you're often better off ducking the question entirely and getting people to figure it out for themselves.

While there seemed to be general agreement that the io module would be the right place to put a convenience function of this sort, it is not clear if there is enough support to do so. The alternatives are reasonably readable and understandable; even the proper with form is not that hard to grasp. There will be hurdles when moving from a specialized to a general-purpose language—the advantages of the latter should make clearing the hurdles worth it, at least for some. The conversation did provide an interesting look into the thinking process that goes on in Python circles, though.

Brief items

Quotes of the week

I look forward to seeing the UI toolkit written by you in two weeks.

— Dirk Hohndel at ELC 2016, in reply to Linus Torvalds's comment that he would start a fourth software project "when something irritates me and I think 'I can do that in two weeks.'"

Py3 support is like an unemployed cousin we're letting crash on the couch: we're already annoyed that it's here, so it should try not to stack up dirty dishes everywhere.

— Matt Mackall

This is the part where people show you some number and put a "B" next to it. You get to pick any number you want; I picked 25.

— Raj Talluri at ELC 2016, explaining his presentation slide that predicted "the future of IoT" as "25B devices" strong.

Mono Relicensed MIT

At the Mono Project blog, Miguel de Icaza announced that the Mono runtime has been relicensed, moving from a dual-license slate (with LGPLv2 and proprietary options) to the MIT license. The Mono compiler and class libraries were already under the MIT license and will remain so. "Moving the Mono runtime to the MIT license removes barriers to the adoption of C# and .NET in a large number of scenarios, embedded applications, including embedding Mono as a scripting engine in game engines or other applications." De Icaza notes that Xamarin (which was recently acquired by Microsoft) had developed several proprietary Mono modules in recent years; these will also now be released under the MIT license.

Discourse 1.5 released

Version 1.5 of the Discourse open-source discussion-and-commenting system has been released. Significant work went into rewriting the top-level "topics" page, resulting in a five-fold speed increase. Administrators can now change and customize every object label used in the interface. "Want topics to be 'threads'? Users to be 'funkatrons'? Like to be 'brofist'? Well, Discourse is your huckleberry." Support for email comments has also been improved, and user groups can now exchange private messages. The badge system, which is used to denote user roles and to mark popular posts, received a visual refresh and new documentation; user summary pages were also refreshed.

Exim 4.87 Released

Version 4.87 of the Exim mail transfer agent has been released. Several formerly experimental features are now marked as fully supported, including internationalized mail addressing, SOCKS support, REDIS support, and events. There are also many new expansion variables available, and improvements to the regular-expression support in ACLs.

LXC 2.0 released

Version 2.0 of the LXC containerization system has been released. Among the changes are more reliable checkpoint and restore, improved control-group handling, and many bug fixes. Also of note is that LXC 2.0 is designated a long-term support release; backported security updates and bugfixes will be provided for the next five years.

Rkt 1.3.0 released

Version 1.3.0 of the rkt container system has been released. "rkt version 1.3.0 improves handling of errors within app containers, tightens security for rkt’s modular stage1 images, and provides a more compatible handling of volumes when executing Docker container images rather than rkt’s native ACI image format. This release further develops the essential support for rkt as a component of the Kubernetes cluster orchestrator."

Newsletters and articles

Development newsletters from the past week

KDE Presents its Vision for the Future

The KDE project has released a vision statement, a single sentence that sums up what the project would like to achieve: A world in which everyone has control over their digital life and enjoys freedom and privacy. "Our vision unites KDE in common purpose. It sets out where we want to get to, but it provides no guidance on how we should get there. After finalizing our vision (the "what"), we have immediately started the process of defining KDE's Mission Statement (the "how"). As with all things KDE, you are invited to contribute. You can easily add your thoughts on our mission brainstorming wiki page." (Thanks to Paul Wise)

Page editor: Nathan Willis