
Python data classes

By Jake Edge
November 29, 2017

The reminder that the feature freeze for Python 3.7 is coming up fairly soon (January 29) was met with a flurry of activity on the python-dev mailing list. Numerous Python enhancement proposals (PEPs) were updated or newly proposed; other features or changes have been discussed as well. One of the updated PEPs is proposing a new type of class, a "data class", to be added to the standard library. Data classes would serve much the same purpose as structures or records in other languages and would use the relatively new type annotations feature to support static type checking of the use of the classes.

PEP 557 ("Data Classes") came out of a discussion on the python-ideas mailing list back in May, but its roots go back much further than that. The attrs module, which is aimed at reducing the boilerplate code needed for Python classes, is a major influence on the design of data classes, though it goes much further than the PEP. attrs is not part of the standard library, but is available from the Python Package Index (PyPI); it has been around for a few years and is quite popular with many Python developers. The idea behind both attrs and data classes is to automatically generate many of the "dunder" methods (e.g. __init__(), __repr__()) needed, especially for a class that is largely meant to hold various typed data items.

Python's named tuples are another way to easily create a class with named data items, but they suffer from a number of shortcomings. For one, they are still tuples, so two named tuples with the same set of values will compare as equal even if they have different names for the "fields". In addition, they are always immutable (like tuples) and their values can be accessed by indexing (e.g. nt[2]), which can lead to confusion and bugs.

As the "Rationale" section of the PEP notes, there are various descriptions out there of ways to support data classes in Python, along with people posting questions about how to do that kind of thing. For many, attrs provides what they need (Twisted developer Glyph Lefkowitz championed the module in a blog post in 2016, for example), but it also provides more than some need. Beyond that, discussion in the GitHub repository for the data classes project indicated that attrs moves too quickly and has extra features that make it not suitable for standard library inclusion. Data classes are meant to be a simpler, standard way to have some of that functionality:

Data Classes are not, and are not intended to be, a replacement mechanism for all of the above libraries. But being in the standard library will allow many of the simpler use cases to instead leverage Data Classes. Many of the libraries listed have different feature sets, and will of course continue to exist and prosper.

Eric V. Smith picked up the suggestion of writing a PEP from the python-ideas thread and posted the first version to python-dev back in September. The first set of comments was the inevitable bikeshedding over the name, which continued even after Guido van Rossum asked that it stop. Van Rossum is satisfied with the "data classes" name, though others like "record", "struct", and the like. There were some more technical comments made in that thread, which Smith incorporated into the revision he posted about in late November.

The overall goal is to reduce the boilerplate that needs to be written for a class with typed data fields. To that end, there is an @dataclass decorator (in the dataclasses module) that processes the class definition to find typed fields. It then generates the various dunder methods and attaches them to the class, which is then returned by the decorator. It would look something like the following example from the PEP:

    @dataclass
    class InventoryItem:
	'''Class for keeping track of an item in inventory.'''
	name: str
	unit_price: float
	quantity_on_hand: int = 0

	def total_cost(self) -> float:
	    return self.unit_price * self.quantity_on_hand

The total_cost() method was put into the example to help show that a data class is simply a regular class and can have its own methods, be subclassed, and so on. From the above declaration, InventoryItem would automatically get a properly type-annotated __init__() method, along with a __repr__() that produces a descriptive string and a bunch of comparison operators (e.g. __eq__(), __lt__(), __ge__()). None of those need to be written or maintained by the developer.
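Using the dataclasses module as it eventually shipped in Python 3.7, the generated methods can be exercised like so (the class definition is repeated here so the sketch is self-contained):

```python
from dataclasses import dataclass

@dataclass
class InventoryItem:
    '''Class for keeping track of an item in inventory.'''
    name: str
    unit_price: float
    quantity_on_hand: int = 0

    def total_cost(self) -> float:
        return self.unit_price * self.quantity_on_hand

# The generated __init__() takes the fields in declaration order,
# with quantity_on_hand defaulting to 0.
item = InventoryItem('widget', 2.5, quantity_on_hand=4)

# The generated __repr__() names every field:
print(item)   # InventoryItem(name='widget', unit_price=2.5, quantity_on_hand=4)

# The generated __eq__() compares field-by-field, not by identity:
print(item == InventoryItem('widget', 2.5, 4))   # True

print(item.total_cost())   # 10.0
```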

More fine-grained control over the generated methods is available using parameters passed to the dataclass() decorator. There is a handful of boolean flags that determine whether certain methods are generated (init, repr, eq, compare); the latter two allow only generating equality methods (__eq__() and __ne__()) or generating the full set of comparison methods. These methods all test objects as if they were a tuple of the fields in the order specified in the class definition.
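As a small sketch of the flags (note: in the dataclasses module that ultimately shipped in Python 3.7, the draft's compare flag was renamed order, and eq=True alone generates only the equality methods):

```python
from dataclasses import dataclass

# order=True (the shipped name for the draft's "compare" flag)
# generates the full set of ordering methods as well as __eq__().
@dataclass(order=True)
class Point:
    x: int
    y: int

# Comparisons act as if comparing (x, y) tuples in field order:
assert Point(1, 2) < Point(1, 3)
assert Point(1, 2) == Point(1, 2)
assert not (Point(2, 0) < Point(1, 5))
```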

There was some discussion of how to handle comparisons between objects that have different types. Obviously, comparing unrelated objects should raise an exception (NotImplemented), but for subclasses that don't add any fields, an argument could be made that the comparison should be done. Smith considered using an isinstance() check, but ended up taking the lead from attrs and sticking with strict type checks for all of the comparison operators. This GitHub issue has a bit more discussion, including that attrs is actually only strict for the equality operators—something attrs author Hynek Schlawack called an oversight.

There are two other flags for dataclass() that govern whether the class is "frozen" (emulating immutability by raising an exception when any field is assigned to) and whether a __hash__() method will be generated (thus allowing objects to be used as dictionary keys). The two are somewhat intertwined (and interact with the eq flag as well), so the flag interpretations reflect that:

If eq and frozen are both true, Data Classes will generate a __hash__ method for you. If eq is true and frozen is false, __hash__ will be set to None, marking it unhashable (which it is). If eq is false, __hash__ will be left untouched meaning the __hash__ method of the superclass will be used (if the superclass is object, this means it will fall back to id-based hashing).

Although not recommended, you can force Data Classes to create a __hash__ method with hash=True. This might be the case if your class is logically immutable but can nonetheless be mutated. This is a specialized use case and should be considered carefully.

That all seems a little clunky, but it is likely to be a fairly fringe feature that will not see much use.
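The common case, though, is straightforward: with the dataclasses module as shipped, eq plus frozen yields hashable, effectively immutable objects:

```python
from dataclasses import dataclass, FrozenInstanceError

# eq defaults to True; adding frozen=True gives a generated __hash__().
@dataclass(frozen=True)
class Color:
    r: int
    g: int
    b: int

c = Color(255, 0, 0)

# Assigning to a field of a frozen instance raises FrozenInstanceError.
try:
    c.r = 128
except FrozenInstanceError:
    print('immutable')

# Being hashable, frozen instances work as dictionary keys.
palette = {Color(255, 0, 0): 'red'}
print(palette[Color(255, 0, 0)])   # red
```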

Fields can be specified using the type annotation syntax (as in the example above), but more control is available using the field() function. That allows fields to be removed from the generated methods using the init, repr, compare, and hash flags. It also provides a way to set the default value, since using field() precludes the usual way to set a default, as an example in the PEP shows:

    @dataclass
    class C:
	x: int
	y: int = field(repr=False)
	z: int = field(repr=False, default=10)
	t: int = 20

Beyond that, there can be a default_factory passed to create new empty objects (e.g. dict, list) for the field, since using [] or {} directly would result in all objects sharing the same list or dictionary. There is also a metadata parameter that can set some user-specific metadata on the Field objects that are created for each field in a data class (and can be retrieved using the fields() function in dataclasses).
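A short sketch of both parameters, using the module as shipped:

```python
from dataclasses import dataclass, field, fields

@dataclass
class Bag:
    # default_factory gives each instance its own fresh list; a bare
    # [] default would be shared across instances (and is rejected).
    items: list = field(default_factory=list)
    # metadata is stored on the Field object; dataclasses itself
    # never looks at it.
    tag: str = field(default='', metadata={'doc': 'free-form label'})

a, b = Bag(), Bag()
a.items.append(1)
assert b.items == []        # the lists are not shared

# fields() returns the Field objects, metadata included.
meta = {f.name: f.metadata for f in fields(Bag)}
assert meta['tag']['doc'] == 'free-form label'
```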

There are some other module-level helper functions, such as asdict() and astuple() to convert a data class to a dict or tuple; isdataclass() allows checking to see if an object is a data class instance. There is more to the data class specification, but the summary above hits most of the high notes.
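The helpers are simple to use in practice (the draft's isdataclass() shipped under the name is_dataclass()):

```python
from dataclasses import dataclass, asdict, astuple, is_dataclass

@dataclass
class Point:
    x: int
    y: int

p = Point(1, 2)
assert asdict(p) == {'x': 1, 'y': 2}
assert astuple(p) == (1, 2)

# is_dataclass() is true for both instances and the class itself.
assert is_dataclass(p) and is_dataclass(Point)
```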

So far, there have been few real objections to the idea. Given that Van Rossum has been actively participating in the threads (and suggested writing a PEP), it would seem highly likely that he will accept the PEP for 3.7. There is working code in the GitHub repository, so there should be little that stands in its way.

The process followed here is an excellent example of how Python development works. Something was posted to python-ideas that was not particularly "Pythonic", it was discussed and a path forward was identified, a PEP was written and has been reviewed by many, changes were made, and we are on the cusp of seeing it in a release. All of that took roughly half a year, though much of the groundwork was laid some time ago. Clearly not all features have such a smooth path—or even any path—into Python, but ideas whose time has come can be adopted fairly rapidly.


Index entries for this article
Python
Python Enhancement Proposals (PEP)/PEP 557



Python data classes

Posted Nov 30, 2017 3:35 UTC (Thu) by flewellyn (subscriber, #5047) [Link] (3 responses)

This is very cool. I wonder if the attrs library will incorporate this, so as to avoid duplication?

Python data classes

Posted Nov 30, 2017 12:29 UTC (Thu) by craniumslows (guest, #114021) [Link]

I would expect the internal class methods to get updated to use this new library while leaving the publicly used methods the same. This would give a seamless user experience while simplifying the code base. That's how I would interpret it anyway. It's pretty neat either way.

Python data classes

Posted Nov 30, 2017 15:13 UTC (Thu) by hynek (subscriber, #55397) [Link] (1 responses)

If by “this” you mean type annotations, I have good news: http://www.attrs.org/en/stable/examples.html#types

Python data classes

Posted Dec 2, 2017 10:36 UTC (Sat) by flewellyn (subscriber, #5047) [Link]

Very cool! But I actually meant "Dataclasses in general" by "this".

Python data classes

Posted Nov 30, 2017 16:34 UTC (Thu) by mgedmin (subscriber, #34497) [Link]

> Obviously, comparing unrelated objects should raise an exception (NotImplemented)

Pedantry: NotImplemented is not an exception. It's a singleton you return from a special comparison method (__eq__, __lt__ and friends) to indicate that you don't know how to compare self with the other object, which will lead to Python calling the appropriate comparison method on the other operand. For example, if you try a < b and a.__lt__(b) returns NotImplemented, Python will call b.__gt__(a). If both return NotImplemented, Python will raise a TypeError.

There is an exception with a very similar name (NotImplementedError), which sometimes leads to confusion.
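The protocol mgedmin describes can be sketched in a few lines (A here is a made-up example class):

```python
class A:
    def __init__(self, v):
        self.v = v

    def __lt__(self, other):
        if not isinstance(other, A):
            # Not an exception: a signal value telling Python to try
            # the reflected operation on the other operand.
            return NotImplemented
        return self.v < other.v

assert A(1) < A(2)          # __lt__ handles the comparison itself

try:
    A(1) < 'x'              # A.__lt__ and str.__gt__ both decline,
except TypeError:           # so Python raises TypeError
    print('unorderable')
```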

Python data classes

Posted Dec 1, 2017 9:48 UTC (Fri) by hynek (subscriber, #55397) [Link]

> That all seems a little clunky, but it is likely to be a fairly fringe feature that will not see much use.

In case you’re interested about the whole dance around hashing and frozen, it so happened that I’ve written down the involved complexities a few weeks ago, illustrated with practical examples: https://hynek.me/articles/hashes-and-equality/

Python data classes

Posted Dec 3, 2017 20:53 UTC (Sun) by mb (subscriber, #50428) [Link]

Nice feature. Nice article.
I like both very much.

is Python obsolete?

Posted Dec 8, 2017 21:11 UTC (Fri) by HelloWorld (guest, #56129) [Link] (2 responses)

It strikes me that virtually all Python language improvements discussed on LWN are about problems that other languages solved years ago. "data classes" are called "case classes" in Scala and have been around for ages. Same thing for type annotations that were introduced to Python a while back – Scala has had them forever (and they're actually checked by the implementation, believe it or not!). Currently there seems to be a discussion about who should see deprecation warnings – it seems to me that that question simply goes away when using a compiled language. "Delayed execution", aka lazy evaluation, is another such feature.

And on top of that there's all the other stuff that is still missing and has been for years. Decent parallelism support. Or a sane syntax: why retain the arbitrary distinction between statements and expressions? Why limit lambdas to a single line? Why limit "comprehension syntax" to a fixed number of built-in types (lists, maps and iterators, iirc) when it can easily be generalized to support all sorts of things?

All these things make the language feel pretty clunky and annoying to me at this point. Am I missing something? It seems to me that the only reason people keep using it is either because they don't know better languages or because there's so much Python code already out there…

is Python obsolete?

Posted Dec 9, 2017 14:22 UTC (Sat) by FLHerne (guest, #105373) [Link] (1 responses)

No.

There have been a *lot* of languages, almost every problem will have been solved somewhere. The approach of "absorb all the solutions from other languages" leads to C++ or PHP where you have five ways to do everything, none of the constructs fit together well, and it's impossible to have a clean syntax because all the symbols are taken and the new things don't fit pre-existing patterns.

Note -- this proposal doesn't need *any* new syntax, it's just a new standard library module. You could import it in any older version of python3 at least, and I'm sure there'll be such a backported version available. It's not as if Python coders have been missing this functionality, we've just been using 'attrs' and other third-party libs that do the same thing.

--

Parallelism is, as you say, the major Achilles' heel of Python currently. The GIL does make code a lot easier to write and reason about, but it's a nuisance if you really need to run actual Python statements on multiple cores. That's not actually as common a problem as you might think, because both of those requirements need to exist:

- For UIs, network-handling code etc., you need asynchronous code but don't really care if it runs interleaved on the hardware. That's easy now with the async/await statements.
- For maths/scientific/machine-learning code, it would be daft to do the heavy calculations in any interpreted language. There's a huge range of lower-level libraries that can do that (often on the GPU) with good Python bindings, and they drop the GIL while doing that work and/or use parallelism internally, so such things do scale fine across cores.
- For web servers, it's nice to use Python for everything; no cheating and calling out to libraries written in other languages. In that case, however, you're handling hundreds of unrelated requests, so each one can have its own process and GIL and again it scales fine across cores.

--

On your specific questions:
> why retain the arbitrary distinction between statements and expressions

- It keeps the parser simple, and helps avoid people writing inscrutable code. Statements that would make sense as expressions already have such a version (comprehensions, lambda).

> Why limit lambdas to a single line?

- Because a lambda is just a function definition, but one that's an expression so it can be written inline. Who wants multi-line functions defined inline? Particularly, who'd want to read that sort of code?

IIRC, Guido's said he wouldn't add 'lambda' at all given another chance.

> Why limit "comprehension syntax" to a fixed number of built-in types (lists, maps and iterators, iirc) when it can easily be generalized to support all sorts of things?

- The generator form already *does* generalise like that. `Zep(foo for foo in bar)` lets you construct arbitrary classes with a comprehension and no intermediate container. Sure, `[foo for foo in bar]` is shorthand for `list(foo for foo in bar)`, but all of () {} [] are used already.
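FLHerne's point can be illustrated with a short sketch (Collector is a made-up stand-in for the hypothetical Zep):

```python
# A generator expression passed as a sole argument needs no extra
# parentheses, so any constructor accepting an iterable works:
squares = set(x * x for x in range(5))
assert squares == {0, 1, 4, 9, 16}

pairs = dict((x, x * x) for x in range(3))
assert pairs == {0: 0, 1: 1, 2: 4}

# An arbitrary user-defined class works the same way, with no
# intermediate list ever being built:
class Collector:
    def __init__(self, iterable):
        self.items = list(iterable)

c = Collector(x + 1 for x in range(3))
assert c.items == [1, 2, 3]
```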

is Python obsolete?

Posted Dec 13, 2017 1:16 UTC (Wed) by HelloWorld (guest, #56129) [Link]

Hi FLHerne,

tl;dr: I stand by my point, Python as a language is obsolete, and the reasons people still use it are purely based on the ecosystem around it.

> The approach of "absorb all the solutions from other languages"
I don't recall suggesting anything like that. I merely observed that I don't see which problems Python as a language has interesting solutions for, while I do see that the problems they're currently solving are consistently problems that other languages have solved years ago, often in better ways.

> Note -- this proposal doesn't need *any* new syntax, it's just a new standard library module.
You say that like it's a good thing, but I'm not convinced it is. It uses a decorator to monkey-patch some boilerplate code into the annotated class. While this use of monkey patching is probably fairly innocuous, I don't think it's a good idea in general. It potentially makes it very hard to figure out where some method came from, and it eludes me why somebody who worries about multi-line lambdas making code hard to read would tolerate it.
In Scala, this kind of thing can be done using macros. Classes can be annotated and have their ASTs transformed with arbitrary code, but the crucial point here is that this happens at compile time, rather than at runtime. So you can't do all the nasty things that monkey patching allows, like patching a class that doesn't belong to you etc., making it a cleaner approach in general.

> It's not as if Python coders have been missing this functionality, we've just been using 'attrs' and other third-party libs that do the same thing.
Well yes, but I would argue that such basic functionality really should have been included in the core language a long time ago, simply because if it's not, many people won't use it. Either because they don't know about it or because they don't want to add that dependency.

> - For UIs, network-handling code etc., you need asynchronous code but don't really care if it runs interleaved on the hardware. That's easy now with the async/await statements.
Async/Await is yet another one of the many problems that Python solved years after other languages and in a less useful way. As far as I can tell, this stuff is simply Monad syntax, except it doesn't generalize to other Monads, making it much less useful. When Scala introduced Futures in 2.10, that was a pure library change, because the language already supported Monads, and you don't really need anything beyond that on a language level. So the fact that Python needed to be extended to support that really only proves my point.

> - For maths/scientific/machine-learning code, it would be daft to do the heavy calculations in any interpreted language. There's a huge range of lower-level libraries that can do that (often on the GPU) with good Python bindings, and they drop the GIL while doing that work and/or use parallelism internally, so such things do scale fine across cores.
If you have to resort to (libraries written in) another language to implement performance-sensitive bits of your program, then why bother with Python in the first place? I suspect that most people who like it do so because they compare it to mainstream languages that they know, like Java or C, and they've never worked with more modern designs like Haskell, Ocaml or Scala.
I'd also like to point out that plenty of attempts were made to improve Python's performance, like Pyston for instance. Like it or not, performance matters, and Python isn't doing well. The reason why Dropbox dropped Pyston is also telling: they simply started switching to Go instead. So I think it's an illusion to think all performance problems can be solved by using some library written in C.

> - For web servers, it's nice to use Python for everything; no cheating and calling out to libs from other libraries. In that case, however, you're handling hundreds of unrelated requests, so each one can have its own process and GIL and again it scales fine across cores.
Well great, but every language can do that, so it's not really an argument for Python at all. I realise it probably wasn't meant to be, but I'd like to hear an argument *for* Python at this point, rather than explanations about why (you think) the problems other people see don't matter.

> - It keeps the parser simple,
Aside from the fact that I can't see how having an additional unnecessary syntactic category (statements) simplifies anything, I don't see why I should care since I don't need to implement it. In fact the whole idea of high-level programming languages is that the effort of implementing them is outweighed by the productivity gains of the users.

> and helps avoid people writing inscrutable code.
I don't buy that, sorry. I really don't think that the way to avoid hard-to-comprehend code is to place arbitrary limitations on the syntax. On the contrary, it will make things more awkward.

> Statements that would make sense as expressions already have such a version (comprehensions, lambda).
Think about this: why do you need for loops when you have comprehensions? Well, you may want to do things in the body of the loop that require statements; if statements didn't exist, you wouldn't really need for loops! Well, except of course that comprehension syntax in Python has another problem: in an expression like `[foo(x) for x in xs]`, the x variable is used before it's introduced, making things hard to follow if the "loop body" is a significantly larger expression than `foo(x)`.
Both of these problems can be fixed, hence Scala only has one syntax for that, and it supports arbitrary Monads, which is why it doesn't need language support for asynchronous programming. My earlier example is written `for (x <- xs) yield foo(x)` in Scala. The fact that Python needs for loops in addition to comprehensions is really just a symptom of the problem I mentioned.

> - Because a lambda is just a function definition, but one that's an expression so it can be written inline. Who wants multi-line functions defined inline? Particularly, who'd want to read that sort of code?
I want that, because I don't see why a multi-line function would be any less useful in an expression context than anywhere else. Scala supports this and I can tell you from years of experience with the language that it doesn't make the code any less intelligible.

> The generator form already *does* generalise like that.
No, it doesn't.

> `Zep(foo for foo in bar)` lets you construct arbitrary classes with a comprehension and no intermediate container.
What you're doing there is merely passing a generator as a constructor parameter, it doesn't change the fact that comprehension syntax will always yield a generator (or list etc.). Which is actually really odd, since most Python syntax *can* be generalized with magic methods like __add__ etc. Here's how it could be done in Python:
http://blog.sigfpe.com/2012/03/overloading-python-list-co...
Alas, the language currently doesn't include that :-(


Copyright © 2017, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds