An ordered set for Python?
Python has lots of different options for mutable data structures, both directly in the language and in the standard library. Lists, dictionaries (or "dicts"), and sets are the foundation, but two of those maintain an order based on how the elements are added, while sets do not. A recent discussion on the Python Discourse forum raised the idea of adding an ordered variant of sets; while it does not look like there is a big push to add the feature, the discussion did show some of what is generally needed to get new things into the language—and could well lead to its inclusion.
By their very nature, Python lists have always been ordered; they can also be indexed like arrays. On the other hand, Python dicts started off as unordered, so that adding two entries to a dict could result in either order when, say, iterating over the keys. Dicts would normally maintain the same order if no additions or deletions were made to them, but it was not guaranteed by the language. That all changed when a new implementation of dicts for Python 3.6 maintained the insertion order as a side-effect of a more memory-efficient algorithm. In Python 3.7, ordered dicts were adopted as part of the Python language, so all implementations have to support that feature.
There is also the longstanding collections.OrderedDict implementation in the standard library; it is optimized for reordering efficiency, which makes it a good choice for least-recently-used (LRU) caches, for example. The standard dict is optimized for mapping operations and insertion speed, so there are (still) good reasons to have both available. But, since the existence of OrderedDict pre-dated the switch to ordered dicts in the language, to some it seems like it might provide a precedent for an OrderedSet data structure in the standard library as well.
But sets in both math and Python are just containers for some items—objects—with no duplicates. The operations available for sets are what would generally be expected: membership (using "in"), union, intersection, difference, subset, and so on. The order of the elements in a set is effectively random.
>>> { 'foo', 'bar', 'baz' }
{'baz', 'foo', 'bar'}
>>> { 'foo', 'bar', 'baz', 0.2 }
{'baz', 'foo', 0.2, 'bar'}
>>> { 'foo', 'bar', 'baz', 0.2, None }
{0.2, 'baz', 'foo', 'bar', None}
>>> s = { 'foo', 'bar', 'baz', 0.2 }
>>> s.add(None)
>>> s
{0.2, 'foo', 'baz', 'bar', None}
As can be seen, even adding the same elements in the same order can
result in differences in the representation of the set.
OrderedSet
Some developers have use cases for sets that maintain the order that its elements were added in, however. Two Python Package Index (PyPI) modules, ordered_set and orderedset, came up in the discussion as evidence that the feature is useful. They both provide an OrderedSet class that acts like something of a cross between a set and a list; no duplicates are allowed, but, since the order is maintained, indexing can be done. Both started from an OrderedSet recipe by Raymond Hettinger, but modified it with different tradeoffs (one to a simpler Python implementation and the other using Cython for speed).
The idea of adding OrderedSet to the language was raised (this time) back in December by "Pyprohly". They noted that the lack can be worked around using a dict but that it is ugly to have to do so. The message may have escaped much notice due to the holidays, but the topic was revived by Justin Gerber on October 21. He noted the PyPI packages, but thought it was unfortunate that a project would have to add an external dependency simply for an ordered set. The workarounds are suboptimal as well:
Say my ordered set contains 'apples' and 'oranges'. I can doordered_set = dict() ordered_set['apples'] = None ordered_set['oranges'] = Noneordered_set will then act in some nice ways like an ordered set. For example we can do things like for element in ordered_set... and it will behave as expected. One major shortcoming is when we want to printout the ordered set: print(ordered_set) will reveal the ugly dictionary structure. This ordered_set variable also lacks typical set syntax like ordered_set.add('bananas'). I doubt set operations like unions or intersections are easily available either.
He asked what kind of rationale would be required in order to add
OrderedSet to collections. But, "Laurie O" wondered
why adding external dependencies was such a problem; there are many
benefits and only one drawback, dependency resolution conflicts, that they
could think of. Steven D'Aprano noted
that external dependencies have more disadvantages than indicated; there
are organizations that cannot easily add them to their projects for legal
or procedural reasons, as well as
people who lack the reliable (and unfiltered) internet
access needed. "Whether due to economics or politics or some other
reason, the ability to just run pip install ... is not a
universal
privilege.
"
In another message, D'Aprano mentioned some of the use cases he sees for ordered sets, but he did recognize that there is a cost to adding the feature. Pyprohly also described their use case. Most of the uses rely on the predictable behavior for iteration or removing (via pop()) elements from the set that would come with an ordered version.
Gerber agreed
with D'Aprano's assessment of the downsides of external dependencies; he
had to submit a request to add a PyPI dependency for his project, for one
thing, but he also ran into some problems building and installing the
Cython-based package. "Suffice it to say I am having to pay a time cost
to include OrderedSet in my project. And the question remains, why
must I pay a cost for OrderedSet but not
OrderedDict?
" He also outlined his particular use case, where
it would be convenient to get things out of the set in the same order they
were put in.
Feature addition
The process for turning an idea, like the one in the discussion, into an
actual feature in the language was
unclear to Gerber; he is new to Python development and wanted to
understand what "the cultural norm/bar is for including something in the
standard library and what sorts of considerations are typically made
".
D'Aprano described
the path for a change of this sort:
- Gather as much community feedback as you can, at a minimum here on Discuss or the Python-Ideas mailing list. (You can also discuss it on places like Reddit, or any other forum you like.)
- If the community seems to get behind the idea, or at least not be strongly opposed, the next step is to ask for a sponsor from the core developers.
- If at least one core dev is willing to act as sponsor, you (or somebody) should write a PEP proposing the feature.
- The PEP then gets sent back here for additional rounds of feedback.
- If there is sufficient community interest in the PEP, the author then asks the Steering Council to accept the PEP.
- If they accept it, then somebody (often the PEP author) will implement it and add it to the std lib.
He did caution that a lack of a volunteer implementer could derail the process even after the council approves it. Gerber said that he would use PEP 372 ("Adding an ordered dictionary to collections") as a model for what would be needed in a PEP for the ordered-set feature. He also reported that the maintainer of the Cython-based OrderedSet PyPI package was in favor of adding the feature to the language, though obviously not the Cython version.
Pyprohly weighed
in again with a summary of the use cases and some additional reasons
why the feature makes sense, including that Java has a similar construct
available. Gerber reported
that the maintainer of the other PyPI package was also in favor of the
feature though it "should probably be different than the implementation
in that package and that it should derive from new default dict
orderedness
".
Paul Moore pointed
out
that there were now two third-party implementations of the feature that
did not want their code added to the standard library but
"who are in favour of someone else writing a stdlib
implementation
". That's completely reasonable, of course, but still
leaves open the question of who will do it:
This feels to me like a borderline case at the moment. No-one is particularly against the idea, but no-one is offering to do the work. I suspect that this will go nowhere unless/until someone writes an implementation.Maybe at this point the next step should be to create a PR [pull request] adding OrderedSet to collections. That might get some core dev attention and maybe a sponsor for a PEP.
That's where things stand at this point, but it has only been two days or so since Moore's response. Given Gerber's interest in seeing this feature get added, and that he would like to get involved with Python development, it seems likely that he will be looking to follow-up on that plan. The lack of an ordered set is perhaps a bit of a language wart, and the feature should not be all that difficult to add (and maintain); perhaps Python 3.12, which is due next October, will come with collections.OrderedSet right next to its much older OrderedDict cousin.
| Index entries for this article | |
|---|---|
| Python | Sets |
