|
|
Log in / Subscribe / Register

An ordered set for Python?

By Jake Edge
October 26, 2022

Python has lots of different options for mutable data structures, both directly in the language and in the standard library. Lists, dictionaries (or "dicts"), and sets are the foundation, but two of those maintain an order based on how the elements are added, while sets do not. A recent discussion on the Python Discourse forum raised the idea of adding an ordered variant of sets; while it does not look like there is a big push to add the feature, the discussion did show some of what is generally needed to get new things into the language—and could well lead to its inclusion.

By their very nature, Python lists have always been ordered; they can also be indexed like arrays. On the other hand, Python dicts started off as unordered, so that adding two entries to a dict could result in either order when, say, iterating over the keys. Dicts would normally maintain the same order if no additions or deletions were made to them, but it was not guaranteed by the language. That all changed when a new implementation of dicts for Python 3.6 maintained the insertion order as a side-effect of a more memory-efficient algorithm. In Python 3.7, ordered dicts were adopted as part of the Python language, so all implementations have to support that feature.

There is also the longstanding collections.OrderedDict implementation in the standard library; it is optimized for reordering efficiency, which makes it a good choice for least-recently-used (LRU) caches, for example. The standard dict is optimized for mapping operations and insertion speed, so there are (still) good reasons to have both available. But, since the existence of OrderedDict pre-dated the switch to ordered dicts in the language, to some it seems like it might provide a precedent for an OrderedSet data structure in the standard library as well.

But sets in both math and Python are just containers for some items—objects—with no duplicates. The operations available for sets are what would generally be expected: membership (using "in"), union, intersection, difference, subset, and so on. The order of the elements in a set is effectively random.

    >>> { 'foo', 'bar', 'baz' }
    {'baz', 'foo', 'bar'}
    >>> { 'foo', 'bar', 'baz', 0.2 }
    {'baz', 'foo', 0.2, 'bar'}
    >>> { 'foo', 'bar', 'baz', 0.2, None }
    {0.2, 'baz', 'foo', 'bar', None}

    >>> s = { 'foo', 'bar', 'baz', 0.2 }
    >>> s.add(None)
    >>> s
    {0.2, 'foo', 'baz', 'bar', None}
As can be seen, even adding the same elements in the same order can result in differences in the representation of the set.

OrderedSet

Some developers have use cases for sets that maintain the order that its elements were added in, however. Two Python Package Index (PyPI) modules, ordered_set and orderedset, came up in the discussion as evidence that the feature is useful. They both provide an OrderedSet class that acts like something of a cross between a set and a list; no duplicates are allowed, but, since the order is maintained, indexing can be done. Both started from an OrderedSet recipe by Raymond Hettinger, but modified it with different tradeoffs (one to a simpler Python implementation and the other using Cython for speed).

The idea of adding OrderedSet to the language was raised (this time) back in December by "Pyprohly". They noted that the lack can be worked around using a dict but that it is ugly to have to do so. The message may have escaped much notice due to the holidays, but the topic was revived by Justin Gerber on October 21. He noted the PyPI packages, but thought it was unfortunate that a project would have to add an external dependency simply for an ordered set. The workarounds are suboptimal as well:

Say my ordered set contains 'apples' and 'oranges'. I can do
ordered_set = dict()
ordered_set['apples'] = None
ordered_set['oranges'] = None
ordered_set will then act in some nice ways like an ordered set. For example we can do things like for element in ordered_set... and it will behave as expected. One major shortcoming is when we want to printout the ordered set: print(ordered_set) will reveal the ugly dictionary structure. This ordered_set variable also lacks typical set syntax like ordered_set.add('bananas'). I doubt set operations like unions or intersections are easily available either.

He asked what kind of rationale would be required in order to add OrderedSet to collections. But, "Laurie O" wondered why adding external dependencies was such a problem; there are many benefits and only one drawback, dependency resolution conflicts, that they could think of. Steven D'Aprano noted that external dependencies have more disadvantages than indicated; there are organizations that cannot easily add them to their projects for legal or procedural reasons, as well as people who lack the reliable (and unfiltered) internet access needed. "Whether due to economics or politics or some other reason, the ability to just run pip install ... is not a universal privilege."

In another message, D'Aprano mentioned some of the use cases he sees for ordered sets, but he did recognize that there is a cost to adding the feature. Pyprohly also described their use case. Most of the uses rely on the predictable behavior for iteration or removing (via pop()) elements from the set that would come with an ordered version.

Gerber agreed with D'Aprano's assessment of the downsides of external dependencies; he had to submit a request to add a PyPI dependency for his project, for one thing, but he also ran into some problems building and installing the Cython-based package. "Suffice it to say I am having to pay a time cost to include OrderedSet in my project. And the question remains, why must I pay a cost for OrderedSet but not OrderedDict?" He also outlined his particular use case, where it would be convenient to get things out of the set in the same order they were put in.

Feature addition

The process for turning an idea, like the one in the discussion, into an actual feature in the language was unclear to Gerber; he is new to Python development and wanted to understand what "the cultural norm/bar is for including something in the standard library and what sorts of considerations are typically made". D'Aprano described the path for a change of this sort:

  • Gather as much community feedback as you can, at a minimum here on Discuss or the Python-Ideas mailing list. (You can also discuss it on places like Reddit, or any other forum you like.)
  • If the community seems to get behind the idea, or at least not be strongly opposed, the next step is to ask for a sponsor from the core developers.
  • If at least one core dev is willing to act as sponsor, you (or somebody) should write a PEP proposing the feature.
  • The PEP then gets sent back here for additional rounds of feedback.
  • If there is sufficient community interest in the PEP, the author then asks the Steering Council to accept the PEP.
  • If they accept it, then somebody (often the PEP author) will implement it and add it to the std lib.

He did caution that a lack of a volunteer implementer could derail the process even after the council approves it. Gerber said that he would use PEP 372 ("Adding an ordered dictionary to collections") as a model for what would be needed in a PEP for the ordered-set feature. He also reported that the maintainer of the Cython-based OrderedSet PyPI package was in favor of adding the feature to the language, though obviously not the Cython version.

Pyprohly weighed in again with a summary of the use cases and some additional reasons why the feature makes sense, including that Java has a similar construct available. Gerber reported that the maintainer of the other PyPI package was also in favor of the feature though it "should probably be different than the implementation in that package and that it should derive from new default dict orderedness".

Paul Moore pointed out that there were now two third-party implementations of the feature that did not want their code added to the standard library but "who are in favour of someone else writing a stdlib implementation". That's completely reasonable, of course, but still leaves open the question of who will do it:

This feels to me like a borderline case at the moment. No-one is particularly against the idea, but no-one is offering to do the work. I suspect that this will go nowhere unless/until someone writes an implementation.

Maybe at this point the next step should be to create a PR [pull request] adding OrderedSet to collections. That might get some core dev attention and maybe a sponsor for a PEP.

That's where things stand at this point, but it has only been two days or so since Moore's response. Given Gerber's interest in seeing this feature get added, and that he would like to get involved with Python development, it seems likely that he will be looking to follow-up on that plan. The lack of an ordered set is perhaps a bit of a language wart, and the feature should not be all that difficult to add (and maintain); perhaps Python 3.12, which is due next October, will come with collections.OrderedSet right next to its much older OrderedDict cousin.


Index entries for this article
PythonSets


to post comments

An ordered set for Python?

Posted Oct 26, 2022 19:53 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (23 responses)

Is there a reason that the builtin set type didn't get ordering "for free" at the same time that dict did? Are they using different hash table implementations? Why does CPython have two different hash tables? That seems wasteful to me.

An ordered set for Python?

Posted Oct 26, 2022 20:16 UTC (Wed) by stop50 (subscriber, #154894) [Link] (8 responses)

One of my company internal tools in python uses sets. The order was alway the least important in that. Unique items(for simplifying api calls) or the xor of two sets was more important. the receiving and parsing around 200000 complex objects takes longer than c = a - b and d = b - a (a and b are sets of around 100000 objects where a few hundred of the are only in one of them and i need the uniques of each set seperatly)

An ordered set for Python?

Posted Oct 27, 2022 2:13 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (7 responses)

Sure, but if it costs nothing to add (as we were assured when this was done for dict), then why not make it available to everyone automatically? If you don't care about the order (which you're currently not supposed to), then that wouldn't break your use case.

An ordered set for Python?

Posted Oct 27, 2022 13:08 UTC (Thu) by tialaramex (subscriber, #21167) [Link] (6 responses)

The ordered dict "costs nothing" in the sense that the OrderedDict implementation for Python was cheaper than the existing dict.

Consider three things:

A: The Python dict as it existed in say Python 3.0. Slow, bloated, ordering problems confused beginners

B: OrderedDict. Somewhat faster. Somewhat smaller. No ordering problems

C: Swiss Tables from abseil but somehow as a Python feature. Very fast. Very small. Ordering problems are back.

The situation was that A existed, and now B existed, Python used B as a drop-in replacement for A because that's a no brainer, but then the question was do you document that dict is an ordered dict or do you insist dict might have ordering problems (even though it actually doesn't) in case some volunteer writes C some day and you use that to replace dict again.

Python decided to document that dict is ordered. If you make C some day it will need to be named UnorderedDict or something, which seems reasonable, as beginners are indeed likely to be surprised that it isn't ordered.

Low-level hash tables like Facebook's F14 (for C++) or Rust's HashMap are not ordered, and would indeed be slower if you made them ordered, which is why they aren't.

An ordered set for Python?

Posted Oct 28, 2022 9:50 UTC (Fri) by eru (subscriber, #2753) [Link] (5 responses)

The C++ standard library has both "map" which is ordered, and a separate "unordered_map". The unordered version is said to be a bit faster (have not measured myself), but there is no significant difference for most purposes.

An ordered set for Python?

Posted Oct 28, 2022 10:21 UTC (Fri) by excors (subscriber, #95769) [Link] (3 responses)

In C++ there are far more differences than just the ordering - e.g. std::map is specified to have O(log n) lookups while std::unordered_map is O(1) (average case); map<K,V> requires an implementation of less<K> (a strict total order) while unordered_map<K,V> requires hash<K> (a hashing function); unordered_map iterators can be invalidated on insert while map iterators must remain valid; etc. Essentially map must be implemented as a balanced binary tree and unordered_map must be implemented as a hash table. They're radically different data structures.

An ordered set for Python?

Posted Oct 28, 2022 20:14 UTC (Fri) by eru (subscriber, #2753) [Link]

Thanks for the explanation, learned something! So far in my code they had appeared almost interchangeable because I had been mostly using data like numbers and strings for which the library supplies the ordering and hash.

An ordered set for Python?

Posted Oct 29, 2022 17:04 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

> is specified to have O(log n) lookups

Which in practice is O(1) for any reasonable tree sizes (just as O(1) of amortized complexity in hash maps is not quite O(1)).

The greater _practical_ difference between a hash table and a simple binary tree is the amount of pointer chasing needed for lookups and its effects on cache.

An ordered set for Python?

Posted Nov 1, 2022 12:43 UTC (Tue) by paulj (subscriber, #341) [Link]

In practice, until the day an adversary notices its and sends packets/data/input to spam a single hash bucket, trigger O(n) behaviour and DoS your system. ;)

An ordered set for Python?

Posted Nov 6, 2022 13:01 UTC (Sun) by njs (subscriber, #40338) [Link]

There are two contradictory types of "ordered" here. C++'s std::map is sorted – keys need to have a defined comparison function, and iteration visits the keys from "smallest" to "largest". Python's dict OTOH is insertion-ordered: the keys don't need to have any associated comparison function, and new keys always get appended at the end of the order. So they're actually very different beasts with very different use cases, despite the similar-sounding terminology.

An ordered set for Python?

Posted Oct 26, 2022 20:56 UTC (Wed) by dskoll (subscriber, #1630) [Link] (8 responses)

Hash tables don't give you ordering "for free". In fact, not only is the order you get random-looking, it can change from one run to the next. That's because most modern hash table implementations use a pseudo-random seed to change the hash function from run to run. This is used to defend against algorithmic complexity attacks, whereby specially-crafted data can turn the expected O(1) speed of a hash lookup into O(N).

An ordered set for Python?

Posted Oct 26, 2022 21:29 UTC (Wed) by osandov (subscriber, #97963) [Link] (6 responses)

Hash tables in general aren't ordered, of course, but Python's dict type remembers the insertion order since Python 3.7 (well, since Python 3.6, but that was a CPython implementation detail; since Python 3.7, it's part of the language spec).

The original question was: since Python 3.6 changed the dict hash table implementation to be ordered, why didn't sets become ordered, as well? It appears that sets have a completely separate hash table implementation from dicts, but I don't know the reason.

An ordered set for Python?

Posted Oct 26, 2022 21:37 UTC (Wed) by dskoll (subscriber, #1630) [Link]

I guess sets and dicts are two different things. Maybe one of them was originally implemented as some sort of tree rather than a hash table? Dunno.

An ordered set for Python?

Posted Oct 27, 2022 12:28 UTC (Thu) by mgedmin (guest, #34497) [Link] (4 responses)

Since Python 3.6 a dict is internally represented by a variable-sized hash table and a vector of (key, value) pairs. The hash table maps hash(key) -> index in the vector, and it's done that way for performance reasons, to keep the hash table small (the indexes can be 8-bit or 16-bit if there are few keys, most of the hash table slots are unused and it's cheaper to have a single unused byte than two null pointers per slot).

Since Python 2.5(iirc?) a set is internally represented by a hash table that maps hash(key) -> key. It was also optimized for efficiency, compared to a previous implementation based on dicts, since sets don't need to store values, and also because incrementing/decrementing refcounts for the None singleton when adding/removing set items was wasteful.

Neither hash ordering nor dict ordering were the goal of the dict implementation rewrite that gave us dict ordering as a free feature. Sets were not affected because they used a different implementation already, the waste of empty hash slots was half that of dicts, and ordering was not seen as a requirement.

Dict ordering graduated to an actual feature rather than a side effect because it was found to be very very very useful. I guess nobody felt strong enough about set ordering to actually suggest an implementation.

An ordered set for Python?

Posted Oct 31, 2022 14:26 UTC (Mon) by 0xdky (guest, #141261) [Link] (2 responses)

> internally represented by a variable-sized hash table and a vector of (key, value) pairs

Does this imply that deleting a key somewhere in middle will trigger remapping all hash entries below it in ordered vector as map maintains key to index?

An ordered set for Python?

Posted Nov 2, 2022 9:22 UTC (Wed) by mgedmin (guest, #34497) [Link] (1 responses)

I think they use tombstones in the vector when deleting keys? I read the blog post describing the data structure back when it was introduced, but don't remember all the details.

An ordered set for Python?

Posted Nov 6, 2022 13:08 UTC (Sun) by njs (subscriber, #40338) [Link]

Yeah, there's a tombstone, and then when you rehash you have to loop through all the live entries anyway, so it's ~free to compact out the tombstones at the same time. You just need to add another trigger for rehashing: table fill ratio gets too high/low *or* tombstone ratio gets too high.

An ordered set for Python?

Posted Nov 1, 2022 16:24 UTC (Tue) by Wol (subscriber, #4433) [Link]

> Dict ordering graduated to an actual feature rather than a side effect because it was found to be very very very useful. I guess nobody felt strong enough about set ordering to actually suggest an implementation.

An ordered set is not a set, it's a unique list. I rail about Relational not understanding what lists are, but at the end of the day, lists and sets are different things. Where a dict fits, I don't know.

But if you want an ordered set, call it what it is namely a unique ordered list, and an efficient implementation may well become obvious. Just don't muddy the waters by calling it a set ... :-)

Cheers,
Wol

An ordered set for Python?

Posted Oct 26, 2022 22:14 UTC (Wed) by Wol (subscriber, #4433) [Link]

And I'm used to linear hashing where, as the hash table grows, the modulo varies dynamically with the table load, so while the order remains approximately constant every item added or deleted can trigger a partial reshuffle.

And unless you ask for a sorted list, every scan of the table just works its way down the buckets from 0 ...

Cheers,
Wol

An ordered set for Python?

Posted Oct 27, 2022 7:20 UTC (Thu) by ehiggs (subscriber, #90713) [Link] (3 responses)

This has caused me some trouble when coding using sets since I think of sets as dicts without values. Like how BTreeSet is implemented in Rust: BTreeMap<K, ()> under the hood. I'm really happy for this proposal!

An ordered set for Python?

Posted Oct 27, 2022 9:14 UTC (Thu) by NRArnot (subscriber, #3033) [Link] (2 responses)

But why not just use an OrderedDict (or ordinary dict since 3.7) and ignore the stored values which would all be None or 1 or something? The only answer I can think of, lies in the realms of efficiency for large sets. In which case the justification would need to be that an OrderedSet was significantly better than an OrderedDict *because* it doesn't store values, only keys/members.

An ordered set for Python?

Posted Oct 27, 2022 14:09 UTC (Thu) by tialaramex (subscriber, #21167) [Link] (1 responses)

As well as the tremendous wasted space for storing None, BTreeSet<K> isn't just a BTreeMap<K,()> that's a technical element of how it is implemented.

The actual APIs are wildly different, thus using OrderedDict is a poor substitute. There are predicates like is_disjoint() and symmetric_difference() which make plenty of sense for a set but not a map.

Now, BTreeSet has actual inherent order because it's a binary tree. numbers.last() is the biggest number in the set and numbers.first() is smallest. So hopefully in the case of BTreeSet ehiggs was not expecting it to have some other order like OrderedDict, 'cos that's not what a binary tree is for. But the general idea is sound.

An ordered set for Python?

Posted Feb 21, 2025 15:23 UTC (Fri) by ehiggs (subscriber, #90713) [Link]

> BTreeSet has actual inherent order because it's a binary tree

It's a common misconception to conflate Binary Trees and BTrees. They are ordered trees - yes. But BTrees have much wider fan out than Binary trees.

An ordered set for Python?

Posted Oct 27, 2022 19:53 UTC (Thu) by togga (subscriber, #53103) [Link]

For robustness and flush out bugs its often more useful to have randomized iterating order over sets. If you need to keep order or be sorted you probably need another datatype.

An ordered set for Python?

Posted Nov 1, 2022 22:18 UTC (Tue) by qys (guest, #156455) [Link] (1 responses)

As someone who does LeetCode with Python, I have always been wondering why there is no TreeSet or TreeMap in Python. They maintain another sort of orderedness, where the elements are ordered by their values instead of the insertion order. I believe they are worth including in the standard library for pedagogical reasons alone: many people use Python as their first programming language, and it would be great to have them think about the performance tradeoff of using tree versus hash without learning how PyPI works.

An ordered set for Python?

Posted Mar 25, 2023 15:25 UTC (Sat) by lordgilman (guest, #60767) [Link]

I think heapq exists in the standard library to fill the niche of "data structure with OK algorithmic complexity that also maintains a sort order."


Copyright © 2022, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds