Python and hashing None
The recent discussion of a proposed change to the Python language—the usual fare on the language's Ideas forum—was interesting, somewhat less for the actual feature under discussion than for the other issues raised. The change itself is a minor convenience feature that would provide a reproducible iteration order for certain kinds of sets between separate invocations of the interpreter. That is a pretty limited use case, and one that could perhaps be fulfilled in other ways, but the discussion also highlighted some potentially worrying trends in the way that feature ideas are handled in the Python community.
An idea
On November 16, Yoni Lavi posted
his idea (in a post that is "temporarily hidden
", for unclear
reasons, but is still viewable) that the Python None singleton should
consistently hash to the same value—even on separate runs of the
interpreter. His goal is to be able to reproduce and verify successive runs
of a complex application. Currently, those values are often, but not always,
different:
$ python -c "print(hash(None))"
5928233778750
$ python -c "print(hash(None))"
5880283780414
For Python binaries built with address-space
layout randomization (ASLR), as above, the address of the None
object changes on each interpreter run. Without ASLR, a given CPython
binary will repeatedly produce the same value. For CPython, the "object ID"
(i.e. id())
of an
object, in the general case, is its address, which is used to generate the hash()
value. Those hash values are used for speeding up the lookup of dictionary
entries, but they have some other effects as well.
In particular, as Lavi described, the hash value determines the order of set iteration, so reproducing program behavior from run to run is not possible when None can be a member of the set. Beyond that, other types (e.g. tuple, collections.namedtuple, and frozen dataclasses.dataclass) that might be stored in a set can get "infected" by the non-reproducibility because they contain None fields. For example:
$ python -c "print({ (1, None), (2, 3) })"
{(2, 3), (1, None)}
$ python -c "print({ (1, None), (2, 3) })"
{(1, None), (2, 3)}
Because one of the tuples contains None, the order in the set can
change between runs, but the same is not true if there are only simple
values in the tuples. So it is not directly a problem with
hash(None) that Lavi is having; the problem lies in exactly
reproducing behavior (and, presumably, output) for program validation and
debugging purposes when these somewhat complicated data structures are in
use.
The hashes for some simple objects are reproducible, however. Numbers always hash to their values, True and False are special (1 and 0, respectively), floating-point values have reproducible hashes, and so on. Strings used to have reproducible hashes, but that led to a denial-of-service vulnerability via hash collisions, especially for web frameworks that process lots of untrusted strings. Strings (and bytes) now have random hash values based on a seed that gets set when the interpreter is initialized; those interested in reproducible hashes can set PYTHONHASHSEED, however. But there is no equivalent for a constant hash(None).
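Those rules are easy to check directly; the following sketch shows a few of CPython's documented behaviors for simple types (including the one surprising special case, where -1 is remapped because -1 signals an error in CPython's C-level hash API):

```python
# Integers hash to themselves (modulo the platform's hash width).
assert hash(42) == 42

# True and False hash like the integers 1 and 0.
assert hash(True) == 1 and hash(False) == 0

# The lone exception: -1 is reserved as an error indicator in CPython's
# C API, so hash(-1) is remapped to -2.
assert hash(-1) == -2

# Floats that are numerically equal to an integer hash the same as it,
# so hash() is consistent with == across numeric types.
assert hash(2.0) == hash(2) == 2
```

String hashes, by contrast, vary from run to run unless PYTHONHASHSEED is set to a fixed value in the environment.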
Reaction
Steven D'Aprano cautioned that the order of elements in a set depends on more than just the hash values of its elements; it also depends on the order of operations on the set (some of which can be seen in our Python ordered set article from October). The ordering is also specific to the current CPython set implementation, which could change without warning:
Because sets are documented as being unordered, in theory the iteration order could even change from one call to another: str(a) == str(a) is not guaranteed to be true, even if the set a doesn't change from one call to the next. If it happens to be true right now, that's an accident of the implementation, not a promise made by the language.
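That order-of-operations dependence can be provoked even within a single run. The sketch below relies on a CPython implementation detail—0 and 8 collide in a freshly created (eight-slot) hash table—which is exactly the kind of accident D'Aprano is warning against depending on:

```python
# In current CPython, hash(0) and hash(8) map to the same slot of a
# small set's table, so whichever element is inserted first claims the
# preferred slot and is iterated first.
a = {0, 8}   # inserts 0, then 8
b = {8, 0}   # inserts 8, then 0

assert a == b                # equal as sets...
assert list(a) != list(b)    # ...but iterated in different orders (CPython detail)
```

Two equal sets, built from the same elements in a different order, iterate differently—no ASLR required.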
A fixed value for hash(None) seemed
"harmless enough
" to Paul Moore, though he saw no reason for it
to be an option; "We should either do it or not.
" He was a little
concerned that there might be some security implication from making the change,
since ASLR is done for security purposes.
Lavi did not
see much difference from the hash values of True and
False and Petr Viktorin agreed
that a constant hash(None) should not be a security
problem. In fact, not leaking information about the base address being
used for ASLR "might be a slight win, actually
".
Lavi filed a GitHub issue and a pull request for a trivial patch that simply hardcodes the value of hash(None) to 0xbadcab1e. The issue entry describes the problem at more length:
CPython's builtin set classes, as do all other non-concurrent hash-tables, either open or closed, AFAIK, grant the user a certain stability property. Given a specific sequence of initialization and subsequent mutation (if any), and given specific inputs with certain hash values, if one were to "replay" it, the result set will be in the same observable state every time: not only have the same items (correctness), but also they would be retrieved from the set in the same order when iterated.

This property means that code that starts out with identical data, performs computations and makes decisions based on the results will behave identically between runs. For example, if based on some mathematical properties of the input, we have computed a set of N valid choices, they are given integer scores, then we pick the first choice that has maximal score. If the set guarantees the property described above, we are also guaranteed that the exact same choice will be made every time this code runs, even in case of ties. This is very helpful for reproducibility, especially in complex algorithmic code that makes a lot of combinatorial decisions of that kind.
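The tie-breaking pattern from the quote can be sketched in a few lines (a hypothetical example; the function and names here are invented for illustration, not taken from Lavi's code):

```python
def pick_best(choices, score):
    """Return the first element, in iteration order, with the maximal score."""
    best = None
    best_score = None
    for choice in choices:        # set iteration order decides ties
        s = score(choice)
        if best_score is None or s > best_score:
            # strict '>' means an earlier candidate keeps winning on a tie
            best, best_score = choice, s
    return best

# With a unique maximum, the result is fully determined:
assert pick_best({1, 2, 3}, lambda c: c) == 3

# With a tie, the winner is whichever candidate the set yields first --
# and that depends on the elements' hashes, which is the crux of Lavi's
# complaint when None is involved.
winner = pick_best({(1, None), (2, None)}, lambda c: 0)
```

Within one interpreter run the choice is stable; across runs, a None-bearing candidate set can flip it.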
Creating a set class that does have those properties—and maintains them between
runs—is possible, of course, but there is a substantial performance
penalty to doing so, Lavi said. The reproducibility he describes does not
require
the stricter guarantee that would come with some kind of ordered set
either. Meanwhile, in a large code base using some alternate mechanism,
someone could unknowingly use a
regular set and reintroduce the problem all over again.
While determinism between runs is a "rather niche
" concern,
the change he suggests is minimal and, seemingly, harmless.
But core developer Raymond Hettinger closed
the issue with the explanation that it did not make sense: "The
default hash for every object is its object id. There is nothing special
about None in this regard.
" Back in the thread, Viktorin was a bit
apologetic about encouraging Lavi with this idea, but noted that there
are some potential downsides even to the small change suggested:
Assuming that a set with the same hashes and order of construction [will iterate in the same order] is relying [on] an implementation detail that might change in the future. It doesn't need to be a change in CPython itself – some extra caching or serialization might re-create the object and ruin the determinism, because set order is not important.

Making the None hash constant would lead more people down this path. That's the external cost of the change – definitely not zero.
The thread starts to go off the rails shortly thereafter, with the same
complaints (and rebuttals) being raised multiple times. As is common
when threads go awry, the participants are often talking past each other.
There are certainly other ways to achieve what Lavi is trying to do, but
they have their own costs, both in performance and in
"enforcing a non-standard idiom over an entire codebase
" as he put
it.
From a high-level perspective, wanting determinism between
invocations of the interpreter does not seem completely unreasonable, but
it would rely on sets having guaranteed behavior that the Python core
developers are not willing to grant—even if the hash of None was
already a
constant. D'Aprano summarized
the discussion shortly before the thread was closed for being
argumentative. He suggested turning off ASLR or using classes that
implement their own __hash__()
method as a possible approach, but
warned that "it is still unsafe to assume that set iteration
order will be predictable
"
New threads
A few days after the original thread was closed, Lavi was back on November 27 with a request to reconsider the idea and reopen the pull request. It did not really add anything new to the debate, though he did attach a lengthy document further describing the problem and some of the issues surrounding it. He had been asked to try again (apparently on Discord), so he posted it to both the forum and to the largely moribund python-dev mailing list.
On python-dev, the thread ran much of the same course as the original forum post did, though there were some surprises, perhaps. Python creator Guido van Rossum said that he would be perfectly happy to see sets have the same ordering guarantees as dicts, if there are no performance degradations that result. Furthermore:
"Sets weren't meant to be deterministic" sounds like a remnant of the old philosophy, where we said the same about dicts -- until they became deterministic without slowing down, and then everybody loved it.
Meanwhile, he noted that hashing an empty tuple (hash(())) is
stable over successive runs; the arguments against making
hash(None) stable "feel like rationalizations for
FUD
". He thinks that there is a good argument for making
hash(None) constant across interpreter runs.
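That stability is straightforward to observe by spawning fresh interpreters. In this small demonstration, the helper name is invented here; whether the None-bearing tuple's hash actually varies depends on the interpreter being built with ASLR, so only the stable cases are asserted:

```python
import subprocess
import sys

def hashes(expr, runs=3):
    """Collect hash(expr) printed by several separate interpreter runs."""
    cmd = [sys.executable, "-c", f"print(hash({expr}))"]
    return {subprocess.run(cmd, capture_output=True, text=True).stdout.strip()
            for _ in range(runs)}

# The empty tuple (and tuples of plain integers) hash identically in
# every run, regardless of hash randomization:
assert len(hashes("()")) == 1
assert len(hashes("(1, 2)")) == 1

# A tuple containing None usually does not, when ASLR is in play:
print(hashes("(1, None)"))
```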
Part of the problem in all of the threads is confusion about what Lavi is asking for. He does not require any specific ordering for sets, neither the insertion-order guarantee of dicts nor any other. The determinism he is looking for is already available in the language, even though it is not guaranteed, as long as None (or things containing it) is not used in the set. It is, in fact, hard to see how there would be any benefit to some hypothetical version of Python that did not produce the same order for set iteration if the order of operations constructing the set is the same—unless that order was being deliberately perturbed for some other reason.
Meanwhile, over in the forum thread, Moore immediately responded that he thought Lavi was wasting his own and everyone else's time. Lavi agreed and suggested the topic be deleted; that was not his last suggestion to drop the whole thing that would be ignored by others in the thread. Chris Angelico, who was sympathetic to the idea in the earlier thread, pointed out that there is something of a trend here:
Only because there's a general tendency to yell at ideas until they go away, which isn't exactly a good policy, but it's an effective way of wielding "status quo wins a stalemate" to win arguments. See the previous thread for details, or look at any of myriad other places where an idea was killed the same way.
He is referring to a blog
post from Nick Coghlan entitled "Status quo wins a stalemate" that
D'Aprano and others had mentioned. But, as
Angelico noted, the
key word is "stalemate"; "Not 'status quo wins any debate if we can just
tire out everyone else'.
" But Moore said there
was more to it than that, at least in this case; a core developer had
reviewed the pull request and rejected it. Even though the change is
"small and relatively [innocuous]
", it was not important enough
"to warrant overturning another core dev's decision
" in his opinion.
Angelico further lamented
the treatment of new ideas: "It's like posting an idea is treated as a
gauntlet that has to be run - not of technical merit, but of pure
endurance.
" Moore agreed that
it is a problem, but the flipside is one as well:
But it's equally bad if people propose ideas with little or no justification, and no appreciation of the costs of their proposal. Making people "run the gauntlet" is a bad way of weeding out bad proposals, but we don't seem very good at finding a better way (I've seen far too many "I think it would be neat if Python did such-and-such" proposals that wasted way too much of people's time because they didn't get shut down fast enough or effectively enough).
That led to some discussion of said costs; D'Aprano posted a lengthy list, which Lavi then rebutted, but it is clear that no minds are being changed at this point. It seems unlikely that any core developer feels strongly enough to override Hettinger's rejection, but the discussion continues on. In part, that's because proposals sometimes have a different outcome after they have been discussed for "a while"—or if they periodically get resurrected until they finally succeed. Witness structural pattern matching, which had been proposed and discussed many times in different guises until it was adopted, or dict "addition", which had a similar path. As Angelico put it:
It's been proven that tenacity in the face of opposition is crucial to a proposal's success. Proven time and again with successful proposals, not just failed ones. Are you surprised, then, when one rejection doesn't kill a proposal instantly?

That's exactly my point. It's a gauntlet, not a technical discussion.
It is perhaps inconvenient for Lavi (and others), but the hash(None) "problem" is apparently not going away. It seems like there should be a straightforward "fix" or workaround for it, but wholesale changes, and subsequent enforcement of them, seem like the only way forward—at least if always running without ASLR is not an option. That is all rather unsatisfying, in truth. And the gauntlet remains for the next ideas that the community comes up with.
| Index entries for this article | |
|---|---|
| Python | Ideas |
| Python | None |
| Python | Sets |
