Python and hashing None

Posted Dec 2, 2022 1:20 UTC (Fri) by devkev (subscriber, #74096)
In reply to: Python and hashing None by NYKevin
Parent article: Python and hashing None

The specialness is indeed mostly because of the strong usage conventions - None is a well-known singleton, with ecosystem-wide expectations on how it should be used.

Anyone who wants stable-None-hashing behaviour could certainly create and use an alternative 'StableNone' in their code. But the strong conventions means it's easy to forget to use it, and accidentally use None instead.

It's also a problem if your code needs to work with other code that use None in the normal way (eg. libraries, and/or your code is a library). eg. you might need stuff like this throughout your code:

    y = somethingThatUsesNoneNormally(x if x is not StableNone else None)
    if y is None:
        y = StableNone

Yes, you could wrap/hide these in helper functions, but you still have to remember to use them everywhere they need to be, aka "sprinkle the magic pixie dust", ie. this is only marginally better:

    y = fixNone(somethingThatUsesNoneNormally(fixNone(x))

The question then becomes, why should the application programmer carry this burden? Why do they require a separate StableNone, when it serves the same purpose as None? Why can't the programmer opt-in to tweaked None hashing? There's already a lever to tweak hashing behaviour when warranted.

Python and hashing None

Posted Dec 3, 2022 7:10 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (3 responses)

There are a lot of built-in objects that inherit object.__hash__():

* Module objects
* Function objects
* Bound methods
* Various technical types (such as code objects, frame objects, etc.)

There are also a ton of such classes defined throughout the standard library (and which the average user is much more likely to run into). For example, as far as I can tell, all objects that you can obtain from the sqlite3 module inherit object.__hash__() (e.g. connections, cursors, etc.).

If you really want predictable hashing, you cannot stop at None. You have to monkey-patch object.__hash__() *anyway*, so you might as well just do that and stop asking upstream to modify the language, which never promised predictable hashing in the first place.

Python and hashing None

Posted Dec 3, 2022 10:42 UTC (Sat) by devkev (subscriber, #74096) [Link] (2 responses)

I think we will have to agree to disagree.

I'm happy to explain to anyone who is looking for consistent hashing of arbitrary objects - let alone consistent ordering of arbitrary objects in _unordered_ sets - that that expectation is unreasonable.

But, although None is technically an object, I don't think it's right to consider it an _arbitrary_ object, on an equal footing with a module object or an sqlite3 cursor object. Since it serves the purpose of a nil/null placeholder value, which can be used in place of anything else, I think it's much more productive to consider it as a special kind of basic type, eg. more in line with bool than object. In fact None really is a lot like True and False (which do have fixed hashes), except it's "unary" with only one possible value, rather than two binary/boolean values.

Based on all this I think it's totally fine to consider None's hashing needs completely independently from the hashing needs of arbitrary objects.

Python and hashing None

Posted Dec 4, 2022 1:45 UTC (Sun) by NYKevin (subscriber, #129325) [Link] (1 responses)

> I'm happy to explain to anyone who is looking for consistent hashing of arbitrary objects - let alone consistent ordering of arbitrary objects in _unordered_ sets - that that expectation is unreasonable.

It is my position that this expectation is unreasonable with respect to all objects, other than small integers (which have identity hashing for performance reasons) and other numerical types. True and False have fixed hashes because True == 1 and False == 0, equal objects must hash to the same value, and all small integers hash to themselves (except -1), so it follows that True and False must also hash to 1 and 0 respectively. Larger integers and non-NaN floats are hashed in a more conventional fashion, and the hashing algorithm is defined in such a way that it reduces to an identity function for small integers, rather than having some arbitrary cutoff where an entirely different algorithm is invoked. See [1] for a full explanation of how this actually works. IMHO it might make sense to add some randomized salt to the numerical hash, but as far as I'm aware, nobody has actually managed to exploit it. You'd probably have to do some additional work to ensure that the salt is different depending on the width of the number (or else you'd have exactly the same overall set of hash collisions within the universe of ints), so that's probably not worth the complexity.

All other commonly-encountered objects, including str and all NaN-valued floats, either hash to an unpredictable value, hash to a value based on the hashes of their constituent parts (e.g. tuples), or do not hash at all. There is a weird environment variable to allow consistent hashing of str, solely because it accidentally happened to have consistent hashing once upon a time and a small number of applications may have unwisely depended on that functionality. With None, there is no compelling backcompat story, so there's no reason to extend this to any other type, including NoneType. There's also no way to prevent NaN hashing from being pointer-based, and IMHO if you're going to include None, you should at least be including that as well.

[1]: https://github.com/python/cpython/blob/f4c03484da59049eb6...

Python and hashing None

Posted Dec 4, 2022 1:52 UTC (Sun) by NYKevin (subscriber, #129325) [Link]

> IMHO it might make sense to add some randomized salt to the numerical hash, but as far as I'm aware, nobody has actually managed to exploit it. You'd probably have to do some additional work to ensure that the salt is different depending on the width of the number (or else you'd have exactly the same overall set of hash collisions within the universe of ints), so that's probably not worth the complexity.

It should also be noted that this would make it more difficult for end users to implement custom numerical types, because those are required to use the same hashing algorithm as the standard library, for compatibility reasons. So it is probably infeasible anyway.