Python adopts SipHash
Hash collisions are a fact of life, but one that can have serious consequences when attackers can control the values being hashed. We looked at this problem, which can allow denial-of-service attacks, back in January 2012 for Python, PHP, Java, Ruby, JavaScript, and other dynamic languages. While fixes were made at that time, the hash function used by Python was still vulnerable to certain kinds of attacks. That situation is now changing, with Python Enhancement Proposal (PEP) 456 having been accepted for inclusion in the upcoming Python 3.4 release. That means Python will be using SipHash for its hash function going forward.
Hash functions are used to reproducibly turn an arbitrary string (or series of bytes) into a single value (of a length determined by the type of hash) that can be used for various purposes. For example, cryptographic hash functions are used to derive digest values for digital files that can then be used with signature algorithms to digitally "sign" documents or other data (e.g. distribution software packages). Hash functions are also used for data structures like dictionaries (aka hashes or associative arrays) where the function maps the key to a value that can be used to find the data associated with the key in the dictionary.
But, with any hash function, multiple keys will hash to the same value. In a data structure, that is often handled by making each hash "bucket" actually be a linked list of the colliding entries—normally just a few. Operations, such as lookup, insert, or delete, on keys that hash to a particular bucket then have to traverse the list. If the number of collisions is low, the effect of a short list traversal is minimal, but if that number is high, it can significantly impact the performance of those operations.
Normally, languages try to choose hash functions that will fairly evenly spread the expected key values into the hash space. But if the hash function is known to an attacker, and they can arrange to provide the key values to be hashed, denial-of-service attacks are possible. One way attackers can do that is with HTTP POST (form submission, essentially) requests. Many web application frameworks helpfully collect up all of the POSTed variables into a dictionary for delivery to the application. Just supplying a list of variables that all hash to the same bucket (and possibly submitting that POST multiple times) may be enough to bring a web server to its knees.
This was all discovered in Perl in 2003, then rediscovered for other languages in late 2011. A bug was opened for Python, then closed a few months later after the hash function used by the interpreter was randomized based on a flag (-R) given at run time. That didn't fully solve the problem, however, since effectively only 256 separate functions were used—an attacker could use various techniques to determine the hash function, thus could still cause a denial of service. In fact, Jean-Philippe Aumasson and Daniel J. Bernstein developed a proof-of-concept attack to recover the seed used to randomize the hash function for Python 2.7.3 and 3.2.3.
Shortly after the original bug was closed, a new bug was filed that recognized
the inadequacy of the solution, but it took 18 months or so before PEP 456
was formally accepted for inclusion into Python
3.4. PEP author Christian Heimes looked at several different hash
functions before settling on the SipHash24 variant. It "provides the
best combination of speed and security
" and several other
high-profile projects (e.g. Ruby, Perl, Rust, FreeBSD, Redis, ...) have
chosen SipHash.
Unlike earlier hash functions used by Python and others, SipHash is a
cryptographic hash. Python's current implementation uses
a modified Fowler-Noll-Vo (FNV) hash function, which was changed to add
a random prefix and suffix to the bytes being hashed. But Heimes is
convinced that "the nature of a non-cryptographic hash function makes
it impossible to conceal the secrets
".
At the time that PEP 456 was being written, there was another discussion of the issue on the python-dev mailing list. Heimes started the conversation by soliciting opinions on whether the hash function should be a build-time option or be switchable at run time. Most felt that choosing the algorithm at compile time was sufficient, but some, including Python benevolent dictator for life (BDFL) Guido van Rossum, were not at all convinced that any change was needed.
Van Rossum was concerned that the problem is
largely manufactured by "some security researchers drumming up
business
"—that it is only of theoretical interest. But Armin Rigo disagreed:
In the end, Van Rossum did not put his foot down; he said that he was "fine with a new
hash function as long as it's either faster, or safer and not
slower
". SipHash24 has been within a few percent of the performance
of the existing hash function on several different benchmarks. There are
concerns that it will impact performance for short keys (less than, say, seven
bytes) because it has some setup and teardown costs, so switching to a
faster but less secure hash for short keys is being investigated.
According to the PEP,
there are multiple places in the CPython code
that use their own version of the hash algorithm: "the current hash
algorithm is hard-coded and implemented multiple times for bytes and three
different Unicode representations
".
That would make it
harder for someone trying to put in their own replacement, so the PEP also
proposes reworking the internals of Python such that the hash function can be
replaced
in a single location. That will appear in Python 3.4 as well.
It seems a bit strange that it took this long for Python to fix the
problem. As Rigo said, ignoring the problem might be a reasonable approach
for a substantial fraction of Python users, but that was true before the
previous fix was applied as well. Given that Python developers presumably
don't just
want to apply a cosmetic fix, it is a little surprising it took this long
to get to the "proper" solution. But Larry Hastings may be right with his suggestion that "there was enough bike
shedding that people ran out of steam
" to immediately address the
problem.
Given how widespread SipHash is now for dictionary hash functions, we are potentially vulnerable to some kind of breakthrough in finding collisions in that algorithm. But, at least there are a full 64 bits of entropy being used by SipHash (rather than the eight bits for the modified FNV function). That should at least make brute force attacks infeasible—we will just need to keep our eye out for cryptographic breakthroughs down the road.
| Index entries for this article | |
|---|---|
| Security | Python |
| Security | Vulnerabilities/Denial of service |
