By Jake Edge
January 11, 2012
Man pages (or perldoc pages in this case) are not the usual places to look
for security holes, but that is apparently where two researchers found a
whole class of problems for web application frameworks (and the languages
underlying them). The problem concerns dictionaries (aka hashes, depending
on the language) and was fixed in Perl in 2003, but the need for a fix
evidently wasn't clear to other high-level/scripting language developers.
So, at the recently held Chaos Communication Congress, Julian Wälde and
Alexander Klink were able to report efficient denial-of-service attacks
against PHP, Python, Java, Ruby, and the V8 JavaScript engine via some of
the web frameworks that use those languages.
The "perlsec" perldoc page has a section
on "Algorithmic Complexity Attacks" that explains the problem that was
fixed in Perl and is (or was) present in other languages. Dictionaries are
data structures offered by some languages for storing key/value pairs; they
are typically implemented as hash tables indexed by a hash of the key.
Unlike a cryptographic hash, the hash functions used for hash tables are
meant to be cheap to compute while still spreading typical keys evenly
enough that collisions stay rare. But if the same hash function is used for
every installation and invocation of a given language, one can fairly
easily calculate a set of dictionary keys that all collide. That's where
the problem starts.
A hash table needs to have some way of dealing with collisions, where two
different key values hash to the same value. One common way is to have
each hash "bucket" be a linked list of all the dictionary entries that hash
to the bucket index. In most cases, that works just fine as the number of
collisions is typically small, so the slight inefficiency of traversing the
list for insertion, deletion, and searching is in the noise. But if an
attacker can control the keys that are being inserted, they can arrange
that all of them hash to the same bucket.
That turns the usual case, where an insertion has to step past at most a
few colliding entries, into one where every insertion must traverse all of
the entries added so far.
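That quadratic blow-up is easy to reproduce with a toy implementation. The
following sketch (in Python, using a deliberately simplistic hash function
that stands in for a predictable per-language hash; it is not any real
runtime's code) builds a chained hash table and times insertions of keys
that all land in one bucket against ordinary keys:

    import itertools
    import time

    class ToyDict:
        """Chained hash table sketch; real runtimes are more sophisticated."""

        def __init__(self, nbuckets=1024):
            self.buckets = [[] for _ in range(nbuckets)]

        def _hash(self, key):
            # Deliberately weak, *fixed* hash function, standing in for a
            # predictable, unseeded per-language hash.
            return sum(ord(c) for c in key) % len(self.buckets)

        def insert(self, key, value):
            bucket = self.buckets[self._hash(key)]
            for i, (k, _) in enumerate(bucket):
                if k == key:                # already present? just replace it
                    bucket[i] = (key, value)
                    return
            bucket.append((key, value))     # whole chain walked before appending

    def colliding_keys(n, length=16):
        # Strings containing the same characters in a different order have the
        # same character sum, so every one of these lands in the same bucket.
        keys = []
        for positions in itertools.combinations(range(length), length // 2):
            chars = ["A"] * length
            for p in positions:
                chars[p] = "B"
            keys.append("".join(chars))
            if len(keys) == n:
                break
        return keys

    def time_inserts(keys):
        d = ToyDict()
        start = time.time()
        for key in keys:
            d.insert(key, None)
        return time.time() - start

    n = 10000
    print("colliding keys:", time_inserts(colliding_keys(n)))
    print("ordinary keys: ", time_inserts(["key%d" % i for i in range(n)]))

The colliding insertions do quadratically more key comparisons than the
ordinary ones, which is exactly the asymmetry the attack exploits.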
If all keys hash to the same bucket, the entire linked list for the
bucket must be traversed in order to insert a new key. During the traversal,
each stored key must be compared with the insertion key to see if it is
already present (which it wouldn't be in the attack scenario, but the hash
table insertion code doesn't know that). As more and more keys get added, that
traversal takes longer and longer, to a point where it can chew up some
fairly significant CPU time. But how can an attacker control the key
values that get stored?
That's where web frameworks come into play. Those frameworks helpfully
collect up all of the POST form data that gets sent to
a particular web page into a dictionary that gets handed off
to the application for processing. The amount of data that the frameworks
will allow is typically rather large (e.g. 1MB), so an attacker can create
tens of thousands of form entries in a single HTTP request. Because they
all collide, those entries can take tens of seconds to a few minutes to
process on an
i7 core according to the advisory
[PDF] released by Wälde and Klink.
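The exact API differs from framework to framework, but the pattern looks
roughly like the sketch below, which uses Python's urllib.parse module to
decode a form-encoded body (no real framework necessarily does it this way);
every field name the client chooses becomes a dictionary key on the server:

    from urllib.parse import parse_qs

    # Rough sketch of what a framework does with an incoming POST body
    # (application/x-www-form-urlencoded).  The client controls the field
    # names, and each one becomes a key in the server-side dictionary.
    body = "user=alice&comment=hi&evil1=x&evil2=x&evil3=x"
    form = parse_qs(body)
    print(form)
    # {'user': ['alice'], 'comment': ['hi'], 'evil1': ['x'], ...}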
Because the languages do not randomize the hash function used in any way,
the collisions that get calculated are usable for a wide variety of web
applications. It is likely that collisions calculated for Python would
work on most Python-based web applications, for example. The presentation
(slides
[PDF])
mentions two different techniques to find collisions that are computationally
feasible, but they are both "offline" calculations and cannot be used if
the hash function changes between invocations and
sites. Thus, the suggested fix for the problem is to choose a random seed
that gets mixed into the hash so that
collisions are not predictable. That seed would be chosen at language
start-up time so that each invocation essentially has its own hash function.
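A minimal sketch of that suggestion, assuming a simple multiplicative string
hash (this is an illustration of the idea, not any language's actual
implementation), might look like:

    import os

    # One random seed, drawn when the interpreter (or process) starts up.
    _SEED = int.from_bytes(os.urandom(8), "little")

    def seeded_hash(s, nbuckets=1 << 20):
        # An otherwise ordinary multiplicative hash, but it starts from the
        # random seed rather than a fixed constant, so colliding keys computed
        # against one process are useless against another.
        h = _SEED
        for ch in s:
            h = (h * 33 + ord(ch)) & 0xFFFFFFFFFFFFFFFF
        return h % nbuckets

As described below, though, this simplest form of seeding turned out not to
be the whole answer.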
There are workarounds for the problem as well. Web frameworks could
(probably should) limit the amount of form data that can be processed
and/or the number of form elements that are allowed in a given
POST. Another potential workaround is to limit the amount of CPU
time that a web application is allowed to use. If a given POST
can only soak up ten seconds of a core, for example, it will take more
of them (i.e. more bandwidth) to fully saturate a given server or set of
servers.
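As an illustration, a framework or application could refuse to build the
dictionary at all for oversized requests; the limits and names below are
hypothetical rather than taken from any particular framework:

    MAX_BODY_BYTES = 64 * 1024   # hypothetical limits, tuned per application
    MAX_FIELDS = 1000

    def parse_form(body):
        if len(body) > MAX_BODY_BYTES:
            raise ValueError("request body too large")
        pairs = body.split("&")
        if len(pairs) > MAX_FIELDS:
            raise ValueError("too many form fields")
        form = {}
        for pair in pairs:
            name, _, value = pair.partition("=")
            form[name] = value       # dictionary built only after the checks
        return form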
While randomizing the hash function is suggested, it may not be a complete
solution. Discussion on the python-dev
mailing list and in Python bug 13703 indicates that
more is needed to eradicate the problem. As Paul McMillan put it: "I've
gotten confirmation from several other sources that the fix recommended by
the presenters (just a random initialization seed) only prevents the most
basic form of the attack." For long-running Python interpreters, it
may be feasible for an attacker to brute force or otherwise determine the
random seed that was used.
Determining the random seed becomes easier with an "oracle function",
which is some mechanism that an attacker can use to get information about
the hash values of various strings. It does not require that the "raw"
hash values be returned, necessarily, as other information (like JSON
dictionary key ordering, for example) can be used to deduce the seed.
Those are much harder attacks than the simple "POST with thousands
of collisions" attack that is available today, however.
Other languages, including Perl and Ruby, have fixed the simpler
problem, but Python is looking at going further. One way to do that
would be to use a much larger seed, and choose pieces of the seed to add
into the hash based on properties of the input string, which essentially
increases the search space for brute force and other means of deducing the
seed.
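One way to picture that idea (purely illustrative; the approach was still
being debated on python-dev at the time) is a large pool of random words,
with the words that get mixed in chosen by properties of the key itself,
such as its length and first character:

    import os

    # A larger pool of random material, generated once at start-up.
    _POOL = [int.from_bytes(os.urandom(8), "little") for _ in range(256)]

    def hash_with_big_seed(s, nbuckets=1 << 20):
        # Which pieces of the pool get mixed in depends on the input string
        # itself (here its length and first character), so recovering the
        # effective seed for one set of keys does not recover it for all.
        h = _POOL[len(s) & 0xFF]
        if s:
            h ^= _POOL[ord(s[0]) & 0xFF]
        for ch in s:
            h = (h * 33 + ord(ch)) & 0xFFFFFFFFFFFFFFFF
        return h % nbuckets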
PHP has also started to consider fixes. According to the
advisory, Oracle has determined that no fix will be applied to Java itself,
but that the GlassFish application server will be updated to fix or
work around the problem.
The 2003 research
[PDF] by Scott A. Crosby and Dan S. Wallach clearly showed the flaw,
but didn't apply it to any language beyond Perl. That led to the Perl
fix, but even a heads-up
message to python-dev by Crosby was not enough to cause a fix in Python
at that time. Part of the problem was the inability to point to an easy
way for an attacker to control the keys put into a dictionary. Evidently,
the web frameworks were not considered at that time.
According to McMillan, though, the web frameworks are just the tip of the
iceberg in terms of the scope of the problem. Various Python developers
had suggested that the problem should be fixed in the web frameworks (as
Java is evidently doing) or in a few standard library modules (like urllib,
cgi, and json), but McMillan states that the
problem goes
much further than that:
A short and very incomplete
list of vulnerable standard lib modules includes: every single parsing
library (json, xml, html, plus all the third party libraries that do
that), all of numpy (because it processes data which probably came
from a user [yes, integers can trigger the vulnerability]), difflib,
the math module, most database adaptors, anything that parses metadata
(including commonly used third party libs like PIL), the tarfile lib
along with other compressed format handlers, the csv module,
robotparser, plistlib, argparse, pretty much everything under the
heading of "18. Internet Data Handling" (email, mailbox, mimetypes,
etc.), "19. Structured Markup Processing Tools", "20. Internet
Protocols and Support", "21. Multimedia Services", "22.
Internationalization", TKinter, and all the os calls that handle
filenames. The list is impossibly large, even if we completely ignore
user code. This MUST be fixed at a language level.
Work on testing and benchmarking changes for Python is ongoing, but the
plan is to provide a security release for 2.6, 2.7, and 3.3, while also
providing backported patches for earlier Pythons. PHP is discussing options,
including placing limits on the length of hash collision chains and
limiting the number of form variables that will be processed, while still
planning to add randomization into the hashing function.
It is interesting to see a nearly ten-year-old flaw pop back up, though it
seems likely that the flaw hasn't been exploited widely (or fixes would
have come sooner). It does serve as a reminder that what may seem like
theoretical vulnerabilities will, over time, often become actual
vulnerabilities. Declining to make changes to head off "theoretical"
attacks may well be the right choice at the time, but it is worth pondering
the likelihood that the change will eventually need to be made anyway. Like
most security questions, and like the rest of software development, it is a
matter of trade-offs.