A pair of Python vulnerabilities

By Jake Edge
February 24, 2021

Two separate vulnerabilities led to the fast-tracked release of Python 3.9.2 and 3.8.8 on February 19, though source-only releases of 3.7.10 and 3.6.13 came a few days earlier. The vulnerabilities may be problematic for some Python users and workloads; one could potentially lead to remote code execution. The other is, arguably, not exactly a flaw in the Python standard library, which simply follows an older standard, but it can lead to web cache poisoning attacks.

Overflowing

The first vulnerability addressed by the updates is CVE-2021-3177, which is a buffer overflow that was reported in a bug filed on January 16. The problem occurs when the ctypes module, which provides C-compatible data types for Python, is used to convert a floating-point number into a string with a non-exponential approximation of its value. A sufficiently large number will crash the interpreter, as this proof of concept (adapted from the one in the bug report) shows:

    >>> from ctypes import *
    >>> c_double.from_param(1e30)
    <cparam 'd' (1000000000000000019884624838656.000000)>
    >>> c_double.from_param(1e300)
    *** buffer overflow detected ***: terminated
    Aborted

As the error message indicates, the problem is a buffer overflow, specifically because the sprintf() C library function, which "prints" a formatted string to a buffer, assumes that the buffer provided is large enough, with predictable results when it is not. Problems with sprintf() overflowing buffers probably go back over 50 years at this point, but there are safer alternatives. Bug reporter Jordy Zomer suggested using snprintf(), which takes a size parameter; it writes at most size-1 characters to the buffer, followed by the terminating NUL.
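The truncation rule can be modelled in a few lines of Python. This is only a sketch of snprintf()'s documented contract (at most size-1 characters plus a NUL, with the return value giving the length the full string would have needed); the helper name snprintf_model is made up for illustration and is not part of any library.

```python
from typing import Tuple


def snprintf_model(size: int, text: str) -> Tuple[bytes, int]:
    """Hypothetical pure-Python model of C's snprintf() truncation rule."""
    encoded = text.encode("ascii")
    if size == 0:
        # With a zero-sized buffer nothing is written, not even the NUL.
        return b"", len(encoded)
    # Keep at most size-1 characters, then terminate with NUL.
    stored = encoded[: size - 1] + b"\0"
    # Like snprintf(), report the length the untruncated string needs.
    return stored, len(encoded)


stored, needed = snprintf_model(8, "<cparam 'd' (1.0)>")
print(stored)   # b'<cparam\x00'
print(needed)   # 18
```

A caller can compare the returned length against the buffer size to detect truncation, which is something sprintf() gives no opportunity to do.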

The offending code was also part of the report:

    case 'd':
        sprintf(buffer, "<cparam '%c' (%f)>",
            self->tag, self->value.d);
        break;

The "%f" format specifier requests the full decimal form of the floating-point value with no exponent, so passing a really large number (1e300, say) produces a string of more than 300 digits that overflows the 256-byte, stack-based buffer. If a program were using that conversion on user-controlled input, code execution could be the result, since an attacker has some control over what the overflow writes to the stack. If that user-controlled input comes from across the network, of course, remote code execution is possible. However, on the python-dev mailing list, Stephen J. Turnbull expressed some doubts that the bug was actually all that exploitable; he thought it was probably not really necessary to rush out releases to address it.
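The arithmetic is easy to check without touching ctypes at all. The sketch below builds the same text with Python's own "%f" formatting, on the assumption that it matches what the C library produces for these values (it does reproduce the 1e30 output shown in the proof of concept); BUFFER_SIZE and cparam_repr are names chosen here for illustration.

```python
BUFFER_SIZE = 256  # size of the stack buffer, per the article


def cparam_repr(value: float) -> str:
    """Build the text the C code formats, using Python's own "%f"."""
    return "<cparam '%c' (%f)>" % ("d", value)


for value in (1e30, 1e300):
    text = cparam_repr(value)
    needed = len(text) + 1  # plus the terminating NUL
    verdict = "fits" if needed <= BUFFER_SIZE else "overflows"
    print(f"{value!r}: {needed} bytes needed, {verdict}")
```

The 1e30 case needs a few dozen bytes, while 1e300 needs more than 300, so it runs well past the end of the buffer.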

Once the bug was reported, Benjamin Peterson fixed the problem in multiple Python branches in short order. Instead of using snprintf(), though, he used a combination of PyFloat_FromDouble() and PyUnicode_FromFormat() for the specific problem reported. As can be seen in the January 18 patch, he also changed other calls to sprintf() in the ctypes module to use PyUnicode_FromFormat() as well. Instead of taking a buffer, PyUnicode_FromFormat() calculates the size of the resulting string, allocates it, and returns it.

As the Python security bug-tracking page shows, the problem was fixed in two days. Roughly a month later, fixed releases of all four currently supported versions of Python were available. While Python 2.7 is no longer supported by the core developers, some distributions are still updating it for security problems; in this case the problem is only found in Python 3.x, so backporting a fix was not necessary. [Update: As pointed out in an email from Moritz Muehlenhoff, Python 2.7 actually is affected by this bug. He notes that python2 on Debian 10 ("Buster") is affected and has been updated. Also, Fedora has a fix in progress for its python2.7 package.]

Poisoning the web cache

The second flaw addressed in the releases is CVE-2021-23336, which is a poisoning attack against web caches due to a difference in interpretation of the separator for URL query parameters. The older HTML4 spec allowed both ";" and "&" to separate query parameters, but HTML5 only allows the latter. The urllib.parse module in the standard library followed the older spec, but that could lead to problems with web caches.

According to the bug tracking page, the problem was reported to the Python security response team (PSRT) in October 2020, but it became public in a Snyk blog post on January 18 and a bug report on January 19, both by Adam Goldschmidt.

The blog post is a wealth of information about the problem of web cache poisoning and how some Python web frameworks can fall prey to the problem. The basic idea is that web queries are often cached for performance reasons using tools like Squid. The key for the cache entry is typically part of the URL; if another request matches that key, the cached version is served rather than requesting it from the host. That can reduce the response time both because the remote host is not contacted, reducing the network latency, and because whatever processing is required on that host is not done, eliminating the application latency.

There are different kinds of web cache poisoning, but the one that affects Python revolves around the two servers, cache and web, interpreting the URL differently. Python-based web applications that use urllib.parse.parse_qsl() would see both the semicolons and ampersands as separating query parameters, but the cache servers typically do not. In addition, there is a set of query parameters that are usually not used as part of the cache key (e.g. utm_content and other utm_* parameters). That allows attackers to make a query of this sort:

    https://example.com/search/?q=common_term&utm_content=s;q=malicious_term

The web cache sees a query that has two parameters, one of which is not used in the key, but the Python application running on the web server sees it as three parameters, with the second q overriding the value of the first. Now the web cache has an entry for "common_term" that leads to cached results for "malicious_term". Until the cache entry expires, users are not getting what they expect at all, and "hilarity" ensues. In truth, of course, it can result in a wide variety of unpleasant, confusing, and potentially dangerous attacks.
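The mismatch can be reproduced with the query string from the example URL. This is a rough sketch rather than the behavior of any particular cache or framework: cache_view() and legacy_app_view() are hypothetical helpers, the cache is assumed to split only on "&" and to drop utm_* parameters from its key, and the application is assumed to keep the last value seen for a repeated name. The old two-separator parsing is emulated by rewriting ";" to "&", so the result does not depend on which Python release runs it.

```python
from urllib.parse import parse_qsl

QUERY = "q=common_term&utm_content=s;q=malicious_term"


def cache_view(query):
    """Assumed cache behavior: split on '&' only, ignore utm_* for the key."""
    pairs = [tuple(part.split("=", 1)) for part in query.split("&")]
    return [(name, value) for name, value in pairs
            if not name.startswith("utm_")]


def legacy_app_view(query):
    """Emulate a parser that treats both ';' and '&' as separators."""
    pairs = parse_qsl(query.replace(";", "&"))
    # Assumption: the framework keeps the last value for a repeated name.
    return dict(pairs)


print(cache_view(QUERY))       # [('q', 'common_term')]
print(legacy_app_view(QUERY))  # {'q': 'malicious_term', 'utm_content': 's'}
```

The cache files the response under "common_term", while the application has actually answered a search for "malicious_term".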

The obvious solution is for web applications to stop processing the semicolon as a separator, but up until the changes merged February 14 (and released shortly thereafter), there was no way to force that behavior in urllib.parse. That patch changed the library to default to only allowing the ampersand to separate query parameters. It also added a new, optional separator argument to parse_qsl() (and the related parse_qs()) that allows developers to pick a different separator (semicolon, perhaps) if they wish. It should be noted, though, that there is no facility for having more than one separator, so restoring the existing behavior of accepting both query separators is not possible.
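On an interpreter that includes the fix (3.9.2, 3.8.8, 3.7.10, 3.6.13, or later), the new behavior looks roughly like this; the query string is made up for illustration.

```python
from urllib.parse import parse_qsl

query = "a=1&b=2;c=3"

# Default after the fix: only '&' separates parameters.
default_pairs = parse_qsl(query)
print(default_pairs)    # [('a', '1'), ('b', '2;c=3')]

# Opting in to ';' makes it the only separator; '&' is then ordinary data.
semicolon_pairs = parse_qsl(query, separator=";")
print(semicolon_pairs)  # [('a', '1&b=2'), ('c', '3')]
```

Neither call yields the three parameters that the old parser would have produced, which is the point of allowing only a single separator.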

As might be guessed, the fix itself is pretty straightforward. The bulk of the patch consists of documentation and test changes. Unlike the buffer overflow, though, this bug also affects Python 2.7, as noted on the Red Hat tracking page.

It could be argued that Python is not entirely at fault here. As mentioned in the bug report, the W3C did recommend the use of the semicolon in the HTML4 spec, though only in an appendix rather than in the main body of the spec. There were also a few people who thought that changing the default (and removing the ability to handle both separators) was not the right path forward, but urllib maintainer Senthil Kumaran decided that it was best to make the fix in urllib and to do so in a way to avoid confusing web proxies.

There is nothing especially noteworthy about these two bugs, but they do provide a bit of a look into the Python security process. The buffer overflow was addressed rather quickly, but the web cache poisoning problem took rather longer to resolve. It is not entirely clear what happened in the three months between the PSRT report and the public disclosure of the bug; after that, though, the developers moved fairly rapidly. Beyond that, it is always useful to take a peek at the kinds of vulnerabilities that are arising; if nothing else, it can help keep folks on the lookout for other, similar problem areas.


A pair of Python vulnerabilities

Posted Feb 25, 2021 17:04 UTC (Thu) by nim-nim (subscriber, #34454)

A plague on analytics and advertisers, the second attack would not be generally possible without their pervasive corruption of the web.

A pair of Python vulnerabilities

Posted Feb 26, 2021 21:05 UTC (Fri) by flussence (guest, #85566)

The semicolon was part of the HTML spec? The first time I heard of it was in PHP (4 at the time) and I'd assumed that just did it because it was PHP.

(I do think it should've won out though - writing and reading "&amp;" gets old after doing web dev for a while.)

A pair of Python vulnerabilities

Posted Feb 27, 2021 4:06 UTC (Sat) by excors (subscriber, #95769)

Depends what you mean by "part of". HTML4 and HTML5 both construct query strings using '&' as the separator when submitting a <form> as application/x-www-form-urlencoded (the default encoding). I don't think HTML4 (or HTML4-era DOM APIs) contained any features that would parse query strings in the browser. Modern HTML has features like URLSearchParams() which parse by splitting on '&' (https://url.spec.whatwg.org/#concept-urlencoded-parser).

As far as I can tell, the extent to which ';' was part of HTML4 is a non-normative note in an appendix saying "We recommend that HTTP server implementors, and in particular, CGI implementors support the use of ";" in place of "&" to save authors the trouble of escaping "&" characters" (i.e. to avoid having to use '&amp;').

I believe the article is incorrect in saying "HTML5 only allows the latter" (i.e. '&'). HTML5 doesn't have an opinion on what is allowed. All it has is some algorithms to serialise and parse query strings in the browser; if an HTTP server wishes to parse the query string differently then it's perfectly free to do so, and that's outside the scope of HTML5.

It would be unwise to implement a parser where your_parser(html_serialiser(params)) != params, but the serialiser will never emit a ';' (it always gets escaped to '%3b') so a parser that splits on both '&' and ';' will be a correct inverse and there is nothing particularly wrong with doing that.

(But HTML5 does have the opinion that "The application/x-www-form-urlencoded format is in many ways an aberrant monstrosity", and suggests anyone using that format needs to be very careful. When the server consists of multiple components that parse the format in different ways, that should probably be considered a lack of carefulness.)

A pair of Python vulnerabilities

Posted Feb 28, 2021 10:15 UTC (Sun) by ceplm (subscriber, #41334)

The relevant https://bugs.python.org/issue42938 is a rather interesting piece of reading. Recommended.

A pair of Python vulnerabilities

Posted Feb 26, 2021 21:51 UTC (Fri) by gioele (subscriber, #61675)

> Python-based web applications that use urllib.parse.parse_qsl() would see both the semicolons and ampersands as separating query parameters, but the cache servers typically do not.

Cache servers should not be in the business of reading into the query part of a URL. RFC 1738 (Uniform Resource Locators) and RFC 2396 (Uniform Resource Identifiers) explicitly treat everything after the `?` as an opaque string (called the "search" or "query" component) whose interpretation is left to the application.

> 3.4. Query Component
>
> The query component is a string of information to be interpreted by
> the resource.

Applications are free to interpret `?foo=bar:quux` as "parameter `quux` is set to `foobar`" if they want.

A pair of Python vulnerabilities

Posted Mar 1, 2021 18:06 UTC (Mon) by NYKevin (subscriber, #129325)

Unfortunately, as the article describes, websites are now in the habit of shoving useless nonsense (UTM parameters) into the URL, for marketing and tracking purposes. You generally don't want the cache to key off of those parameters, or else it becomes much less effective as a cache. Also, the cache is (usually) part of the endpoint and arguably can and should be in the business of parsing the endpoint's parameters, if the endpoint's administrators so choose.

("I'm not saying this is how it should be. I'm saying this is how it is." - Tom Scott, discussing copyright law, but equally applicable here.)

A pair of Python vulnerabilities

Posted Mar 2, 2021 10:18 UTC (Tue) by mina86 (guest, #68442)

Another reason why one would want proxies to interpret the parameters
is that ‘?foo=0&bar=1’ is ‘the same’ as ‘?bar=1&foo=0’.

Note though that this is all about proxies deployed on the server side
as part of the serving infrastructure where the proxy can be
configured to work correctly. Not about random proxies on the
Internet. The issue here was that people thought the proxy was
interpreting URLs the same way the serving code was where in reality
it wasn’t.