
A literal string type for Python

By Jake Edge
April 13, 2022

Using strings with contents that are supplied by users can be fraught with peril; SQL injection is a well-known technique for attacking applications that stems from that, for example. Generally, database frameworks and libraries provide mechanisms that seek to lead programmers toward doing The Right Thing, with parameterized queries and the like, but they cannot enforce that—inventive developers will seemingly always find ways to inject user input into places it should not go. A recently adopted Python Enhancement Proposal (PEP) provides a way to enforce the use of strings that are untainted by user input, but it uses the optional typing features of the language to do so; those wanting to take advantage of it will need to be running a type-checking program.

PEP 675 ("Arbitrary Literal String Type") flew under the radar to a certain extent. It was discussed on the Python typing-sig mailing list, mostly back in January of this year, then posted to the python-dev mailing list in February, where there was little discussion. In March, it was accepted by the steering council for inclusion into Python 3.11, which is due in October. Gregory P. Smith had some interesting thoughts when he announced the acceptance on behalf of the council:

TL;DR - PEP 675 allows type checkers to help prevent bugs allowing attacker-controlled data to be passed to APIs that declare themselves as requiring literal, in-code strings.

This is a very thorough PEP with a compelling and highly relevant set of use cases. If I tried to call out all the things we like about it, it'd turn into a table of contents. It is long, but everything has a reason to be there. :)

Once implemented, we expect it to be a challenge to tighten widely used existing APIs that accept str today to be LiteralString for practical reasons of what existing code calling unrestricted APIs naturally does. The community would benefit from anyone who attempts to move a widely used existing str API to LiteralString sharing their experiences, successful or not.

As he notes, the feature will not magically fix SQL-injection vulnerabilities (or other, similar problems); it will take time to properly annotate existing APIs, for one thing. Beyond that, applications will need to be processed using one of the available Python static type checkers (e.g. mypy or pytype) and then be fixed to no longer work around the restrictions on passing literal strings.

SQL injection

The Motivation section of the PEP describes the problem well; here we adapt and condense some of its examples. Code vulnerable to a SQL injection might look something like the following:

    def query_user(conn: Connection, user_id: str) -> User:
        query = f"SELECT * FROM data WHERE user_id = {user_id}"
        conn.execute(query)
        ...

    query_user(conn, "user123")
    query_user(conn, "user123; DROP TABLE data;")
    query_user(conn, "user123 OR 1 = 1")

The query_user() function may seem reasonable at first glance and the first use of it works just fine. The other two show how the call could be misused if, for example, the user to be searched for is being read from a web form. The second call would delete the data table, while the third would retrieve all users (since 1 = 1 is always true). Anyone running a web application on today's web will be all-too-familiar with messages in their logs showing efforts to exploit this kind of problem. Of course, xkcd has also highlighted the problem in its inimitable style.
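The effect is easy to reproduce against a throwaway in-memory SQLite database. The sketch below is an illustration rather than the PEP's code: the table contents, the query_user_unsafe() name, and the quoted form of the query are all assumptions made so that the example runs on its own.

```python
import sqlite3

# Hypothetical sample data, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (user_id TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [("user123", "alpha"), ("user456", "beta")],
)


def query_user_unsafe(conn: sqlite3.Connection, user_id: str):
    # The user-supplied value is pasted straight into the SQL text.
    query = f"SELECT user_id, secret FROM data WHERE user_id = '{user_id}'"
    return conn.execute(query).fetchall()


honest = query_user_unsafe(conn, "user123")
# The closing quote in the input ends the string early; the rest is run as SQL.
injected = query_user_unsafe(conn, "nobody' OR '1' = '1")

print(honest)
print(len(injected))
```

The first call returns only the matching row, while the second returns every row in the table even though no user is named "nobody".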

Database libraries generally have a mechanism to avoid that kind of programming error through the use of query parameters. For example:

    def query_user(conn: Connection, user_id: str) -> User:
        query = "SELECT * FROM data WHERE user_id = ?"
        conn.execute(query, (user_id,))
        ...

When developers use that mechanism, the database library makes the substitution for "?" in a safe way; instead of tacking on the extra stuff as additional SQL, it will look for users with names that are exactly what is in the string passed to query_user(). If developers would use query parameters everywhere, the problem would largely be solved, but the library cannot enforce that type of usage. It simply gets a string that it needs to execute; the library's documentation can admonish developers not to dynamically build the query string with user input, but that clearly has not stopped them from doing so.
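As a rough illustration of that difference, the following sketch runs a parameterized query against an in-memory SQLite database. The table and its contents are invented for the example; only the "?" placeholder mechanism comes from the text above.

```python
import sqlite3

# Hypothetical sample data, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (user_id TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [("user123", "alpha"), ("user456", "beta")],
)


def query_user(conn: sqlite3.Connection, user_id: str):
    # The SQL text is fixed; the value travels separately as a parameter.
    query = "SELECT user_id, secret FROM data WHERE user_id = ?"
    return conn.execute(query, (user_id,)).fetchall()


honest = query_user(conn, "user123")
# The hostile input is compared as a whole string, so it matches nothing.
hostile = query_user(conn, "nobody' OR '1' = '1")

print(honest)
print(hostile)
```

Here the hostile input is simply a user ID that does not exist, so the second call returns an empty list.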

So the PEP adds a LiteralString type to the typing module to allow developers to annotate variables, function parameters, return values, and more to indicate that the string value must be composed of literal values. So, as an example from the PEP shows, the execute() member function for a database API might be defined as follows:

    from typing import LiteralString

    def execute(self, sql: LiteralString, parameters: Iterable[str] = ...) -> Cursor: ...

That would cause uses of non-literal strings as parameters to the function to provoke a warning or error from the type checker:

    def query_user(conn: Connection, user_id: str) -> User:
        query = f"SELECT * FROM data WHERE user_id = {user_id}"
        conn.execute(query)
        # Error: Expected LiteralString, got str.

In this example, the query string has been built from other data, thus it may be susceptible to SQL-injection problems. An interpolated Python f-string is not a literal string if it is built with regular str variables, so a type checker can infer that query is not either, thus it can complain. There is a somewhat common pattern, which is often completely benign, that the PEP also takes into account. For example:

    def query_user(conn: Connection, user_id: str, limit: bool) -> User:
        query = "SELECT * FROM data WHERE user_id = ?"
        if limit:
            query += " LIMIT 1"

        conn.execute(query, (user_id,))

In that version of query_user(), query is still a LiteralString even after the limit clause has been appended, because concatenating two LiteralString types is a LiteralString. One could even expand that further by turning the "LIMIT 1" into "LIMIT ?" and building up a list of parameter values to pass to execute(). Well-written code already does those sorts of things, so the adjustment to the addition of LiteralString annotations should be minimal—for those code bases, at least. Those projects will benefit when someone slips up and inserts a potential injection into the code; the next time the type checker is run, it will loudly point out the problem.
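The "LIMIT ?" variation mentioned above might look something like the following sketch, which uses the standard sqlite3 module. The sample rows and the optional integer limit parameter are assumptions for illustration. The fallback import is there only so the code runs on interpreters older than 3.11, and sqlite3 itself is not assumed to require LiteralString. The annotation on query simply shows what a checker would track.

```python
import sqlite3
from typing import List, Optional

try:
    from typing import LiteralString  # Python 3.11+
except ImportError:  # fallback so the sketch still runs on older versions
    LiteralString = str

# Hypothetical sample data, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (user_id TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [("user123", "alpha"), ("user123", "gamma"), ("user456", "beta")],
)


def query_user(conn: sqlite3.Connection, user_id: str, limit: Optional[int] = None):
    # Only literal pieces are ever concatenated into the query text.
    query: LiteralString = "SELECT user_id, secret FROM data WHERE user_id = ?"
    params: List[object] = [user_id]
    if limit is not None:
        query += " LIMIT ?"   # still built purely from literals
        params.append(limit)  # the value itself goes in the parameter list
    return conn.execute(query, params).fetchall()


all_rows = query_user(conn, "user123")
one_row = query_user(conn, "user123", limit=1)
print(len(all_rows), len(one_row))
```

A checker that implements the PEP would be expected to object if a plain str were appended to query, since that would no longer be a LiteralString.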

Meanwhile, the technique can be applied more widely, as the PEP indicates:

LiteralString is also useful in other cases where we want strict command-data separation, such as when building shell commands or when rendering a string into an HTML response without escaping (see Appendix A: Other Uses). Overall, this combination of strictness and flexibility makes it easy to enforce safer API usage in sensitive code without burdening users.
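For the HTML case, one way to picture that command-data separation is a rendering helper whose template must be a literal while the values may be anything. The render() function below is hypothetical and not taken from the PEP or from any framework. It is a minimal sketch, with a fallback import so that it also runs before Python 3.11.

```python
import html

try:
    from typing import LiteralString  # Python 3.11+
except ImportError:  # fallback so the sketch still runs on older versions
    LiteralString = str


def render(template: LiteralString, **values: str) -> str:
    # The template is trusted because it must be written in the source code;
    # every substituted value is treated as data and escaped.
    escaped = {name: html.escape(value) for name, value in values.items()}
    return template.format(**escaped)


page = render("<p>Hello, {name}!</p>", name="<script>alert(1)</script>")
print(page)
```

With this shape, a type checker can flag a call that passes user input as the template, while the escaping takes care of the values.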

Why use a type?

At first glance, using the type system to enforce this kind of behavior might seem like an odd choice, but the Rationale section of the PEP describes the other options and shows why the type system actually makes the most sense, at least for Python. A run-time approach would need to rely on heuristics that are imperfect (and highly use-case dependent). A static analyzer that looked at the abstract syntax tree to try to spot problems of this sort would be overly restrictive because it cannot determine "when a string is assigned to an intermediate variable or when it is transformed by a benign function". The type checker is better placed:

The type checker, surprisingly, does better than both because it has access to information not available in the runtime or static analysis approaches. Specifically, the type checker can tell us whether an expression has a literal string type, say Literal["foo"]. The type checker already propagates types across variable assignments or function calls.

The Literal type has been present since Python 3.8. It allows specifying the legal values for a given type; for example, a parameter that can only be a few specific values might be: Literal["a", "b", "c"]. A parameter of that type that was passed "d" would fail the type check, as would passing a regular string type:

    def foo(x: Literal["a", "b"]) -> None: ...

    foo("a")               # works
    a_var = "a string"
    foo(a_var)             # type failure

So Literal might be useful in parts of the application, but a SQL query can contain any string, so LiteralString was born:

LiteralString is the "supertype" of all literal string types. In effect, this PEP just introduces a type in the type hierarchy between Literal["foo"] and str. Any particular literal string, such as Literal["foo"] or Literal["bar"], is compatible with LiteralString, but not the other way around. The "supertype" of LiteralString itself is str. So, LiteralString is compatible with str, but not the other way around.
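The compatibility rules in that passage can be spelled out as function calls. The function names below are invented for the illustration, and the comments describe what a checker implementing PEP 675 would be expected to report. At run time nothing is enforced, so every call simply executes.

```python
from typing import Literal

try:
    from typing import LiteralString  # Python 3.11+
except ImportError:  # fallback so the sketch still runs on older versions
    LiteralString = str


def takes_foo(x: Literal["foo"]) -> None:
    return None


def takes_literal(x: LiteralString) -> None:
    return None


def takes_str(x: str) -> None:
    return None


def demo(specific: Literal["foo"], literal: LiteralString, anything: str) -> None:
    takes_str(specific)      # fine: Literal["foo"] is compatible with str
    takes_str(literal)       # fine: LiteralString is compatible with str
    takes_literal(specific)  # fine: Literal["foo"] is compatible with LiteralString
    takes_literal(anything)  # a checker should reject this: str is not a LiteralString
    takes_foo(literal)       # a checker should reject this: not known to be "foo"


# The interpreter ignores the annotations, so this runs without complaint.
result = demo("foo", "bar", "baz")
```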

Due to overloading the types allowed for many of the standard str methods, as described in Appendix C, the LiteralString type can be preserved through various kinds of string transformations. For example:

    def bar(x: LiteralString, y: LiteralString) -> LiteralString:
        return ", ".join((x, y))

That works to join the two parameters with a ", " because that separator string is literal, as it is not composed from other strings, and the join() arguments also have the type LiteralString. It might seem like the language is "casting" the value to the return type, but that is not happening here; Python type annotations are not used by the language at run time. That kind of type overloading also applies to f-strings, so those can still produce a LiteralString:

    def baz(x: List[LiteralString], y: List[LiteralString]) -> LiteralString:
        xout = " ".join(x)
        yout = "+".join(y)
        return f"({xout}) '{yout}'"

    baz(["a", "b"], ["c", "d"])    # produces "(a b) 'c+d'" as a LiteralString

So the two lists of literal strings are combined using join() and an f-string to produce something that is still a LiteralString. There is a multi-line example of using that in SQL-query context in the Rejected Alternatives section of the PEP; it turns out that some existing tools for catching SQL-injection errors cannot support that kind of benign query-string construction. Note that the first example in the article could perhaps be made to "work" by declaring user_id as a LiteralString instead of a str, though passing data that came in from the outside would fail a type check without some kind of gyrations by the programmer.

Those kinds of gyrations are explicitly listed in the Limitations section. There will always be ways to evade the checks—by not running a type checker to start with—but the PEP is not meant to stop that kind of thing:

[...] ultimately a clever, malicious developer attempting to circumvent the protections offered by LiteralString will always succeed. The important thing to remember is that LiteralString is not intended to protect against malicious developers; it is meant to protect against benign developers accidentally using sensitive APIs in a dangerous way (without getting in their way otherwise).

Without LiteralString, the best enforcement tool API authors have is documentation, which is easily ignored and often not seen. With LiteralString, API misuse requires conscious thought and artifacts in the code that reviewers and future developers can notice.
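To give a sense of what such "conscious thought and artifacts" could look like, here is a sketch of a deliberate workaround in the spirit of the Limitations section. It is not the PEP's own code: the launder() name and the tiny three-letter table are assumptions, and whether a particular checker accepts it depends on how fully it implements the PEP.

```python
from typing import Dict, List

try:
    from typing import LiteralString  # Python 3.11+
except ImportError:  # fallback so the sketch still runs on older versions
    LiteralString = str


def launder(s: str) -> LiteralString:
    # Every value in the table is written out as a literal in the source, so
    # each lookup result has a literal type even though the key is arbitrary.
    table: Dict[str, LiteralString] = {"a": "a", "b": "b", "c": "c"}
    pieces: List[LiteralString] = [table[ch] for ch in s]
    return "".join(pieces)


# Stands in for data arriving from outside; its static type is plain str.
user_input = "".join(reversed("cab"))
result = launder(user_input)
print(result)
```

Nobody writes a function like this by accident, which is the point of the passage quoted above: the circumvention is visible to reviewers.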

Reception

In a reply to the announcement of the PEP's acceptance, Neil Schemenauer applauded the idea, noting that he had done something along the same lines years ago:

I did something like this for HTML templating in the Quixote web framework (to avoid XSS [cross-site scripting] bugs). I did it as a special kind of module with a slightly different compiler (using AST transform). With the LiteralString feature, I can implement the same kind of thing directly in Python.

Quixote has an htmltext class for strings that are considered safe because they have already been properly escaped for HTML, unlike regular strings which are generally treated as still needing to be escaped. The template system that he describes provides a way to generate HTML directly from functions, while escaping parameters and the like, so that cross-site scripting problems are avoided. Quixote was one of the earlier Python web frameworks and is still in use, including on this site.

Unlike with some other PEPs, there has not been a lot of discussion of the merits of the idea or of changes needed in the text of the PEP. Some of that was done in the typing-sig thread, but much of it was something of a bikeshed exercise for the name of the type. As Smith noted, the PEP authors, Pradeep Kumar Srinivasan and Graham Bleaney, along with the core developer sponsor, Jelle Zijlstra, have done a nice job in creating a compelling case for the feature. In addition, they did so in a way that allows well-behaved code to just continue to work.

Over the years, the Python typing features have grown and matured since their introduction as "type hints" back in 2014. PEP 484 ("Type Hints") added the feature to Python 3.5 and every release since then has added more to the feature. There is a long list of typing PEPs that have been adopted; PEP 675 will soon join the party.

While the typing features (and the use of type checkers) are optional for Python, the language is clearly headed toward more widespread use of types. The rise of typed Python has led to some discomfort in the ecosystem that types are starting a slow move toward dominating decisions for the language, to the point where those who are not interested in types, per se, are being left behind. So far, at least, that does not really seem to be the case; the steering council has been trying to balance the needs of the different groups of users and has generally been successful in doing so.

There are, of course, many constituents in the pro-type camp. Some larger applications have already retrofitted type features into their code and more of that work is ongoing. Any new Python project, especially one that looks likely to grow into a significant code base, should likely carefully consider adding a type checker into the mix. There are, it seems, lots of benefits and few real downsides at this point. LiteralString will only add to those benefits over time.


Index entries for this article
Python: Python Enhancement Proposals (PEP)/PEP 675
Python: Security
Python: Static typing



A literal string type for Python

Posted Apr 14, 2022 1:20 UTC (Thu) by milesrout (subscriber, #126894) [Link] (10 responses)

This seems like a very good design. It speaks volumes that (at least from what I can tell from reading this article) the type system did not need to be modified to make this work. Many type systems in common use today could not represent this sort of subtyping relationship, as doing LiteralString.__add__(LiteralString) would give back a str. And it all disappears into the background without getting in your way if you just want to write normal Python without type annotations.

A literal string type for Python

Posted Apr 14, 2022 16:31 UTC (Thu) by tialaramex (subscriber, #21167) [Link] (9 responses)

I fear that the "it even works if you've actually done stuff with the strings" using methods we think are safe - undoes too much of the initial value of requiring literals. I understand entirely why they did it, but my instinct is that they've unlocked far too much here. Mechanically it's obviously possible to use the capabilities marked "safe" to produce arbitrary LiteralStrings at runtime, at which point these are clearly not literal strings and I think in the sort of "Oops, I am not really a programmer" code where this safety was most necessary, that's more rather than less likely to be accessible to an attacker.

"EAT BABIES" // clearly expresses your intent, it's your fault, you wrote that.

"EAT" + " BABIES" // I can see why they felt like they should make this work, they've cited examples that do this and it genuinely is still clear at this point what your intent was, although I think it should be discouraged anyway.

doComplicatedStuffBasedOnUserInput("DO NOT EAT BABIES", input) // this still type checks as LiteralString, and might be EAT BABIES yet we can hardly claim now that we're reflecting clear programmer intent when that happens.

The reason to want literals here rather than allowing arbitrary strings is to get closer to requiring intent. I'd rather give up the second example than, as this PEP does, allow the third example the opportunity to set fire to everything and pretend that's "safe".

Rust of necessity has to require actual literals in formatting (not merely constant strings) because the formatting work is done via the macro system, and the macro system can't see inside variables. But I think even though more sophisticated behaviour would be welcomed by many Rust programmers I personally prefer the literal requirement.

A literal string type for Python

Posted Apr 14, 2022 16:45 UTC (Thu) by mb (subscriber, #50428) [Link] (8 responses)

Why do you think your third example is unsafe?

A literal string type for Python

Posted Apr 15, 2022 20:28 UTC (Fri) by tialaramex (subscriber, #21167) [Link] (7 responses)

Because the resulting string can be EAT BABIES, and it's entirely possible that the programmer did not anticipate the circumstances which allow that? If we're OK with that, then this entire exercise was futile, as we could have also blessed arbitrary strings.

A literal string type for Python

Posted Apr 16, 2022 8:56 UTC (Sat) by mb (subscriber, #50428) [Link] (6 responses)

The issue literal strings are trying to solve is user inputs being pasted into places where a hardcoded string is expected. That's to ensure that the program user cannot get arbitrary control over these strings.

It's not supposed to prevent the programmer from hardcoding the wrong string.

A literal string type for Python

Posted Apr 16, 2022 21:28 UTC (Sat) by tialaramex (subscriber, #21167) [Link] (5 responses)

And that's why I have a problem with:

doComplicatedStuffBasedOnUserInput("DO NOT EAT BABIES", input)

"DO NOT EAT BABIES" is blessed as a LiteralString because it is. No problem so far. But "cleverly" this proposal allows operations (such as truncation, concatenation, duplication and splitting) on LiteralString to produce a LiteralString, and so if doComplicatedStuffBasedOnUserInput has a bug, as it may well do, it can end up producing quite unexpected results, such as "EAT BABIES" and yet they're blessed as LiteralString anyway via this rationale.

Thus, the program user in fact gets arbitrary control over these strings in at least some cases, whereas that's definitively not the situation in languages where there's an actual literal string type. In exchange, Python gets to write "WO" + "RDS" and have that be a LiteralString whereas in the other languages it is not. I think that's a bad trade, despite being very clever.

A literal string type for Python

Posted Apr 19, 2022 4:58 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (4 responses)

Looking through Appendix C of the PEP (which lists the operations supported), I see removeprefix/removesuffix, but not slicing (__getitem__), so you would have to write something like s.removeprefix("DO NOT ") to get the outcome which you describe (i.e. you *can't* write s[:7] or something like that). If you explicitly write removeprefix("DO NOT "), then IMHO it's your own damn fault for removing a prefix which you apparently wanted to keep.

A literal string type for Python

Posted Apr 19, 2022 11:06 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (3 responses)

> you *can't* write s[:7] or something like that

If there were LiteralNumber, one might be able to do that, but without, there's no difference between a literal 7 and a 7 coming in from "the outside" through a variable. Though there are a number of other methods that take SupportsIndex that might now be suspicious to me…

A literal string type for Python

Posted Apr 19, 2022 15:23 UTC (Tue) by gbleaney (guest, #158077) [Link] (2 responses)

PEP author here. Appendix B provides a trivial function for turning any regular external string into a 'LiteralString':
https://peps.python.org/pep-0675/#appendix-b-limitations

If a developer wants to circumvent the protections of 'LiteralString', they can easily do it. They don't even need fancy functions like the example we gave, they can just add a '# pyre-ignore' (or equivalent lint suppression comment for their typechecker of choice). The goal is to protect against accidental mistakes, not malicious or implausible behaviour by developers.

A literal string type for Python

Posted Apr 24, 2022 13:39 UTC (Sun) by tialaramex (subscriber, #21167) [Link] (1 responses)

I guess my problem is that I'm less confident the problematic case is "implausible".

If I'm correct the proof of course would likely arrive too late. ie, this PEP succeeds, everybody gets used to the behaviour as documented, and then a hole is found in some code, say, a popular Django app, where users can manipulate a LiteralString so as to cause mischief. I'm certain that the instinct will be to blame the app programmer, but of course that's missing the whole point of these protections, programmers are human and as such lack foresight.

To be quite fair, the other way forward can also be dangerous. In C++ for example std::format() resolutely insists on a constant format string, so that's pretty safe (it needn't be a literal, but it can't be sensitive to user input as that's not constant), but it necessitates providing std::vformat() which does not take a constant format string, and so programmers may be tempted to call std::vformat() rather than re-factor some code to ensure the format strings are actually constant... Defensive programming is possible, maybe even encouraged, but it's probably easier to do the Wrong Thing™ in many cases than it should be.

A literal string type for Python

Posted Apr 25, 2022 7:50 UTC (Mon) by farnz (subscriber, #17727) [Link]

To a large extent, though, these sound like the same problem as unsafe in Rust; sure, I can wrap all sorts of crawling horrors in unsafe, and have a Safe Rust API on top so that when you look at my crate's documentation, it's not obvious that I've done this.

And similar to Unsafe Rust, the answer is tool-assisted review of code you're planning to use that highlights the areas of code that need extra attention - just as a Rust-aware review system calls out unsafe wherever it appears for extra human attention, so a Python-aware review system needs to call out manipulation of LiteralString that results in a LiteralString typed output for extra human attention.

A literal string type for Python

Posted Apr 14, 2022 7:26 UTC (Thu) by ovitters (guest, #27950) [Link] (2 responses)

I didn't read the article, but the problem reminds me of why Bugzilla uses the taint mode of Perl. It was really handy to catch loads of user input handling bugs. Despite that, Bugzilla still was affected by a few user input security issues. Basically things that were marked as safe that (despite reviews) were eventually proven not to be safe. I really appreciated the taint mode and it's surprising how few programming languages have a similar concept, given that handling user input is so common.

A literal string type for Python

Posted Apr 26, 2022 15:25 UTC (Tue) by nye (subscriber, #51576) [Link] (1 responses)

The problem with taint checking is that experience has shown that - even if it's always correct, which it often isn't - it leads a surprising number of programmers to assume that because the evil bit isn't set, then it must therefore be good. In other words, taint checking separates data into "definitely unsafe" and "might be safe assuming you're using it correctly, whatever that might mean", whereas many developers treat it as meaning "maybe unsafe" versus "definitely safe".

By restricting the feature to simply "is this a literal string, or derived from literal strings purely by means of concatenation"[0], the meaning is well-defined and easier to understand. In other words, it depends less upon programmer education, which is a strategy that has been repeatedly proven ineffective.

There is some discussion about this in https://wiki.php.net/rfc/is_literal if you're interested - that's the proposal for a very similar feature in PHP, which sadly did not pass for reasons I've not yet investigated.

[0] This PEP is a bit broader than that and does include some operations that create substrings, which makes me uncomfortable.

A literal string type for Python

Posted Nov 9, 2022 17:00 UTC (Wed) by craig.francis (guest, #162085) [Link]

Hi nye, bit weird to see you mention the PHP RFC for is_literal(), I'm the author :-)

I completely agree with everything you said - taint checking is flawed, concatenation is fine, and the extra functions PEP 675 includes make me feel a bit uncomfortable as well (but, to be fair, I cannot think of a vulnerability from them, I just can't say with 100% confidence they will be fine for every single context).

Anyway... I'm just looking at how the Python implementation works, now 3.11 is out, because I need to go back to the PHP Internals Developers to try again.

As I'm not a Python developer, do you think the following is a good example of this feature being used:

https://github.com/craigfrancis/php-is-literal-rfc/blob/m...

https://eiv.dev/python-pyre/

---

As to reasons for the PHP RFC rejection... it was not clear, most people who voted against did not comment (bit weird, considering RFC stands for "Request for Comments"), two people didn't want it to support string concatenation (they believe it would help find issues, but I've found that hasn't been the case; instead it does make adoption much easier due to the amount of existing code that uses concatenation), three people believe these checks should only be done by Static Analysis (the most optimistic stat I can find is 33% of PHP developers use Static Analysis[0], which I support, and can now be done with the `literal-string` type in Psalm and PHPStan, but I don't believe it will ever get to 100%), one person believed this should be solved through better documentation... and someone thought the idea was flawed, because a *malicious* developer could write the user-value into a new PHP file (e.g. `<?php return "$user_value"; ?>`), and execute it.

[0] https://www.jetbrains.com/lp/devecosystem-2021/php/

A literal string type for Python

Posted Apr 14, 2022 8:39 UTC (Thu) by NRArnot (subscriber, #3033) [Link]

Django has something similar. When ordinary strings are rendered into an HTML template, the HTML special characters are escaped ("<" into "&lt;" etc.). The programmer can create "safe" strings (type Safestring) which behave like ordinary strings in most ways, but which do not get HTML-escaped. They remain safe until they are combined with ordinary strings by concatenation etc.

A literal string type for Python

Posted Apr 14, 2022 19:17 UTC (Thu) by iabervon (subscriber, #722) [Link] (1 responses)

A related thing I'd like to see would be something like ft"SELECT * FROM data WHERE user_id = {user_id}", that evaluates to ("SELECT * FROM data WHERE user_id = {}", ("user123",)), and then conn.execute((LiteralString, args)) could use the fact that Python has support for parsing format strings and working with the result in more interesting ways than making a new str. That is, it would replace each substitution with a "?" and add the value to the arguments, instead of replacing the substitution with the string representation of the value.

As a side note, secure coders these days have gotten really bad at writing insecure code. The example in the PEP just won't work at all, since it'll result in returning rows where data.user_id = data.user123 rather than ones where data.user_id = 'user123'. The insecure code that people would actually write is f"SELECT * FROM data WHERE user_id = '{user_id}';", with a set of single quotes that the attack could close.
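The transformation described in this comment can be approximated today with the standard library's format-string parser. The to_parameterized() helper below is hypothetical and is not what PEP 501 or any database library provides. It is a minimal sketch that handles only named fields, ignores format specs and conversions, and goes straight to "?" placeholders.

```python
from string import Formatter
from typing import Mapping, Tuple


def to_parameterized(template: str, values: Mapping[str, object]) -> Tuple[str, tuple]:
    sql_parts = []
    params = []
    # Formatter.parse() yields (literal_text, field_name, format_spec, conversion).
    for literal_text, field_name, _format_spec, _conversion in Formatter().parse(template):
        sql_parts.append(literal_text)
        if field_name is not None:
            sql_parts.append("?")             # placeholder instead of the value
            params.append(values[field_name])  # value is passed separately
    return "".join(sql_parts), tuple(params)


sql, params = to_parameterized(
    "SELECT * FROM data WHERE user_id = {user_id}",
    {"user_id": "user123"},
)
print(sql)
print(params)
```

The resulting pair could then be handed to an execute() call that accepts a query and a parameter sequence.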

A literal string type for Python

Posted Apr 17, 2022 11:30 UTC (Sun) by felix.s (guest, #104710) [Link]

There is already a PEP for that, PEP 501. It was (as I recall) drafted in parallel with PEP 498, but the former was deferred because the Python developers wanted to get ‘further experience’ with naïve interpolation, as if they failed to understand that this will just incentivise developers to introduce bugs into their code by reaching for the injection-prone naïve f-strings instead, the very thing they are trying to paper over here. Or perhaps, if we take the deferral rationale seriously, they deliberately aimed for the PHP approach, where they first introduce a simplistic, half-baked design into the language, only to have to resort to painful after-the-fact fixes over the following years.

Meanwhile, JavaScript has had an equivalent to PEP 501 (tagged template strings) from the very start, added at the same time as naïve interpolation. I still find it hardly ideal (it's a bit too easy to simply forget the tag), and say what you want about the TC39, but at least they seem to understand developer incentives well. At this point, and I can’t believe I am saying this, I am starting to see JavaScript as a better Python than Python.


Copyright © 2022, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds