A literal string type for Python
Using strings with contents that are supplied by users can be fraught with peril; SQL injection is a well-known technique for attacking applications that stems from that, for example. Generally, database frameworks and libraries provide mechanisms that seek to lead programmers toward doing The Right Thing, with parameterized queries and the like, but they cannot enforce that—inventive developers will seemingly always find ways to inject user input into places it should not go. A recently adopted Python Enhancement Proposal (PEP) provides a way to enforce the use of strings that are untainted by user input, but it uses the optional typing features of the language to do so; those wanting to take advantage of it will need to be running a type-checking program.
PEP 675 ("Arbitrary Literal String Type") flew under the radar to a certain extent. It was discussed on the Python typing-sig mailing list, mostly back in January of this year, then posted to the python-dev mailing list in February, where there was little discussion. In March, it was accepted by the steering council for inclusion into Python 3.11, which is due in October. Gregory P. Smith had some interesting thoughts when he announced the acceptance on behalf of the council:
TL;DR - PEP 675 allows type checkers to help prevent bugs allowing attacker-controlled data to be passed to APIs that declare themselves as requiring literal, in-code strings.
This is a very thorough PEP with a compelling and highly relevant set of use cases. If I tried to call out all the things we like about it, it'd turn into a table of contents. It is long, but everything has a reason to be there. :)
Once implemented, we expect it to be a challenge to tighten widely used existing APIs that accept str today to be LiteralString for practical reasons of what existing code calling unrestricted APIs naturally does. The community would benefit from anyone who attempts to move a widely used existing str API to LiteralString sharing their experiences, successful or not.
As he notes, the feature will not magically fix SQL-injection vulnerabilities (or other, similar problems); it will take time to properly annotate existing APIs, for one thing. Beyond that, applications will need to be processed using one of the available Python static type checkers (e.g. mypy or pytype) and then be fixed to no longer work around the restrictions on passing literal strings.
SQL injection
The Motivation section of the PEP describes the problem well; here we adapt and condense some of its examples. Code vulnerable to a SQL injection might look something like the following:
def query_user(conn: Connection, user_id: str) -> User:
    query = f"SELECT * FROM data WHERE user_id = {user_id}"
    conn.execute(query)
    ...

query_user(conn, "user123")
query_user(conn, "user123; DROP TABLE data;")
query_user(conn, "user123 OR 1 = 1")
The query_user() function may seem reasonable at first glance and the first use of it works just fine. The other two show how the call could be misused if, for example, the user to be searched for is being read from a web form. The second call would delete the data table, while the third would retrieve all users (since 1 = 1 is always true). Anyone running a web application on today's web will be all-too-familiar with messages in their logs showing efforts to exploit this kind of problem. Of course, xkcd has also highlighted the problem in its inimitable style.
Database libraries generally have a mechanism to avoid that kind of programming error through the use of query parameters. For example:
def query_user(conn: Connection, user_id: str) -> User:
    query = "SELECT * FROM data WHERE user_id = ?"
    conn.execute(query, (user_id,))
    ...
When developers use that mechanism, the database library makes the substitution for "?" in a safe way; instead of tacking on the extra stuff as additional SQL, it will look for users whose ID is exactly the string passed to query_user(). If developers used query parameters everywhere, the problem would largely be solved, but the library cannot enforce that type of usage. It simply gets a string that it needs to execute; the library's documentation can admonish developers not to dynamically build the query string with user input, but that clearly has not stopped them from doing so.
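To make the difference concrete, here is a minimal sketch using the standard library's sqlite3 module with an in-memory database. The table contents, the function names query_user_unsafe() and query_user_safe(), and the quoting of the interpolated value (needed because user_id is stored as text here) are assumptions made for illustration; they do not come from the PEP.

```python
import sqlite3

# Throwaway in-memory database with two hypothetical users.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (user_id TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [("user123", "Alice"), ("user456", "Bob")],
)


def query_user_unsafe(conn, user_id):
    # Vulnerable: the user-supplied text becomes part of the SQL itself.
    query = f"SELECT * FROM data WHERE user_id = '{user_id}'"
    return conn.execute(query).fetchall()


def query_user_safe(conn, user_id):
    # Parameterized: the value is bound separately and never parsed as SQL.
    query = "SELECT * FROM data WHERE user_id = ?"
    return conn.execute(query, (user_id,)).fetchall()


malicious = "nobody' OR '1' = '1"
leaked = query_user_unsafe(conn, malicious)  # matches every row in the table
safe = query_user_safe(conn, malicious)      # matches no row at all
```

With the interpolated query, the crafted input turns the WHERE clause into a condition that is always true, so every row comes back; with the parameterized query, the same input is just an ID that does not exist.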
The PEP therefore adds a LiteralString type to the typing module, which lets developers annotate variables, function parameters, return values, and more to indicate that a string value must be composed of literal values. As an example from the PEP shows, the execute() method for a database API might be defined as follows:
from typing import LiteralString
def execute(self, sql: LiteralString, parameters: Iterable[str] = ...) -> Cursor: ...
That would cause uses of non-literal strings as parameters to the function to provoke a warning or error from the type checker:
def query_user(conn: Connection, user_id: str) -> User:
    query = f"SELECT * FROM data WHERE user_id = {user_id}"
    conn.execute(query)
    # Error: Expected LiteralString, got str.
In this example, the query string has been built from other data, so it may be susceptible to SQL-injection problems. An interpolated Python f-string is not a literal string if it is built with regular str variables, so a type checker can infer that query is not one either and complain. There is a somewhat common pattern, which is often completely benign, that the PEP also takes into account. For example:
def query_user(conn: Connection, user_id: str, limit: bool) -> User:
    query = "SELECT * FROM data WHERE user_id = ?"
    if limit:
        query += " LIMIT 1"
    conn.execute(query, (user_id,))
In that version of query_user(), query is still a LiteralString even after the limit clause has been appended, because concatenating two LiteralString types is a LiteralString. One could even expand that further by turning the "LIMIT 1" into "LIMIT ?" and building up a list of parameter values to pass to execute(). Well-written code already does those sorts of things, so the adjustment to the addition of LiteralString annotations should be minimal—for those code bases, at least. Those projects will benefit when someone slips up and inserts a potential injection into the code; the next time the type checker is run, it will loudly point out the problem.
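The "LIMIT ?" variation mentioned above might look roughly like the following sketch. The build_query() helper and its signature are hypothetical. LiteralString is imported only under TYPE_CHECKING and referenced through quoted annotations, which is an assumption made here so that the code also runs on interpreters older than 3.11.

```python
from typing import TYPE_CHECKING, List, Optional

if TYPE_CHECKING:
    # Only the type checker needs this; typing.LiteralString arrives in 3.11.
    from typing import LiteralString


def build_query(user_id: str, limit: Optional[int]) -> "tuple[LiteralString, List[object]]":
    # The SQL text is assembled only from literals, so a type checker
    # can keep treating it as a LiteralString after the concatenation.
    query: "LiteralString" = "SELECT * FROM data WHERE user_id = ?"
    # User-controlled values travel separately, as query parameters.
    params: List[object] = [user_id]
    if limit is not None:
        query += " LIMIT ?"
        params.append(limit)
    return query, params
```

The returned pair could then be handed to an execute() method annotated as in the PEP's example; the query text stays literal no matter what the caller passes for user_id or limit.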
Meanwhile, the technique can be applied more widely, as the PEP indicates:
LiteralString is also useful in other cases where we want strict command-data separation, such as when building shell commands or when rendering a string into an HTML response without escaping (see Appendix A: Other Uses). Overall, this combination of strictness and flexibility makes it easy to enforce safer API usage in sensitive code without burdening users.
Why use a type?
At first glance, using the type system to enforce this kind of behavior might seem like an odd choice, but the Rationale section of the PEP describes the other options and shows why the type system actually makes the most sense, at least for Python. A run-time approach would need to rely on heuristics that are imperfect (and highly use-case dependent). A static analyzer that looked at the abstract syntax tree to try to spot problems of this sort would be overly restrictive because it cannot determine "when a string is assigned to an intermediate variable or when it is transformed by a benign function". The type checker is better placed:
The type checker, surprisingly, does better than both because it has access to information not available in the runtime or static analysis approaches. Specifically, the type checker can tell us whether an expression has a literal string type, say Literal["foo"]. The type checker already propagates types across variable assignments or function calls.
The Literal type has been present since Python 3.8. It allows specifying the legal values for a given type; for example, a parameter that can only be a few specific values might be: Literal["a", "b", "c"]. A parameter of that type that was passed "d" would fail the type check, as would passing a regular string type:
def foo(x: Literal["a", "b"]) -> None: ...
foo("a") # works
a_var = "a string"
foo(a_var) # type failure
So Literal might be useful in parts of the application, but a SQL query can contain any string, so LiteralString was born:
LiteralString is the "supertype" of all literal string types. In effect, this PEP just introduces a type in the type hierarchy between Literal["foo"] and str. Any particular literal string, such as Literal["foo"] or Literal["bar"], is compatible with LiteralString, but not the other way around. The "supertype" of LiteralString itself is str. So, LiteralString is compatible with str, but not the other way around.
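The compatibility rules in that passage can be sketched with three hypothetical functions, one per level of the hierarchy. The function names are invented for illustration, and the comments describe what a PEP 675-aware type checker would be expected to report. LiteralString is again imported only under TYPE_CHECKING (an assumption that keeps the sketch runnable before 3.11).

```python
from typing import TYPE_CHECKING, Literal

if TYPE_CHECKING:
    # Only the type checker needs this; typing.LiteralString arrives in 3.11.
    from typing import LiteralString


def takes_foo(s: Literal["foo"]) -> str:
    return s


def takes_literal(s: "LiteralString") -> str:
    return s


def takes_str(s: str) -> str:
    return s


def demo(user_input: str) -> list:
    return [
        takes_foo("foo"),           # OK: exactly the literal that is required
        takes_literal("foo"),       # OK: Literal["foo"] is compatible with LiteralString
        takes_str("foo"),           # OK: LiteralString is compatible with str
        takes_literal("bar"),       # OK: any literal string qualifies
        takes_foo("bar"),           # checker error: Literal["bar"] is not Literal["foo"]
        takes_literal(user_input),  # checker error: a plain str is not a LiteralString
        takes_str(user_input),      # OK
    ]
```

All seven calls run without complaint in the interpreter; only the type checker objects to the two marked lines, since annotations are not enforced at run time.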
Due to overloading the types allowed for many of the standard str methods, as described in Appendix C, the LiteralString type can be preserved through various kinds of string transformations. For example:
def bar(x: LiteralString, y: LiteralString) -> LiteralString:
    # str.join() takes a single iterable, so the two values go in a tuple
    return ", ".join((x, y))
That works to join the two parameters with a ", " because that separator string is literal, as it is not composed from other strings, and the join() arguments also have the type LiteralString. It might seem like the language is "casting" the value to the return type, but that is not happening here; Python type annotations are not used by the language at run time. That kind of type overloading also applies to f-strings, so those can still produce a LiteralString:
def baz(x: List[LiteralString], y: List[LiteralString]) -> LiteralString:
    xout = " ".join(x)
    yout = "+".join(y)
    return f"({xout}) '{yout}'"

baz(["a", "b"], ["c", "d"]) # produces "(a b) 'c+d'" as a LiteralString
So the two lists of literal strings are combined using join() and an f-string to produce something that is still a LiteralString. There is a multi-line example of using that in SQL-query context in the Rejected Alternatives section of the PEP; it turns out that some existing tools for catching SQL-injection errors cannot support that kind of benign query-string construction. Note that the first example in the article could perhaps be made to "work" by declaring user_id as a LiteralString instead of a str, though passing data that came in from the outside would fail a type check without some kind of gyrations by the programmer.
Those kinds of gyrations are explicitly listed in the Limitations section. There will always be ways to evade the checks—by not running a type checker to start with—but the PEP is not meant to stop that kind of thing:
[...] ultimately a clever, malicious developer attempting to circumvent the protections offered by LiteralString will always succeed. The important thing to remember is that LiteralString is not intended to protect against malicious developers; it is meant to protect against benign developers accidentally using sensitive APIs in a dangerous way (without getting in their way otherwise).
Without LiteralString, the best enforcement tool API authors have is documentation, which is easily ignored and often not seen. With LiteralString, API misuse requires conscious thought and artifacts in the code that reviewers and future developers can notice.
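As one illustration of such an artifact, the sketch below uses typing.cast(), a general escape hatch in Python's typing system, to push a dynamically built string past a LiteralString annotation. The run_query() and lookup() functions are hypothetical stand-ins for a sensitive API and its caller, and the TYPE_CHECKING import is an assumption that keeps the code runnable before 3.11.

```python
from typing import TYPE_CHECKING, cast

if TYPE_CHECKING:
    # Only the type checker needs this; typing.LiteralString arrives in 3.11.
    from typing import LiteralString


def run_query(sql: "LiteralString") -> str:
    # Hypothetical stand-in for a sensitive API such as execute().
    return f"executing: {sql}"


def lookup(user_id: str) -> str:
    unsafe = "SELECT * FROM data WHERE user_id = " + user_id
    # Passing `unsafe` directly should be rejected by a type checker,
    # because str + str is a plain str. The cast silences that check;
    # at run time it simply returns its argument unchanged.
    return run_query(cast("LiteralString", unsafe))
```

The injection is still possible, but the cast sits in the source as a visible marker that someone deliberately overrode the check, which is the kind of thing a reviewer can spot.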
Reception
In a reply to the announcement of the PEP's acceptance, Neil Schemenauer applauded the idea, noting that he had done something along the same lines years ago:
I did something like this for HTML templating in the Quixote web framework (to avoid XSS [cross-site scripting] bugs). I did it as a special kind of module with a slightly different compiler (using AST transform). With the LiteralString feature, I can implement the same kind of thing directly in Python.
Quixote has an htmltext class for strings that are considered safe because they have already been properly escaped for HTML, unlike regular strings which are generally treated as still needing to be escaped. The template system that he describes provides a way to generate HTML directly from functions, while escaping parameters and the like, so that cross-site scripting problems are avoided. Quixote was one of the earlier Python web frameworks and is still in use, including on this site.
Unlike with some other PEPs, there has not been a lot of discussion of the merits of the idea or of changes needed in the text of the PEP. Some of that was done in the typing-sig thread, but much of it was something of a bikeshed exercise for the name of the type. As Smith noted, the PEP authors, Pradeep Kumar Srinivasan and Graham Bleaney, along with the core developer sponsor, Jelle Zijlstra, have done a nice job in creating a compelling case for the feature. In addition, they did so in a way that allows well-behaved code to just continue to work.
Over the years, the Python typing features have grown and matured since their introduction as "type hints" back in 2014. PEP 484 ("Type Hints") added the feature to Python 3.5 and every release since then has added more to the feature. There is a long list of typing PEPs that have been adopted; PEP 675 will soon join the party.
While the typing features (and the use of type checkers) are optional for Python, the language is clearly headed toward more widespread use of types. The rise of typed Python has led to some discomfort in the ecosystem: a worry that types are starting a slow move toward dominating decisions for the language, to the point where those who are not interested in types, per se, are being left behind. So far, at least, that does not really seem to be the case; the steering council has been trying to balance the needs of the different groups of users and has generally been successful in doing so.
There are, of course, many constituents in the pro-type camp. Some larger applications have already retrofitted type features into their code and more of that work is ongoing. Any new Python project, especially one that looks likely to grow into a significant code base, should likely carefully consider adding a type checker into the mix. There are, it seems, lots of benefits and few real downsides at this point. LiteralString will only add to those benefits over time.
