A literal string type for Python
Using strings with contents that are supplied by users can be fraught with peril; SQL injection is a well-known technique for attacking applications that stems from that, for example. Generally, database frameworks and libraries provide mechanisms that seek to lead programmers toward doing The Right Thing, with parameterized queries and the like, but they cannot enforce that—inventive developers will seemingly always find ways to inject user input into places it should not go. A recently adopted Python Enhancement Proposal (PEP) provides a way to enforce the use of strings that are untainted by user input, but it uses the optional typing features of the language to do so; those wanting to take advantage of it will need to be running a type-checking program.
PEP 675 ("Arbitrary Literal String Type") flew under the radar to a certain extent. It was discussed on the Python typing-sig mailing list, mostly back in January of this year, then posted to the python-dev mailing list in February, where there was little discussion. In March, it was accepted by the steering council for inclusion into Python 3.11, which is due in October. Gregory P. Smith had some interesting thoughts when he announced the acceptance on behalf of the council:
TL;DR - PEP 675 allows type checkers to help prevent bugs allowing attacker-controlled data to be passed to APIs that declare themselves as requiring literal, in-code strings.
This is a very thorough PEP with a compelling and highly relevant set of use cases. If I tried to call out all the things we like about it, it'd turn into a table of contents. It is long, but everything has a reason to be there. :)
Once implemented, we expect it to be a challenge to tighten widely used existing APIs that accept str today to be LiteralString for practical reasons of what existing code calling unrestricted APIs naturally does. The community would benefit from anyone who attempts to move a widely used existing str API to LiteralString sharing their experiences, successful or not.
As he notes, the feature will not magically fix SQL-injection vulnerabilities (or other, similar problems); it will take time to properly annotate existing APIs, for one thing. Beyond that, applications will need to be processed using one of the available Python static type checkers (e.g. mypy or pytype) and then be fixed to no longer work around the restrictions on passing literal strings.
SQL injection
The Motivation section of the PEP describes the problem well; here we adapt and condense some of its examples. Code vulnerable to a SQL injection might look something like the following:
    def query_user(conn: Connection, user_id: str) -> User:
        query = f"SELECT * FROM data WHERE user_id = {user_id}"
        conn.execute(query)
        ...

    query_user(conn, "user123")
    query_user(conn, "user123; DROP TABLE data;")
    query_user(conn, "user123 OR 1 = 1")
The query_user() function may seem reasonable at first glance and the first use of it works just fine. The other two show how the call could be misused if, for example, the user to be searched for is being read from a web form. The second call would delete the data table, while the third would retrieve all users (since 1 = 1 is always true). Anyone running a web application on today's web will be all-too-familiar with messages in their logs showing efforts to exploit this kind of problem. Of course, xkcd has also highlighted the problem in its inimitable style.
Database libraries generally have a mechanism to avoid that kind of programming error through the use of query parameters. For example:
    def query_user(conn: Connection, user_id: str) -> User:
        query = "SELECT * FROM data WHERE user_id = ?"
        conn.execute(query, (user_id,))
        ...
When developers use that mechanism, the database library makes the substitution for "?" in a safe way; instead of tacking on the extra stuff as additional SQL, it will look for users with names that are exactly what is in the string passed to query_user(). If developers would use query parameters everywhere, the problem would largely be solved, but the library cannot enforce that type of usage. It simply gets a string that it needs to execute; the library's documentation can admonish developers not to dynamically build the query string with user input, but that clearly has not stopped them from doing so.
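The difference between the two approaches can be seen in a small, self-contained sketch. It uses the standard-library sqlite3 module with an in-memory database; the table layout, the function names, and the quoted form of the vulnerable query are assumptions made for illustration and are not taken from the PEP.

```python
import sqlite3

# In-memory database with two rows, purely for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (user_id TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [("user123", "alpha"), ("user456", "beta")],
)


def query_unsafe(user_id: str) -> list:
    # Vulnerable: the user-supplied text becomes part of the SQL itself.
    query = f"SELECT * FROM data WHERE user_id = '{user_id}'"
    return conn.execute(query).fetchall()


def query_safe(user_id: str) -> list:
    # The driver binds user_id as a value; it is never parsed as SQL.
    query = "SELECT * FROM data WHERE user_id = ?"
    return conn.execute(query, (user_id,)).fetchall()


attack = "nobody' OR '1' = '1"
unsafe_rows = query_unsafe(attack)  # the OR clause matches every row
safe_rows = query_safe(attack)      # no user has that exact name
```

With the interpolated query, the attack string returns both rows; with the parameterized query it returns none, because the whole string is compared against user_id as a single value.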
So the PEP adds a LiteralString type to the typing module to allow developers to annotate variables, function parameters, return values, and more to indicate that the string value must be composed of literal values. So, as an example from the PEP shows, the execute() member function for a database API might be defined as follows:
    from typing import LiteralString

    def execute(self, sql: LiteralString, parameters: Iterable[str] = ...) -> Cursor: ...
That would cause uses of non-literal strings as parameters to the function to provoke a warning or error from the type checker:
    def query_user(conn: Connection, user_id: str) -> User:
        query = f"SELECT * FROM data WHERE user_id = {user_id}"
        conn.execute(query)  # Error: Expected LiteralString, got str.
In this example, the query string has been built from other data, thus it may be susceptible to SQL-injection problems. An interpolated Python f-string is not a literal string if it is built with regular str variables, so a type checker can infer that query is not either, thus it can complain. There is a somewhat common pattern, which is often completely benign, that the PEP also takes into account. For example:
    def query_user(conn: Connection, user_id: str, limit: bool) -> User:
        query = "SELECT * FROM data WHERE user_id = ?"
        if limit:
            query += " LIMIT 1"
        conn.execute(query, (user_id,))
In that version of query_user(), query is still a LiteralString even after the limit clause has been appended, because concatenating two LiteralString types is a LiteralString. One could even expand that further by turning the "LIMIT 1" into "LIMIT ?" and building up a list of parameter values to pass to execute(). Well-written code already does those sorts of things, so the adjustment to the addition of LiteralString annotations should be minimal—for those code bases, at least. Those projects will benefit when someone slips up and inserts a potential injection into the code; the next time the type checker is run, it will loudly point out the problem.
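The "LIMIT ?" variant mentioned above might look something like the following sketch. The build_user_query() helper and the table contents are hypothetical, and the fallback alias for LiteralString exists only so that the code runs on interpreters older than 3.11; it disables the check there. The query text is assembled only from literals, while every user-supplied value travels in the parameter list.

```python
import sqlite3
from typing import List, Optional, Tuple

try:
    from typing import LiteralString  # available from Python 3.11
except ImportError:  # older interpreters: alias keeps the sketch runnable
    LiteralString = str


def build_user_query(
    user_id: str, limit: Optional[int] = None
) -> Tuple[LiteralString, List[object]]:
    # Only literal fragments are concatenated into the SQL text.
    query: LiteralString = "SELECT * FROM data WHERE user_id = ?"
    params: List[object] = [user_id]
    if limit is not None:
        query += " LIMIT ?"
        params.append(limit)  # the limit is bound as a value, not spliced in
    return query, params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (user_id TEXT, note TEXT)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [("user123", "a"), ("user123", "b"), ("user456", "c")],
)

query, params = build_user_query("user123", limit=1)
rows = conn.execute(query, params).fetchall()
```

A type checker that implements the PEP would be expected to accept this, since the += only ever appends another literal.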
Meanwhile, the technique can be applied more widely, as the PEP indicates:
LiteralString is also useful in other cases where we want strict command-data separation, such as when building shell commands or when rendering a string into an HTML response without escaping (see Appendix A: Other Uses). Overall, this combination of strictness and flexibility makes it easy to enforce safer API usage in sensitive code without burdening users.
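As a rough illustration of the HTML case, the sketch below separates a trusted template from untrusted values. The render() function is hypothetical (it is not from the PEP or any library); it simply escapes each value with the standard html.escape() before substituting it. The fallback alias only keeps the code runnable before Python 3.11.

```python
import html

try:
    from typing import LiteralString  # available from Python 3.11
except ImportError:  # older interpreters: alias keeps the sketch runnable
    LiteralString = str


def render(template: LiteralString, **values: str) -> str:
    """Fill a trusted, in-code template with escaped, untrusted values."""
    escaped = {name: html.escape(value) for name, value in values.items()}
    return template.format(**escaped)


page = render("<p>Hello, {name}!</p>", name="<script>alert(1)</script>")
```

Because the template parameter is annotated as LiteralString, a checker would be expected to flag any caller that builds the template itself from request data, while the values remain free-form strings.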
Why use a type?
At first glance, using the type system to enforce this kind of behavior might seem like an odd choice, but the Rationale section of the PEP describes the other options and shows why the type system actually makes the most sense, at least for Python. A run-time approach would need to rely on heuristics that are imperfect (and highly use-case dependent). A static analyzer that looked at the abstract syntax tree to try to spot problems of this sort would be overly restrictive because it cannot determine "when a string is assigned to an intermediate variable or when it is transformed by a benign function". The type checker is better placed:
The type checker, surprisingly, does better than both because it has access to information not available in the runtime or static analysis approaches. Specifically, the type checker can tell us whether an expression has a literal string type, say Literal["foo"]. The type checker already propagates types across variable assignments or function calls.
The Literal type has been present since Python 3.8. It allows specifying the legal values for a given type; for example, a parameter that can only be a few specific values might be: Literal["a", "b", "c"]. A parameter of that type that was passed "d" would fail the type check, as would passing a regular string type:
    def foo(x: Literal["a", "b"]) -> None: ...

    foo("a")  # works
    a_var = "a string"
    foo(a_var)  # type failure
So Literal might be useful in parts of the application, but a SQL query can contain any string, so LiteralString was born:
LiteralString is the "supertype" of all literal string types. In effect, this PEP just introduces a type in the type hierarchy between Literal["foo"] and str. Any particular literal string, such as Literal["foo"] or Literal["bar"], is compatible with LiteralString, but not the other way around. The "supertype" of LiteralString itself is str. So, LiteralString is compatible with str, but not the other way around.
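The hierarchy described in that passage can be sketched with three functions, each accepting one level of it. The function names are made up for illustration, and the comments describe what a PEP 675-aware checker is expected to report; nothing here is enforced at run time.

```python
from typing import Literal

try:
    from typing import LiteralString  # available from Python 3.11
except ImportError:  # older interpreters: alias keeps the sketch runnable
    LiteralString = str


def only_foo(x: Literal["foo"]) -> str:
    return x


def any_literal(x: LiteralString) -> str:
    return x


def any_string(x: str) -> str:
    return x


def demo(user_input: str) -> None:
    only_foo("foo")          # fine: exactly Literal["foo"]
    any_literal("foo")       # fine: Literal["foo"] is compatible with LiteralString
    any_literal("bar")       # fine: any literal string qualifies
    any_string("foo")        # fine: a literal is also a str
    any_string(user_input)   # fine: plain str
    # The two calls below are the ones a checker is expected to reject:
    #   only_foo("bar")          -- a different literal
    #   any_literal(user_input)  -- str is not compatible with LiteralString
```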
Due to overloading the types allowed for many of the standard str methods, as described in Appendix C, the LiteralString type can be preserved through various kinds of string transformations. For example:
    def bar(x: LiteralString, y: LiteralString) -> LiteralString:
        return ", ".join((x, y))
That works to join the two parameters with a ", " because that separator string is literal, as it is not composed from other strings, and the join() arguments also have the type LiteralString. It might seem like the language is "casting" the value to the return type, but that is not happening here; Python type annotations are not used by the language at run time. That kind of type overloading also applies to f-strings, so those can still produce a LiteralString:
    def baz(x: List[LiteralString], y: List[LiteralString]) -> LiteralString:
        xout = " ".join(x)
        yout = "+".join(y)
        return f"({xout}) '{yout}'"

    baz(["a", "b"], ["c", "d"])  # produces "(a b) 'c+d'" as a LiteralString
So the two lists of literal strings are combined using join() and an f-string to produce something that is still a LiteralString. There is a multi-line example of using that in a SQL-query context in the Rejected Alternatives section of the PEP; it turns out that some existing tools for catching SQL-injection errors cannot support that kind of benign query-string construction. Note that the first example in the article could perhaps be made to "work" by declaring user_id as a LiteralString instead of a str, though passing data that came in from the outside would fail a type check without some kind of gyrations by the programmer.
Those kinds of gyrations are explicitly listed in the Limitations section. There will always be ways to evade the checks—by not running a type checker to start with—but the PEP is not meant to stop that kind of thing:
[...] ultimately a clever, malicious developer attempting to circumvent the protections offered by LiteralString will always succeed. The important thing to remember is that LiteralString is not intended to protect against malicious developers; it is meant to protect against benign developers accidentally using sensitive APIs in a dangerous way (without getting in their way otherwise).
Without LiteralString, the best enforcement tool API authors have is documentation, which is easily ignored and often not seen. With LiteralString, API misuse requires conscious thought and artifacts in the code that reviewers and future developers can notice.
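One such artifact is an explicit cast. In the sketch below, run_query() and lookup() are hypothetical names; typing.cast() is the standard escape hatch that tells a checker to treat a value as a given type without changing it at run time. The point is only that the bypass is visible in the source, where a reviewer can question it.

```python
from typing import cast

try:
    from typing import LiteralString  # available from Python 3.11
except ImportError:  # older interpreters: alias keeps the sketch runnable
    LiteralString = str


def run_query(sql: LiteralString) -> str:
    # Stand-in for a sensitive API; it only echoes the SQL it was given.
    return sql


def lookup(table_name: str) -> str:
    # Passing the concatenated str directly is expected to fail type
    # checking, so the author must leave a visible cast to get past it.
    sql = cast(LiteralString, "SELECT * FROM " + table_name)
    return run_query(sql)


result = lookup("data")
```

A suppression comment understood by the chosen type checker would have the same effect, and would be just as easy to search for during review.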
Reception
In a reply to the announcement of the PEP's acceptance, Neil Schemenauer applauded the idea, noting that he had done something along the same lines years ago:
I did something like this for HTML templating in the Quixote web framework (to avoid XSS [cross-site scripting] bugs). I did it as a special kind of module with a slightly different compiler (using AST transform). With the LiteralString feature, I can implement the same kind of thing directly in Python.
Quixote has an htmltext class for strings that are considered safe because they have already been properly escaped for HTML, unlike regular strings which are generally treated as still needing to be escaped. The template system that he describes provides a way to generate HTML directly from functions, while escaping parameters and the like, so that cross-site scripting problems are avoided. Quixote was one of the earlier Python web frameworks and is still in use, including on this site.
Unlike with some other PEPs, there has not been a lot of discussion of the merits of the idea or of changes needed in the text of the PEP. Some of that was done in the typing-sig thread, but much of it was something of a bikeshed exercise for the name of the type. As Smith noted, the PEP authors, Pradeep Kumar Srinivasan and Graham Bleaney, along with the core developer sponsor, Jelle Zijlstra, have done a nice job in creating a compelling case for the feature. In addition, they did so in a way that allows well-behaved code to just continue to work.
Over the years, the Python typing features have grown and matured since their introduction as "type hints" back in 2014. PEP 484 ("Type Hints") added the feature to Python 3.5 and every release since then has added more to the feature. There is a long list of typing PEPs that have been adopted; PEP 675 will soon join the party.
While the typing features (and the use of type checkers) are optional for Python, the language is clearly headed toward more widespread use of types. The rise of typed Python has led to some discomfort in the ecosystem that types are starting a slow move toward dominating decisions for the language, to the point where those who are not interested in types, per se, are being left behind. So far, at least, that does not really seem to be the case; the steering council has been trying to balance the needs of the different groups of users and has generally been successful in doing so.
There are, of course, many constituents in the pro-type camp. Some larger applications have already retrofitted type features into their code and more of that work is ongoing. Any new Python project, especially one that looks likely to grow into a significant code base, should likely carefully consider adding a type checker into the mix. There are, it seems, lots of benefits and few real downsides at this point. LiteralString will only add to those benefits over time.
Posted Apr 14, 2022 1:20 UTC (Thu)
by milesrout (subscriber, #126894)
Posted Apr 14, 2022 16:31 UTC (Thu)
by tialaramex (subscriber, #21167)
"EAT BABIES" // clearly expresses your intent, it's your fault, you wrote that.
"EAT" + " BABIES" // I can see why they felt like they should make this work, they've cited examples that do this and it genuinely is still clear at this point what your intent was, although I think it should be discouraged anyway.
doComplicatedStuffBasedOnUserInput("DO NOT EAT BABIES", input) // this still type checks as LiteralString, and might be EAT BABIES yet we can hardly claim now that we're reflecting clear programmer intent when that happens.
The reason to want literals here rather than allowing arbitrary strings is to get closer to requiring intent. I'd rather give up the second example than, as this PEP does, allow the third example the opportunity to set fire to everything and pretend that's "safe".
Rust of necessity has to require actual literals in formatting (not merely constant strings) because the formatting work is done via the macro system, and the macro system can't see inside variables. But I think even though more sophisticated behaviour would be welcomed by many Rust programmers I personally prefer the literal requirement.
Posted Apr 14, 2022 16:45 UTC (Thu)
by mb (subscriber, #50428)
Posted Apr 15, 2022 20:28 UTC (Fri)
by tialaramex (subscriber, #21167)
Posted Apr 16, 2022 8:56 UTC (Sat)
by mb (subscriber, #50428)
It's not supposed to prevent the programmer from hardcoding the wrong string.
Posted Apr 16, 2022 21:28 UTC (Sat)
by tialaramex (subscriber, #21167)
doComplicatedStuffBasedOnUserInput("DO NOT EAT BABIES", input)
"DO NOT EAT BABIES" is blessed as a LiteralString because it is. No problem so far. But "cleverly" this proposal allows operations (such as truncation, concatenation, duplication and splitting) on LiteralString to produce a LiteralString, and so if doComplicatedStuffBasedOnUserInput has a bug, as it may well do, it can end up producing quite unexpected results, such as "EAT BABIES" and yet they're blessed as LiteralString anyway via this rationale.
Thus, the program user in fact gets arbitrary control over these strings in at least some cases, whereas that's definitively not the situation in languages where there's an actual literal string type. In exchange, Python gets to write "WO" + "RDS" and have that be a LiteralString whereas in the other languages it is not. I think that's a bad trade, despite being very clever.
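The concern raised here can be made concrete with a short sketch. It assumes, based on the overloads listed in the PEP's Appendix C, that slicing a LiteralString with an ordinary int is still typed as LiteralString; the function name and the idea that the offset comes from user input are illustrative only.

```python
try:
    from typing import LiteralString  # available from Python 3.11
except ImportError:  # older interpreters: alias keeps the sketch runnable
    LiteralString = str


def do_complicated_stuff(message: LiteralString, start: int) -> LiteralString:
    # The slice of a LiteralString is expected to type-check as a
    # LiteralString, even though `start` is an unrestricted int.
    return message[start:]


# If `start` were derived from user input, the user would decide which
# part of the literal survives.
result = do_complicated_stuff("DO NOT EAT BABIES", 7)
```

Here result is "EAT BABIES", a string that never appears as a whole literal in the source, yet it carries the LiteralString type.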
Posted Apr 19, 2022 4:58 UTC (Tue)
by NYKevin (subscriber, #129325)
Posted Apr 19, 2022 11:06 UTC (Tue)
by mathstuf (subscriber, #69389)
If there were LiteralNumber, one might be able to do that, but without, there's no difference between a literal 7 and a 7 coming in from "the outside" through a variable. Though there are a number of other methods that take SupportsIndex that might now be suspicious to me…
Posted Apr 19, 2022 15:23 UTC (Tue)
by gbleaney (guest, #158077)
If a developer wants to circumvent the protections of 'LiteralString', they can easily do it. They don't even need fancy functions like the example we gave, they can just add a '# pyre-ignore' (or equivalent lint suppression comment for their typechecker of choice). The goal is to protect against accidental mistakes, not malicious or implausible behaviour by developers.
Posted Apr 24, 2022 13:39 UTC (Sun)
by tialaramex (subscriber, #21167)
If I'm correct, the proof would of course likely arrive too late; i.e., this PEP succeeds, everybody gets used to the behaviour as documented, and then a hole is found in some code, say, a popular Django app, where users can manipulate a LiteralString so as to cause mischief. I'm certain that the instinct will be to blame the app programmer, but of course that's missing the whole point of these protections: programmers are human and as such lack foresight.
To be quite fair, the other way forward can also be dangerous. In C++ for example std::format() resolutely insists on a constant format string, so that's pretty safe (it needn't be a literal, but it can't be sensitive to user input as that's not constant), but it necessitates providing std::vformat() which does not take a constant format string, and so programmers may be tempted to call std::vformat() rather than re-factor some code to ensure the format strings are actually constant... Defensive programming is possible, maybe even encouraged, but it's probably easier to do the Wrong Thing™ in many cases than it should be.
Posted Apr 25, 2022 7:50 UTC (Mon)
by farnz (subscriber, #17727)
To a large extent, though, these sound like the same problem as unsafe in Rust; sure, I can wrap all sorts of crawling horrors in unsafe, and have a Safe Rust API on top so that when you look at my crate's documentation, it's not obvious that I've done this.
And similar to Unsafe Rust, the answer is tool-assisted review of code you're planning to use that highlights the areas of code that need extra attention - just as a Rust-aware review system calls out unsafe wherever it appears for extra human attention, so a Python-aware review system needs to call out manipulation of LiteralString that results in a LiteralString typed output for extra human attention.
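As a very rough sketch of what such a review aid could look like, the following uses the standard ast module to list functions whose return annotation is LiteralString, so that a human can look at how they derive their result. This is a hypothetical helper written for illustration, not an existing tool, and it only inspects annotations; it does not understand what the function bodies do.

```python
import ast

# Sample source to scan; in practice this would be read from a file.
SOURCE = '''
def shout(msg: LiteralString) -> LiteralString:
    return msg.upper()

def label(user_id: str) -> str:
    return "user " + user_id
'''


def functions_returning_literal_string(source: str) -> list:
    """Return the names of functions annotated as returning LiteralString."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            returns = node.returns
            # Accept both `LiteralString` and `typing.LiteralString`.
            name = getattr(returns, "id", None) or getattr(returns, "attr", None)
            if name == "LiteralString":
                flagged.append(node.name)
    return flagged


flagged = functions_returning_literal_string(SOURCE)
```

For the sample source, only shout() is flagged; label() returns a plain str and is left alone.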
Posted Apr 14, 2022 7:26 UTC (Thu)
by ovitters (guest, #27950)
Posted Apr 26, 2022 15:25 UTC (Tue)
by nye (subscriber, #51576)
By restricting the feature to simply "is this a literal string, or derived from literal strings purely by means of concatenation"[0], the meaning is well-defined and easier to understand. In other words, it depends less upon programmer education, which is a strategy that has been repeatedly proven ineffective.
There is some discussion about this in https://wiki.php.net/rfc/is_literal if you're interested - that's the proposal for a very similar feature in PHP, which sadly did not pass for reasons I've not yet investigated.
[0] This PEP is a bit broader than that and does include some operations that create substrings, which makes me uncomfortable.
Posted Nov 9, 2022 17:00 UTC (Wed)
by craig.francis (guest, #162085)
I completely agree with everything you said - taint checking is flawed, concatenation is fine, and the extra functions PEP 675 includes make me feel a bit uncomfortable as well (but, to be fair, I cannot think of a vulnerability from them, I just can't say with 100% confidence they will be fine for every single context).
Anyway... I'm just looking at how the Python implementation works, now 3.11 is out, because I need to go back to the PHP Internals Developers to try again.
As I'm not a Python developer, do you think the following is a good example of this feature being used:
https://github.com/craigfrancis/php-is-literal-rfc/blob/m...
---
As to reasons for the PHP RFC rejection... it was not clear, most people who voted against did not comment (bit weird, considering RFC stands for "Request for Comments"), two people didn't want it to support string concatenation (they believe it would help find issues, but I've found that hasn't been the case; instead it does make adoption much easier due to the amount of existing code that uses concatenation), three people believe these checks should only be done by Static Analysis (the most optimistic stat I can find is 33% of PHP developers use Static Analysis[0], which I support, and can now be done with the `literal-string` type in Psalm and PHPStan, but I don't believe it will ever get to 100%), one person believed this should be solved through better documentation... and someone thought the idea was flawed, because a *malicious* developer could write the user-value into a new PHP file (e.g. `<?php return "$user_value"; ?>`), and execute it.
Posted Apr 14, 2022 8:39 UTC (Thu)
by NRArnot (subscriber, #3033)
Posted Apr 14, 2022 19:17 UTC (Thu)
by iabervon (subscriber, #722)
As a side note, secure coders these days have gotten really bad at writing insecure code. The example in the PEP just won't work at all, since it'll result in returning rows where data.user_id = data.user123 rather than ones where data.user_id = 'user123'. The insecure code that people would actually write is f"SELECT * FROM data WHERE user_id = '{user_id}';", with a set of single quotes that the attack could close.
Posted Apr 17, 2022 11:30 UTC (Sun)
by felix.s (guest, #104710)
There is already a PEP for that, PEP 501. It was (as I recall) drafted in parallel with PEP 498, but the former was deferred because the Python developers wanted to get ‘further experience’ with naïve interpolation, as if they failed to understand that this will just incentivise developers to introduce bugs into their code by reaching for the injection-prone naïve f-strings instead, the very thing they are trying to paper over here. Or perhaps, if we take the deferral rationale seriously, they deliberately aimed for the PHP approach, where they first introduce a simplistic, half-baked design into the language, only to have to resort to painful after-the-fact fixes over the following years.
Meanwhile, JavaScript has had an equivalent to PEP 501 (tagged template strings) from the very start, added at the same time as naïve interpolation. I still find it hardly ideal (it's a bit too easy to simply forget the tag), and say what you want about the TC39, but at least they seem to understand developer incentives well. At this point, and I can’t believe I am saying this, I am starting to see JavaScript as a better Python than Python.
https://peps.python.org/pep-0675/#appendix-b-limitations