Escape sequences in Python strings
A change for Python 3.8 (currently in beta) has produced some user-visible warnings, but the problem is often in code that a user cannot (or should not) change: third-party modules. The problem that the warning is trying to highlight is real, however. The upshot is that the handling of escape sequences (or, in truth, non-escape sequences) in Python string literals is in a rather messy state at this point.
Python string literals come in many different forms, but the main ones look like the following:
    a = 'foo'
    b = "bar"
    c = '''
    baz
    '''
The three-quote version allows for simpler multi-line strings and can use three double quotes instead if the programmer wants. But strings can also contain escape sequences, such as '\n' for newlines, '\t' for tabs, and so on. That means the backslash has a special meaning, so it needs to be escaped (i.e. '\\') if it is to be used literally, as well. A few other characters, notably a real newline or an embedded quote of the type used to delimit the string, also need to be backslash escaped.
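As a small illustration of the escapes just described (the variable names exist only for this sketch), each escape sequence produces a single character in the resulting string:

```python
# Each escape sequence below stands for a single character.
newline = '\n'        # one newline character
tab = '\t'            # one tab character
backslash = '\\'      # one literal backslash
quoted = 'it\'s'      # an embedded quote of the delimiting type

# Single, double, and triple quotes all build the same kind of object.
same = 'foo' == "foo" == '''foo''' == """foo"""

print(len(newline), len(tab), len(backslash), quoted, same)
```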
But what to do about string literals with invalid escape sequences in them? A programmer who has put '\latex' as part of a string literal (to pick a not entirely random example) presumably actually wants '\\latex', which is what Python currently translates it to. Python does emit a DeprecationWarning in that case, but the warning was invisible by default until Python 3.7. However, that same programmer probably does not want '\tan(x)' to turn into a tab plus 'an(x)', but that is exactly what happens.
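The difference between those two cases can be observed by compiling the literals from source text and recording what the compiler says. This is only a sketch: evaluate_literal() is a helper invented for this example, and on an interpreter where invalid escapes have become hard errors, compile() would raise a SyntaxError instead of warning.

```python
import warnings


def evaluate_literal(source):
    """Compile and evaluate source text containing one string literal.

    Returns the resulting string and the names of any warnings that the
    compiler emitted along the way.
    """
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        value = eval(compile(source, "<example>", "eval"))
    return value, [w.category.__name__ for w in caught]


# Raw strings keep the example source itself free of escape processing.
latex_value, latex_warnings = evaluate_literal(r"'\latex'")
tan_value, tan_warnings = evaluate_literal(r"'\tan(x)'")

print(repr(latex_value), latex_warnings)  # backslash kept, warning issued
print(repr(tan_value), tan_warnings)      # tab substituted, no warning
```

The category name that is reported depends on the Python version in use.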
The change for Python 3.8 is to further elevate the warning to a SyntaxWarning, with plans to turn that into a SyntaxError in Python 3.9. A bug report filed in February 2018 shows the path of the change. But shortly after the Python 3.8 beta releases were made, Raymond Hettinger reported that he was seeing the warnings "pop up from time to time" from various third-party packages. Aaron Meurer concurred with Hettinger and pointed out a number of other problems he had encountered.
Serhiy Storchaka, who authored the patch for 3.8, replied that one of the examples given in the reports was in the PyParsing module, but that it showed a real problem that needed addressing, thus proving the usefulness of the change. A docstring in PyParsing referred to a regular expression, but did not escape the backslashes in the expression. The line in the docstring (delimited by ''') was:
    make_html = Regex(r"(\w+):(.*?):").sub(r"<\1>\2</\1>")
Interestingly, the \1 and \2 were interpreted as octal escapes with the values of 1 and 2 respectively, so they were effectively treated as Unicode code points U+0001 and U+0002. But the presence of \w, which is not a legal escape sequence, triggered the warning.
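A sketch of what happens to that line inside a plain ''' docstring, compared with a raw one, follows. The source text is compiled at run time so that the example itself stays warning-free. The variable names are assumptions made for illustration and are not taken from PyParsing.

```python
import warnings

# The docstring line as source text; the r prefix on this outer string only
# protects the example, while the literal being compiled is a plain ''' string.
plain_source = r"""'''make_html = Regex(r"(\w+):(.*?):").sub(r"<\1>\2</\1>")'''"""
raw_source = "r" + plain_source  # the same literal with a raw-string prefix


def evaluate(source):
    """Return the evaluated string and the number of compiler warnings."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        value = eval(compile(source, "<docstring>", "eval"))
    return value, len(caught)


plain_value, plain_warning_count = evaluate(plain_source)
raw_value, raw_warning_count = evaluate(raw_source)

print("\x01" in plain_value, "\x02" in plain_value)  # octal escapes applied
print(raw_value == plain_source[3:-3])               # raw text is unchanged
print(plain_warning_count, raw_warning_count)
```

Marking the docstring as a raw string is one way to keep the regular expression intact. Doubling each backslash would be another.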
In mid-July, Hettinger reported a problem he encountered with the Bottle web framework, where, once again, the warning was showing up, but in code his students could not change. He said: "I think it is poor form to bombard end-users with warnings about things they can't fix." Storchaka pointed out that the problem in Bottle resulted from a bad backport of a change made five years ago.
But even though the warning became visible in Python 3.7, it was still difficult for projects to see. The problem is that the warning is emitted when the .py file is compiled into a .pyc file and never shown after that point. For many projects, as Meurer pointed out, that means it is quite difficult to actually test for the presence of the warning, even when the projects are diligently looking for it:
As an anecdote, for SymPy's CI, we went through five (if I am counting correctly) iterations of trying to test this. Each of the first four were subtly incorrect, until we finally managed to find the correct one (for reference, 'python -We:invalid -m compileall -f -q module/'). So most library authors who will attempt to add tests against this will get it wrong. Simply adding -Wd as you would expect is wrong. If the code is already compiled (which it probably is, e.g., if you ran setup.py), it won't show the warnings. At the very least the "correct" way to test this should be documented.
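One way to avoid the stale-.pyc trap that Meurer describes is to compile the source text directly inside a test. The following is a rough sketch under that assumption, not the SymPy project's actual check. find_compile_warnings() is a name made up here, and it reports every compile-time DeprecationWarning or SyntaxWarning, not only those for invalid escapes.

```python
import os
import warnings


def find_compile_warnings(root):
    """Compile every .py file under root and report compile-time warnings.

    Compiling from source each time sidesteps cached .pyc files, which
    would otherwise hide the warning.
    """
    problems = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                source = handle.read()
            with warnings.catch_warnings(record=True) as caught:
                warnings.simplefilter("always")
                try:
                    compile(source, path, "exec")
                except SyntaxError as exc:
                    # Covers interpreters that reject the file outright.
                    problems.append((path, exc.lineno, exc.msg))
            for warning in caught:
                if issubclass(warning.category,
                              (DeprecationWarning, SyntaxWarning)):
                    problems.append(
                        (path, warning.lineno, str(warning.message)))
    return problems
```

A test suite could call find_compile_warnings() on the package directory and fail if the returned list is not empty.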
That led Nathaniel Smith to argue that the deprecation cycle was not really followed for these strings. He suggested perhaps storing these kinds of warnings in the .pyc file so that they could be emitted later, and not just at compilation time. Around the same time, Hettinger elevated the visibility of the problem with a post to the python-dev mailing list:
I now think our cure is worse than the disease. If code currently has a non-raw string with '\latex', do we really need Python to yelp about it (for 3.8) or reject it entirely (for 3.9)?
He also noted that ASCII art in docstrings was affected and that Python had been living with its current behavior for nearly 30 years without any huge downside. The main problem that the warning for illegal escape sequences is trying to solve involves file names on Windows systems, where the backslash is used as a directory separator, he said. Many of those problems cannot even be caught by the warning; his example was '..\training\new_memo.doc', which would produce a corrupted file name, but no warning. One solution for those types of strings is to use Python's raw strings: r'..\training\new_memo.doc'.
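Hettinger's file-name example can be tried directly, since both literals compile cleanly. The variable names here are only for illustration:

```python
# \t and \n are valid escapes, so this literal compiles without any warning.
corrupted = '..\training\new_memo.doc'

# The raw-string prefix leaves every backslash in place.
intended = r'..\training\new_memo.doc'

print(repr(corrupted))
print(repr(intended))
print('\t' in corrupted, '\n' in corrupted, intended.count('\\'))
```

The first string silently contains a tab and a newline where the directory separators were meant to be, which is why no warning about invalid escapes can ever catch it.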
But Chris Angelico was concerned that allowing illegal escape sequences to pass silently is problematic. He did a quick survey of some other languages and found that most gave a warning (or, in the case of Lua, an error), but that all of the others just ignored the spurious \ entirely, rather than turning it into \\ as Python does. "IMO Python's solution is better, but Lua's is best. A bad escape is an error."
In general, the consensus in the thread seems to be to slow down the process of turning the warning into something that users see, at least until the developers of modules in the Python Package Index (PyPI) can address the problem. There were lots of suggestions for ways to find these strings automatically in ways that will be seen by the developers, but those will need to wait for the Python 3.9 cycle.
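The three treatments of an unknown escape that came up in Angelico's survey (keep the backslash, drop it, or raise an error) can be compared with a toy model. This is a deliberately simplified sketch: unescape(), its policy names, and its small table of known escapes are all invented here and do not reproduce any real language's lexer.

```python
# A tiny table of recognized escapes; real languages define many more.
KNOWN = {"n": "\n", "t": "\t", "\\": "\\", "'": "'", '"': '"'}


def unescape(text, policy):
    """Process backslash escapes in text.

    policy selects what happens to an unknown escape such as \\l:
    "keep" leaves the backslash in place (Python's behavior),
    "drop" discards the backslash, and "error" rejects the input.
    """
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\\" or i + 1 == len(text):
            out.append(ch)
            i += 1
            continue
        nxt = text[i + 1]
        if nxt in KNOWN:
            out.append(KNOWN[nxt])
        elif policy == "keep":
            out.append("\\" + nxt)
        elif policy == "drop":
            out.append(nxt)
        elif policy == "error":
            raise ValueError(f"invalid escape sequence \\{nxt}")
        else:
            raise ValueError(f"unknown policy {policy!r}")
        i += 2
    return "".join(out)


print(repr(unescape(r"\latex", "keep")))
print(repr(unescape(r"\latex", "drop")))
print(repr(unescape(r"\tan(x)", "keep")))
```

Only the "error" policy forces the author to notice the stray backslash, while the other two silently produce different strings from the same input.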
The behavior with escape sequences has other wrinkles, as Steven D'Aprano pointed out. For example, escaped non-delimiting quotes are treated differently than other escape sequences:
    py> "abc \d \' xyz"
    "abc \\d ' xyz"
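That interactive session can be reproduced from a script by compiling the literal at run time, as in this sketch (the variable names are made up for the example):

```python
import warnings

# The literal from the session, held as raw source text.
source = r'''"abc \d \' xyz"'''

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the warning for \d
    value = eval(compile(source, "<example>", "eval"))

print(value)        # abc \d ' xyz
print(repr(value))
```

The backslash before d survives, while the one before the single quote is consumed even though that quote did not need escaping inside a double-quoted string.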
He added that the current behavior makes it hard to add new escape sequences to the language "because it would break code that relies on '\e' being the same as '\\e'". So D'Aprano believes a change needs to be made, but that the current pace is too fast: "This isn't a critical problem that needs to be fixed soonest." But Angelico is worried about simply kicking the can down the road. He would like to see some active steps be taken so that developers can see and fix these problems. The more-visible warning added for 3.8 serves that purpose but, as D'Aprano noted, there will likely be some undesirable effects of that, beyond just user unhappiness. Users will file bugs, for example, often without looking to see if there is an existing bug, and they may or may not file the bug in the right place. Furthermore: "The benefit of the desired change is relatively low."
While D'Aprano is generally in favor of giving warnings (and, presumably, eventually errors) for these illegal escape sequences, he was not convinced that it has been adequately justified. He mentioned that it would allow new escape sequences, but did not find that to be of huge benefit. Angelico was adamant, however, that the problem is larger than just blocking new escape sequences:
Python has a history of fixing these problems. It used to be that b"\x61\x62\x63\x64" was equal to u"abcd", but now Python sees these as fundamentally different. Data-dependent bugs caused by a syntactic oddity are a language flaw that needs to be fixed.
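The bytes-versus-text separation that Angelico refers to is easy to see in Python 3. The snippet below is a minimal sketch with an illustrative variable name:

```python
raw = b"\x61\x62\x63\x64"

print(raw == b"abcd")                   # True
print(type(raw).__name__)               # bytes
print(isinstance(raw, str))             # False
print(raw.decode("ascii") == "abcd")    # True
```

The bytes object only becomes text through an explicit decode step.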
Storchaka has submitted a pull request to revert his change that elevated the DeprecationWarning to a SyntaxWarning. Based on the mood in the thread, one would guess that will be merged, but, if the problem is truly going to be addressed, additional work will be needed. No one has suggested dropping the DeprecationWarning, though some seem to think there is marginal utility to changing the long-established "escape illegal escape sequences" feature. But it is clear that doing so without adequate warning causes problems of a different—and more user-visible—sort.
Posted Aug 7, 2019 22:12 UTC (Wed)
by rillian (subscriber, #11344)
[Link]
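Nice article. I ran into this recently, complicated by the re module in python2 not expanding unicode escape sequences, making it difficult to use the raw string solution for some patterns. At least the compile-time warning generation explains why it only showed up in pytest runs.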
Posted Aug 8, 2019 10:10 UTC (Thu)
by NAR (subscriber, #1313)
[Link] (11 responses)
This was a really bad idea that Python thinks it knows better than the programmer who writes the actual code. Python has no idea what the specifications are, so hasn't got a chance to "know better". I try to respect the "be polite, respectful, and informative", so I don't judge this decision further - but if they think this could be a problem, some lint-like tool could warn for "suspicious escape sequence". I really can't grasp why did they think that when the programmer writes "\latex", they know better what he wants, but when he writes "\textrm", they don't know better...
I also don't quite understand how Windows-style paths end up in string literals - they seem very non-portable and even if portability is not a concern for the author, some kind of filename join function to generate the path seems to be best practice.
Posted Aug 8, 2019 12:25 UTC (Thu)
by rschroev (subscriber, #4164)
[Link] (6 responses)
It seems it takes a kind of thoroughness that not all programmers have to write os.path.join(datadir, "assets", "images", "background.jpg") instead of os.path.join(datadir, "assets\images\background.jpg") or even datadir + "\assets\images\background.jpg".
It's not difficult to get it right (and it's become easier still since the introduction of pathlib in the standard library), but you have to know it, think of it, and apply it. Sadly that seems a step too far for too many people.
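A minimal pathlib sketch of the approach described above follows. The data directories "C:/data" and "/srv/data" are made up for this example, and the Pure* classes are used so that it prints the same thing on any platform.

```python
from pathlib import PurePosixPath, PureWindowsPath

# "C:/data" and "/srv/data" are made-up data directories for this sketch.
win = PureWindowsPath("C:/data", "assets", "images", "background.jpg")
posix = PurePosixPath("/srv/data") / "assets" / "images" / "background.jpg"

print(win)             # C:\data\assets\images\background.jpg
print(win.as_posix())  # C:/data/assets/images/background.jpg
print(posix)           # /srv/data/assets/images/background.jpg
```

No backslash ever appears in the source, so there is nothing for the escape-sequence rules to misinterpret.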
Posted Aug 8, 2019 18:08 UTC (Thu)
by ScottMinster (subscriber, #67541)
[Link] (5 responses)
Since Windows works with '/' as a directory separator just as well as '\', those programmers could do os.path.join(datadir, "assets/images/background.jpg") . People seem to go through a lot of hoops to use a backslash on Windows, but other than "\\servername" type usage, I don't think it is ever needed there.
OTOH, seeing a backslash'd path in a source file does let me know to be more cautious with the code contained elsewhere in it :)
Posted Aug 8, 2019 18:09 UTC (Thu)
by mpr22 (subscriber, #60784)
[Link]
Posted Aug 8, 2019 18:43 UTC (Thu)
by rschroev (subscriber, #4164)
[Link]
Even //servername/sharename works with forward slashes; basically the whole win32 api accepts them.
Where forward slashes don't work is in cmd.exe, so something like os.system("dir C:/") won't work, and neither will subprocess.run(["dir", "c:/"], shell=True). That's not a big loss: except in very special use cases, there is always a better way to do it.
Posted Aug 10, 2019 7:02 UTC (Sat)
by k8to (guest, #15413)
[Link] (2 responses)
There are a variety of edge cases where things will start failing when you use this practice. IIRC, one example is when you end up having to use the long pathname cookie \\?\ then the codepaths seem less tolerant of forward slashes.
Posted Aug 10, 2019 17:19 UTC (Sat)
by lsl (subscriber, #86508)
[Link] (1 responses)
IIRC, this is also what the Go standard library's filepath package does.
Posted Aug 10, 2019 17:23 UTC (Sat)
by k8to (guest, #15413)
[Link]
Posted Aug 10, 2019 6:59 UTC (Sat)
by k8to (guest, #15413)
[Link] (3 responses)
If you have an escaping system where \t is a tab and \\ is a backslash, then \l where \l is an unknown escape sequence should be an error. Making it equivalent to \\l is super super confusing, leads to errors, ratholes of unfixable problems, etc. It's just a bad idea. When implementing escaping systems, all escaped sequences should be defined or errors.
Posted Aug 10, 2019 16:25 UTC (Sat)
by rschroev (subscriber, #4164)
[Link] (2 responses)
Posted Aug 10, 2019 17:24 UTC (Sat)
by k8to (guest, #15413)
[Link]
Posted Aug 15, 2019 8:06 UTC (Thu)
by oldtomas (guest, #72579)
[Link]
In the current case, the "canonical" way would be "what C does": "\c", for example, maps to "c", not "\\c", as Python would do.
It's the same as what Perl does. But... not what bash does (this is more complicated, since you need an explicit step for backslash sequences to be interpreted -- the other named candidates interpret them at read time).
Given that situation, there are IMHO only two reasonable choices: "do as C" or "do as Lua".
Posted Aug 8, 2019 15:15 UTC (Thu)
by bkw1a (subscriber, #4101)
[Link] (8 responses)
print '\textrm'."\n";
print "\textrm"."\n";
The first prints literally \textrm, and the second prints a tab followed by extrm.
Posted Aug 8, 2019 19:16 UTC (Thu)
by brouhaha (subscriber, #1698)
[Link] (6 responses)
print('\textrm')
will print a tab followed by extrm, while
print(r'\textrm')
will print \textrm
Back when Python was first created, it could have been argued that following the sh and/or Perl conventions for escaping inside single vs double quotes might have been reasonable, but it's far too late to introduce such a change now.
Posted Aug 8, 2019 20:04 UTC (Thu)
by flussence (guest, #85566)
[Link] (5 responses)
Perl has prefixes too (q qq qx qr) but also has the shell quote syntax, which doesn't lend itself well to maintainable code when you start mixing single/double quotes.
Posted Aug 9, 2019 3:59 UTC (Fri)
by da4089 (subscriber, #1195)
[Link] (4 responses)
Regardless of the convenience of "'" or '"', the mental load of choosing a quote type for every string is real: I'm frequently looking at the rest of the file to check what the usual convention is in this specific code, or going back and converting a bunch of literals that I've written using the wrong quotes.
As an unrepentant C programmer, I'm inclined to use single quotes for characters and doubles for strings, and that just creates even more pain.
Maybe deprecating one or the other could be a Python4 feature? /jk, kinda
Posted Aug 10, 2019 0:47 UTC (Sat)
by mina86 (guest, #68442)
[Link] (3 responses)
If you're willing to tolerate minor cognitive load, use double quotes if string contains single but not double quotes.
Finally, since Python doesn't have a character type, distinguishing between them and longer strings is counterproductive. Python is not C is not Algol.
Posted Aug 12, 2019 14:02 UTC (Mon)
by Bluehorn (subscriber, #17484)
[Link] (2 responses)
On an American keyboard, maybe. On a German qwertz keyboard, the double quote is above the digit 2 in the top row and the single quote is next to the big enter key. Which of course has (or maybe had) a different form on many keyboards which is why I prefer to use double quotes wherever possible.
Things are seldom simple in IT :-(
Greetings, Torsten
Posted Aug 12, 2019 14:19 UTC (Mon)
by mina86 (guest, #68442)
[Link]
I’m using Programmer Dvorak anyway which has ‘ü’ easily accessible and that’s pretty much all German characters I need to type. ;)
Posted Aug 12, 2019 14:43 UTC (Mon)
by mbunkus (subscriber, #87248)
[Link]
If all you've ever done is using an English layout, count yourself lucky. A lot of us international users have it quite a bit harder.
Posted Aug 10, 2019 7:05 UTC (Sat)
by k8to (guest, #15413)
[Link]
In practice, I find it convenient to be able to do "Don't worry about embedded single quotes" and 'This example includes "Quotation Marks"' and have them both scan easily in source without escaping junk.
It does create a learning speedbump, I suppose, for those coming from other languages.