
Escape sequences in Python strings

By Jake Edge
August 7, 2019

A change for Python 3.8 (currently in beta) has produced some user-visible warnings, but the problem is often in code that a user cannot (or should not) change: third-party modules. The problem that the warning is trying to highlight is real, however. The upshot is that the handling of escape sequences (or non-escape sequences, in truth) in Python string literals is in a rather messy state at this point.

Python string literals come in many different forms, but the main ones look like the following:

    a = 'foo'
    b = "bar"
    c = '''
    baz
    '''

The three-quote version allows for simpler multi-line strings and can use three double quotes instead if the programmer wants. But strings can also contain escape sequences, such as '\n' for newlines, '\t' for tabs, and so on. That means the backslash has a special meaning, so it needs to be escaped (i.e. '\\') if it is to be used literally, as well. A few other characters, notably a real newline or an embedded quote of the type used to delimit the string, also need to be backslash escaped.

But what to do about string literals with invalid escape sequences in them? A programmer who has put '\latex' as part of a string literal (to pick a not entirely random example) presumably actually wants '\\latex', which is what Python currently translates it to. Python does emit a DeprecationWarning in that case, but the warning was invisible by default until Python 3.7. However, that same programmer probably does not want '\tan(x)' to turn into a tab plus 'an(x)', but that is exactly what happens.
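
The two cases can be compared side by side. What follows is a minimal sketch that is not part of the original discussion; the evaluate_literal() helper is a hypothetical name, and the sketch assumes an interpreter that still accepts invalid escape sequences with a warning rather than rejecting them. The literals are passed in as raw source text so that the example itself compiles without warnings.

```python
import warnings


def evaluate_literal(source):
    """Evaluate the source text of a string literal, hiding compile-time warnings.

    Hypothetical helper, for illustration only.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return eval(source)


# '\l' is not a recognized escape, so the backslash is kept as written.
latex = evaluate_literal(r"'\latex'")
# '\t' is a recognized escape, so it silently becomes a tab character.
tan = evaluate_literal(r"'\tan(x)'")

print(repr(latex))
print(repr(tan))
```

Both results have the same length, which is part of why the second case is easy to miss: one holds a backslash and an "l" or "t" where the other holds a single control character followed by one fewer letter.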

The change for Python 3.8 is to further elevate the warning to a SyntaxWarning, with plans to turn that into a SyntaxError in Python 3.9. A bug report filed in February 2018 shows the path of the change. But shortly after the Python 3.8 beta releases were made, Raymond Hettinger reported that he was seeing the warnings "pop up from time to time" from various third-party packages. Aaron Meurer concurred with Hettinger and pointed out a number of other problems he had encountered.

Serhiy Storchaka, who authored the patch for 3.8, replied that one of the examples given in the reports was in the PyParsing module, but that it showed a real problem that needed addressing, thus proving the usefulness of the change. A docstring in PyParsing referred to a regular expression, but did not escape the backslashes in the expression. The line in the docstring (delimited by ''') was:

    make_html = Regex(r"(\w+):(.*?):").sub(r"<\1>\2</\1>")

Interestingly, the \1 and \2 were interpreted as octal escapes with the values 1 and 2 respectively, so they were effectively treated as Unicode code points U+0001 and U+0002. But the presence of \w, which is not a legal escape sequence, triggered the warning.
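
The effect is easy to reproduce outside of a docstring. In this sketch the sample input string is an assumption chosen for illustration; the pattern and replacement text come from the docstring line above, applied with the standard re module rather than PyParsing.

```python
import re

pattern = r"(\w+):(.*?):"
sample = "h1:main title:"  # assumed input, for illustration

# Raw replacement: \1 and \2 reach re.sub() as group references.
intended = re.sub(pattern, r"<\1>\2</\1>", sample)

# Non-raw replacement: \1 and \2 are octal escapes, so the string
# already contains U+0001 and U+0002 before re.sub() ever sees it.
mangled = re.sub(pattern, "<\1>\2</\1>", sample)

print(repr(intended))
print(repr(mangled))
```

Note that the non-raw replacement produces no warning at all, since octal escapes are legal; only the \w in the pattern would have drawn attention to the line.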

In mid-July, Hettinger reported a problem he encountered with the Bottle web framework, where, once again, the warning was showing up, but in code his students could not change. He said: "I think it is poor form to bombard end-users with warnings about things they can't fix." Storchaka pointed out that the problem in Bottle resulted from a bad backport of a change made five years ago.

But even though the warning became visible in Python 3.7, it was still difficult for projects to see. The problem is that the warning is emitted when the .py file is compiled into a .pyc file and never shown after that point. For many projects, as Meurer pointed out, that means it is quite difficult to actually test for the presence of the warning, even when the projects are diligently looking for it:

My biggest issue is that the way Python warns about this makes it very difficult for library authors to fix this. Most won't even notice. The problem is the warnings are only shown once, when the file is compiled. So if you run the code in any way that causes Python to compile it, a further run of 'python -Wd' will show nothing. I don't know if it's reasonable, but it would be nice if Python recompiled when given -Wd, or somehow saved the warning so it could be shown if that flag is given later.

As an anecdote, for SymPy's CI, we went through five (if I am counting correctly) iterations of trying to test this. Each of the first four were subtly incorrect, until we finally managed to find the correct one (for reference, 'python -We:invalid -m compileall -f -q module/'). So most library authors who will attempt to add tests against this will get it wrong. Simply adding -Wd as you would expect is wrong. If the code is already compiled (which it probably is, e.g., if you ran setup.py), it won't show the warnings. At the very least the "correct" way to test this should be documented.
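
One way for a test suite to sidestep the cached .pyc problem is to compile the source text directly and record whatever the compiler emits. The following is a minimal sketch, not SymPy's actual approach; escape_warnings() is a hypothetical helper, and it assumes the interpreter reports invalid escapes as either a DeprecationWarning or a SyntaxWarning, which varies by Python version.

```python
import warnings


def escape_warnings(source, filename="<example>"):
    """Compile source text and return warnings recorded during compilation.

    Hypothetical helper: it filters by warning category only, so other
    deprecation or syntax warnings would be returned as well.
    """
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        compile(source, filename, "exec")
    return [
        w for w in caught
        if issubclass(w.category, (DeprecationWarning, SyntaxWarning))
    ]


bad_source = "x = '\\latex'\n"    # source text: x = '\latex'
good_source = "x = r'\\latex'\n"  # source text: x = r'\latex'

print(len(escape_warnings(bad_source)))
print(len(escape_warnings(good_source)))
```

Because compile() is called on the text each time, nothing depends on whether a .pyc file already exists, which is the trap Meurer describes.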

That led Nathaniel Smith to argue that the deprecation cycle was not really followed for these strings. He suggested perhaps storing these kinds of warnings in the .pyc file so that they could be emitted later, and not just at compilation time. Around the same time, Hettinger elevated the visibility of the problem with a post to the python-dev mailing list:

This once seemed like a reasonable and innocuous idea to me; however, I've been using the 3.8 beta heavily for a month and no longer think it is a good idea. The warning crops up frequently, often due to third-party packages (such as docutils and bottle) that users can't easily do anything about. And during live demos and student workshops, it is especially distracting.

I now think our cure is worse than the disease. If code currently has a non-raw string with '\latex', do we really need Python to yelp about it (for 3.8) or reject it entirely (for 3.9)?

He also noted that ASCII art in docstrings was affected and that Python had been living with its current behavior for nearly 30 years without any huge downside. The main problem that the warning for illegal escape sequences is trying to solve involves file names on Windows systems, where the backslash is used as a directory separator, he said. Many of those problems cannot even be caught by the warning; his example was '..\training\new_memo.doc', which would produce a corrupted file name, but no warning. One solution for those types of strings is to use Python's raw strings: r'..\training\new_memo.doc'.
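
Hettinger's file name can be checked directly. This short sketch uses his two literals as written; the variable names are only for illustration.

```python
# Non-raw literal: \t and \n are valid escapes, so no warning is issued,
# but the name now contains a tab and a newline.
corrupted = '..\training\new_memo.doc'

# Raw literal: every backslash is kept as written.
intended = r'..\training\new_memo.doc'

print(repr(corrupted))
print(repr(intended))
print('\t' in corrupted, '\n' in corrupted)
```

The corrupted version contains no backslashes at all, so no later check on the string could recover what the programmer meant.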

But Chris Angelico was concerned that allowing illegal escape sequences to pass silently is problematic. He did a quick survey of some other languages and found that most gave a warning (or, in the case of Lua, an error), but that all of the others just ignored the spurious \ entirely, rather than turning it into \\ as Python does. "IMO Python's solution is better, but Lua's is best. A bad escape is an error."

In general, the consensus in the thread seems to be to slow down the process of turning the warning into something that users see, at least until the developers of modules in the Python Package Index (PyPI) can address the problem. There were lots of suggestions for ways to find these strings automatically in ways that will be seen by the developers, but those will need to wait for the Python 3.9 cycle.

The behavior with escape sequences has other wrinkles, as Steven D'Aprano pointed out. For example, escaped non-delimiting quotes are treated differently than other escape sequences:

    py> "abc \d \' xyz"
    "abc \\d ' xyz"

That seems like a bug to him, though it is documented that way. The current behavior also precludes adding new escape sequences without multi-year deprecation dances. For example, \e is long established as the escape sequence for ESC, but adding it to Python is problematic "because it would break code that relies on '\e' being the same as '\\e'".

So D'Aprano believes a change needs to be made, but that the current pace is too fast: "This isn't a critical problem that needs to be fixed soonest." But Angelico is worried about simply kicking the can down the road. He would like to see some active steps be taken so that developers can see and fix these problems. The more-visible warning added for 3.8 serves that purpose but, as D'Aprano noted, there will likely be some undesirable effects of that, beyond just user unhappiness. Users will file bugs, for example, often without looking to see if there is an existing bug, and they may or may not file the bug in the right place. Furthermore: "The benefit of the desired change is relatively low."

While D'Aprano is generally in favor of giving warnings (and, presumably, eventually errors) for these illegal escape sequences, he was not convinced that it has been adequately justified. He mentioned that it would allow new escape sequences, but did not find that to be of huge benefit. Angelico was adamant, however, that the problem is larger than just blocking new escape sequences:

Python users on Windows get EXTREMELY confused by the way their code worked perfectly with one path, then bizarrely fails with another. That is a very real problem, and the problem is that it appeared to work when actually it was wrong.

Python has a history of fixing these problems. It used to be that b"\x61\x62\x63\x64" was equal to u"abcd", but now Python sees these as fundamentally different. Data-dependent bugs caused by a syntactic oddity are a language flaw that needs to be fixed.

Storchaka has submitted a pull request to revert his change that elevated the DeprecationWarning to a SyntaxWarning. Based on the mood in the thread, one would guess that will be merged, but, if the problem is truly going to be addressed, additional work will be needed. No one has suggested dropping the DeprecationWarning, though some seem to think there is marginal utility to changing the long-established "escape illegal escape sequences" feature. But it is clear that doing so without adequate warning causes problems of a different—and more user-visible—sort.





Escape sequences in Python strings

Posted Aug 7, 2019 22:12 UTC (Wed) by rillian (subscriber, #11344) [Link]

Nice article. I ran into this recently, complicated by the re module in python2 not expanding unicode escape sequences, making it difficult to use the raw string solution for some patterns.

At least the compile-time warning generation explains why it only showed up in pytest runs.

Escape sequences in Python strings

Posted Aug 8, 2019 10:10 UTC (Thu) by NAR (subscriber, #1313) [Link] (11 responses)

> A programmer who has put '\latex' as part of a string literal (to pick a not entirely random example) presumably actually wants '\\latex', which is what Python currently translates it to.

It was a really bad idea for Python to think it knows better than the programmer who writes the actual code. Python has no idea what the specifications are, so it hasn't got a chance to "know better". I try to respect the "be polite, respectful, and informative" guideline, so I won't judge this decision further, but if they think this could be a problem, some lint-like tool could warn about a "suspicious escape sequence". I really can't grasp why they thought that when the programmer writes "\latex", they know better what he wants, but when he writes "\textrm", they don't know better...

I also don't quite understand how Windows-style paths end up in string literals - they seem very non-portable and even if portability is not a concern for the author, some kind of filename join function to generate the path seems to be best practice.

Escape sequences in Python strings

Posted Aug 8, 2019 12:25 UTC (Thu) by rschroev (subscriber, #4164) [Link] (6 responses)

> I also don't quite understand how Windows-style paths end up in string literals - they seem very non-portable and even if portability is not a concern for the author, some kind of filename join function to generate the path seems to be best practice.

It seems it takes a kind of thoroughness that not all programmers have to write os.path.join(datadir, "assets", "images", "background.jpg") instead of os.path.join(datadir, "assets\images\background.jpg") or even datadir + "\assets\images\background.jpg".

It's not difficult to get it right (and it's become easier still since the introduction of pathlib in the standard library), but you have to know it, think of it, and apply it. Sadly that seems a step too far for too many people.
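
As a rough illustration of the two approaches mentioned here, the sketch below builds the same path with a join function and with pathlib. The datadir value is a hypothetical directory, and ntpath and PureWindowsPath are used (rather than os.path and Path) only so that the result is the same on any platform.

```python
import ntpath
from pathlib import PureWindowsPath

datadir = r"C:\data"  # hypothetical directory, for illustration

# Separate components: no backslashes appear in any literal.
joined = ntpath.join(datadir, "assets", "images", "background.jpg")
as_path = PureWindowsPath(datadir) / "assets" / "images" / "background.jpg"

print(joined)
print(as_path)

# By contrast, "\b" is a valid escape (backspace), so a literal that
# starts with "\background.jpg" loses its separator without any warning.
print(repr("\background.jpg"[0]))
```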

Escape sequences in Python strings

Posted Aug 8, 2019 18:08 UTC (Thu) by ScottMinster (subscriber, #67541) [Link] (5 responses)

> It seems it takes a kind of thoroughness that not all programmers have to write os.path.join(datadir, "assets", "images", "background.jpg") instead of os.path.join(datadir, "assets\images\background.jpg") or even datadir + "\assets\images\background.jpg".

Since Windows works with '/' as a directory separator just as well as '\', those programmers could do os.path.join(datadir, "assets/images/background.jpg") . People seem to go through a lot of hoops to use a backslash on Windows, but other than "\\servername" type usage, I don't think it is ever needed there.

OTOH, seeing a backslash'd path in a source file does let me know to be more cautious with the code contained elsewhere in it :)

Escape sequences in Python strings

Posted Aug 8, 2019 18:09 UTC (Thu) by mpr22 (subscriber, #60784) [Link]

Thank you for saying what I was thinking more eloquently and politely than I could manage :)

Escape sequences in Python strings

Posted Aug 8, 2019 18:43 UTC (Thu) by rschroev (subscriber, #4164) [Link]

> Since Windows works with '/' as a directory separator just as well as '\', those programmers could do os.path.join(datadir, "assets/images/background.jpg") . People seem to go through a lot of hoops to use a backslash on Windows, but other than "\\servername" type usage, I don't think it is ever needed there.

Even //servername/sharename works with forward slashes; basically the whole win32 api accepts them.

Where forward slashes don't work is in cmd.exe, so something like os.system("dir C:/") won't work, and neither will subprocess.run(["dir", "c:/"], shell=True). That's not a big loss: except in very special use cases, there is always a better way to do it.

Escape sequences in Python strings

Posted Aug 10, 2019 7:02 UTC (Sat) by k8to (guest, #15413) [Link] (2 responses)

Although / often works as a path separator on windows, it doesn't always. Relying on it mostly-working is a bad practice.

There are a variety of edge cases where things will start failing when you use this practice. IIRC, one example is when you end up having to use the long pathname cookie \\?\ then the codepaths seem less tolerant of forward slashes.

Escape sequences in Python strings

Posted Aug 10, 2019 17:19 UTC (Sat) by lsl (subscriber, #86508) [Link] (1 responses)

The os.path.join(datadir, "assets/images/background.jpg") mentioned above could emit the backslash-using question-marky variant as required (e.g. when the full path reaches a certain length) without users ever having to feed a backslash into it, no?

IIRC, this is also what the Go standard library's filepath package does.

Escape sequences in Python strings

Posted Aug 10, 2019 17:23 UTC (Sat) by k8to (guest, #15413) [Link]

When using the cookie, you need all backslashes. If you're using fixed forward slashes in your strings, you won't get that.

Escape sequences in Python strings

Posted Aug 10, 2019 6:59 UTC (Sat) by k8to (guest, #15413) [Link] (3 responses)

This pattern repeats over and over again. It's not really special to python.

If you have an escaping system where \t is a tab and \\ is a backslash, then \l where \l is an unknown escape sequence should be an error. Making it equivalent to \\l is super super confusing, leads to errors, ratholes of unfixable problems, etc. It's just a bad idea. When implementing escaping systems, all escaped sequences should be defined or errors.

Escape sequences in Python strings

Posted Aug 10, 2019 16:25 UTC (Sat) by rschroev (subscriber, #4164) [Link] (2 responses)

It's one more indication that Postel's law aka the Robustness principle, despite best intentions, doesn't lead to robust systems. Instead it leads to sloppy coding which will come back and bite you in the ass.

Escape sequences in Python strings

Posted Aug 10, 2019 17:24 UTC (Sat) by k8to (guest, #15413) [Link]

I don't think Postel even would have advocated for being permissive for local data.

Escape sequences in Python strings

Posted Aug 15, 2019 8:06 UTC (Thu) by oldtomas (guest, #72579) [Link]

I think this is a strawman. Postel's law requires judgement: surprising behaviour isn't "fair game", i.e. if there isn't an "obvious" or "canonical" way to resolve an ambiguity, you better not try.

In the current case, the "canonical" way would be "what C does": "\c", for example, maps to "c", not "\\c", as Python would do.

It's the same as what Perl does. But... not what bash does (this is more complicated, since you need an explicit step for backslash sequences to be interpreted -- the other named candidates interpret them at read time).

Given that situation, there are IMHO only two reasonable choices: "do as C" or "do as Lua".

Escape sequences in Python strings

Posted Aug 8, 2019 15:15 UTC (Thu) by bkw1a (subscriber, #4101) [Link] (8 responses)

I'm not a python programmer, so apologies if this is a dumb question, but why doesn't python differentiate between single quotes and double quotes, like perl does? In perl, these two things behave differently:

    print '\textrm'."\n";
    print "\textrm"."\n";

The first prints literally \textrm, and the second prints a tab followed by extrm.

Escape sequences in Python strings

Posted Aug 8, 2019 19:16 UTC (Thu) by brouhaha (subscriber, #1698) [Link] (6 responses)

Python uses rstrings (raw strings) for that, and the behavior is unaffected by whether single or double quotes are used.

    print('\textrm')

will print a tab followed by extrm, while

    print(r'\textrm')

will print \textrm

Back when Python was first created, it could have been argued that following the sh and/or Perl conventions for escaping inside single vs double quotes might have been reasonable, but it's far too late to introduce such a change now.

Escape sequences in Python strings

Posted Aug 8, 2019 20:04 UTC (Thu) by flussence (guest, #85566) [Link] (5 responses)

It's a simple difference in philosophy. Python's is “there should be only one obvious way to do it”, which is why the behaviour of strings is determined by a prefix character.

Perl has prefixes too (q qq qx qr) but also has the shell quote syntax, which doesn't lend itself well to maintainable code when you start mixing single/double quotes.

Escape sequences in Python strings

Posted Aug 9, 2019 3:59 UTC (Fri) by da4089 (subscriber, #1195) [Link] (4 responses)

In my experience, supporting both single and double quotes for string literals is a high-profile violation of the "there should be only one obvious way to do it" philosophy.

Regardless of the convenience of "'" or '"', the mental load of choosing a quote type for every string is real: I'm frequently looking at the rest of the file to check what the usual convention is in this specific code, or going back and converting a bunch of literals that I've written using the wrong quotes.

As an unrepentant C programmer, I'm inclined to use single quotes for characters and doubles for strings, and that just creates even more pain.

Maybe deprecating one or the other could be a Python4 feature? /jk, kinda

Escape sequences in Python strings

Posted Aug 10, 2019 0:47 UTC (Sat) by mina86 (guest, #68442) [Link] (3 responses)

Simply use single quotes each time and the cognitive load is gone. As an added bonus, single quote is easier to type.

If you're willing to tolerate minor cognitive load, use double quotes if string contains single but not double quotes.

Finally, since Python doesn't have a character type, distinguishing between them and longer strings is counterproductive. Python is not C is not Algol.

Escape sequences in Python strings

Posted Aug 12, 2019 14:02 UTC (Mon) by Bluehorn (subscriber, #17484) [Link] (2 responses)

> As an added bonus, single quote is easier to type.

On an American keyboard, maybe. On a German qwertz keyboard, the double quote is above the digit 2 in the top row and the single quote is next to the big enter key. Which of course has (or maybe had) a different form on many keyboards which is why I prefer to use double quotes wherever possible.

Things are seldom simple in IT :-(

Greetings, Torsten

Escape sequences in Python strings

Posted Aug 12, 2019 14:19 UTC (Mon) by mina86 (guest, #68442) [Link]

No need to press shift. I still call it simpler to type. But I guess your mileage may vary.

I’m using Programmer Dvorak anyway which has ‘ü’ easily accessible and that’s pretty much all German characters I need to type. ;)

Escape sequences in Python strings

Posted Aug 12, 2019 14:43 UTC (Mon) by mbunkus (subscriber, #87248) [Link]

Back in the day when I was writing a lot of LaTeX I had to switch from my native German keyboard layout to the English one. All of the important LaTeX characters such as the backslash and both pairs of curly & square brackets require an AltGr (right alt for all of you who don't know German keyboard layout) combination to type. After several years this caused a lot of pain both in my right thumb and my right wrist. I've continued using English layout (with custom bindings for German Umlaute & the ß) to this day, nearly twenty years later, and not only is LaTeX much easier on the joints, pretty much all of the programming languages I use regularly are.

If all you've ever done is use an English layout, count yourself lucky. A lot of us international users have it quite a bit harder.

Escape sequences in Python strings

Posted Aug 10, 2019 7:05 UTC (Sat) by k8to (guest, #15413) [Link]

I think the "why" is that it felt simpler as a teaching language when it was ABC.

In practice, I find it convenient to be able to do "Don't worry about embedded single quotes" and 'This example includes "Quotation Marks"' and have them both scan easily in source without escaping junk.

It does create a learning speedbump, I suppose, for those coming from other languages.


Copyright © 2019, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds