Rationalizing Python's APIs
CPython is the reference implementation of Python, so it is, unsurprisingly, the target for various language-extension modules. But the API and ABI it provides to those extensions ends up limiting what alternative Python implementations—and even CPython itself—can do, since those interfaces must continue to be supported. Beyond that, though, the interfaces are not clearly delineated, so changes can unexpectedly affect extensions that have come to depend on them. A recent thread on the python-ideas mailing list looks at how to clean that situation up.
On July 11, Victor Stinner floated a draft of an as-yet unnumbered Python Enhancement Proposal (PEP) entitled "Hide implementation details in the C API". The idea is to remove CPython implementation choices from the API so that different experimental choices can be made while still supporting the C-based extensions (NumPy and SciPy in particular). As he noted, other attempts to provide an alternate Python implementation (e.g. PyPy), which are typically created to enhance the language's performance, have largely run aground because they cannot directly support these all-important extensions.
In the draft, he mentioned a few possible options that could be tried if the C API was modified, including switching to indirect reference counts, removing the global interpreter lock (GIL), or changing the garbage-collection scheme. He described some of the history of Python forks and alternate implementations, many of which were blocked by the C API exposing too much of CPython's internals. The pre-PEP then went on to list some concrete steps toward splitting and rationalizing the Python C API.
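To see why the C API gets in the way of those experiments, it helps to look at what today's public headers expose. The snippet below is a simplified sketch of the kind of definitions found in Include/object.h (not a verbatim copy): because Py_INCREF is a macro that pokes directly at the object struct, every compiled extension hard-codes the assumption that a reference count is stored inline at a fixed offset.

    #include <stddef.h>

    typedef ptrdiff_t Py_ssize_t;        /* simplified; see pyport.h for the real definition */

    /* Simplified object header: the layout itself is part of the public API. */
    typedef struct _object {
        Py_ssize_t ob_refcnt;            /* reference count lives in the object */
        struct _typeobject *ob_type;     /* pointer to the object's type */
    } PyObject;

    /* Taking a reference is a direct field increment, not a function call, so
     * indirect reference counts or a tracing GC cannot be swapped in without
     * recompiling (or breaking) every extension built against this header. */
    #define Py_INCREF(op) (((PyObject *)(op))->ob_refcnt++)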
To start, the Include/ directory for CPython would be split into three, one for each API: python for the existing C API, core for the internal API for CPython, and stable for the existing stable ABI (that extensions can rely on staying unchanged, though it is not really used, according to Stinner). Next up, the packaging tools would get an option for extensions to choose the API to use when they are built.
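In rough terms (the directory names come from the draft; the annotations are paraphrased), the header tree would end up looking something like this:

    Include/
        python/    the existing public C API, the default for extensions
        core/      CPython-internal declarations, used to build the interpreter itself
        stable/    the existing stable ABI (the Py_LIMITED_API declarations)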
The final three steps would slowly move implementation details out of python, while still ensuring that extensions will build and function. That will require something of an iterative approach: alternately removing things from python and fixing the extensions. Eventually, the new restricted python API would be the default for all extensions. He also included an alternate path: leave the existing core API as the default, but provide an alternate API as an option at build time. That would mean that two Python binaries would be distributed for each release, one using the compatible API and another that would be faster but not compatible with all existing extensions.
Stinner prefaced his draft with some performance-related justifications, including a link to coverage of his 2017 Python Language Summit session. He is concerned about Python's performance and believes that the C API blocks various optimizations that might be applied to speed it up. He said:
Nick Coghlan took issue with the use of "needed" with regard to performance improvements. He suggested that the status quo is a result of people not recognizing one of the best ways to increase the performance of a Python application: rewriting the performance-critical pieces in another language. "[...] So folks mistakenly think they need to rewrite their whole application in something else, rather than just selectively replacing key pieces of it." He pointed to Cython (which is used in parts of SciPy and elsewhere) as a known way to get C-level performance from Python. So there are differences of opinion about how necessary these potential performance enhancements are, he said.
However, the reorganization of the API to more clearly specify what is (and is not) an external interface is "an admirable goal", Coghlan said, which will allow more experimentation as long as there is no "hard compatibility break". The C API has "enabled the Python ecosystem to become the powerhouse that it is", but it is difficult to maintain consistently. He continued:
Yes, Python is a nice language to program in, and it's great that we can get jobs where we can get paid to program in it. That doesn't mean we have to treat it as an existential threat that we aren't always going to be the best choice for everything :)
There was general agreement in the thread that reorganizing the header files and API was beneficial. Eric Snow pointed to some work he has done to consolidate the global variables in CPython into a single structure; it could perhaps be used as a starting point for the core API work that Stinner described. Barry Scott, who created PyCXX for writing Python extensions in C++, also liked the idea; he suggested adding a PyCXX-based extension into Stinner's testing regime.
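As a purely illustrative sketch of what consolidating those globals involves (the names below are invented, not taken from Snow's work): scattered file-scope variables become fields in a single state structure, which a future runtime could instantiate per interpreter or replace for experiments.

    #include <Python.h>

    /* Before: assorted process-wide globals scattered across source files. */
    static long eval_breaker_pending;
    static PyObject *warn_filters;

    /* After (hypothetical): the same state gathered into one structure. */
    typedef struct _interpreter_globals {
        long eval_breaker_pending;
        PyObject *warn_filters;
    } _interpreter_globals;

    static _interpreter_globals globals_state;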
Coghlan posted again, this time looking at more of the details in Stinner's proposal, rather than just the wording in his preamble. He reiterated some of his points about performance not being the best rationale for the initial cleanup work that Stinner is talking about, but said there is enough confusion in the APIs to justify the cleanup:
- We're not sure which APIs other projects (including extension module generators and helper libraries like Cython, Boost, PyCXX, SWIG, cffi, etc) are *actually* relying on.
- It's easy for us to accidentally expand the public C API without thinking about it, since Py_BUILD_CORE guards are opt-in and Py_LIMITED_API guards are opt-out
- We haven't structured our header files in a way that makes it obvious at a glance which API we're modifying (internal API, public API, stable ABI)
The guards that Coghlan refers to are supposed to restrict the symbols available to programs; Py_BUILD_CORE is for the interpreter and related tools (effectively what Stinner would put in core) and Py_LIMITED_API is for the stable ABI (and is badly named according to several in the thread). Coghlan suggested making all of that more clear before tackling further questions:
And then *after* we've done that API clarification work, *then* we can ask the question about what the default behaviour of "#include <Python.h>" should be, and perhaps introduce an opt-in Py_CPYTHON_API flag to request access to the full traditional C API for extension modules and embedding applications that actually need it.
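To make that asymmetry concrete: an extension opts in to the stable ABI by defining Py_LIMITED_API before including Python.h, while Py_BUILD_CORE is defined only when building CPython itself. A minimal stable-ABI module might look like the following (the module and function names are made up for illustration):

    /* demo.c: restricted to the stable ABI by the Py_LIMITED_API define;
     * anything outside that ABI is hidden by the opt-out guards in the
     * headers.  CPython's own sources instead define Py_BUILD_CORE to
     * unlock the internal declarations. */
    #define Py_LIMITED_API 0x03060000    /* request the Python 3.6 stable ABI */
    #include <Python.h>

    static PyObject *
    hello(PyObject *self, PyObject *args)
    {
        return PyUnicode_FromString("hello from the limited API");
    }

    static PyMethodDef demo_methods[] = {
        {"hello", hello, METH_NOARGS, "Return a greeting."},
        {NULL, NULL, 0, NULL}
    };

    static struct PyModuleDef demo_module = {
        PyModuleDef_HEAD_INIT, "demo", NULL, -1, demo_methods
    };

    PyMODINIT_FUNC
    PyInit_demo(void)
    {
        return PyModule_Create(&demo_module);
    }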
In the proposal, Stinner said that, instead of including files between the different API directories, declarations should be duplicated in order to avoid mistakes in exposing declarations incorrectly. Duplication has its own set of dangers, however; Coghlan and others in the thread suggested a strict hierarchy of the APIs and their include files such that no duplication was needed but that definitions could not leak out into the other APIs incorrectly.
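A rough sketch of how such a hierarchy might be arranged (the file names and layering below are illustrative, not something specified in the thread): each layer includes only the stricter layer beneath it, so a declaration written once in the stable headers is visible everywhere, while anything added in the internal headers cannot leak into the public API.

    /* Include/stable/object.h: stable-ABI declarations only. */

    /* Include/python/object.h: the public (but version-specific) API. */
    #include "stable/object.h"       /* inherit the stable declarations */
    /* ... declarations that may change between releases go here ... */

    /* Include/core/object.h: CPython-internal API. */
    #include "python/object.h"       /* inherit the public declarations */
    /* ... internal-only declarations that must never leak out ... */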
Along the way, it became clear that the "API" and "ABI" terms were being tossed around without a clear description of what the pieces are. Brett Cannon took a stab at defining the various levels:
- The stable A**B**I which is compatible across versions
- A stable A**P**I which hides enough details that if we change a struct your code won't require an update, just a recompile
- An API that exposes CPython-specific details such as structs and other details that might not be entirely portable to e.g. PyPy easily but that we try not to break
- An internal API that we use for implementing the interpreter but don't expect anyone else to use, so we can break it between feature releases (although if e.g. Cython chooses to use it they can)
Coghlan mostly agreed with that, but thought that the portable API (#2 above) should still be able to change over time, subject to the standard Python deprecation policy. He sees the portable API as only exposing interfaces that are genuinely portable to, at least, PyPy. It would also stay as close to the stable ABI as possible, "with additions made *solely* to support the building of existing popular extension modules". So he sees the levels as follows:
- stable ABI (strict extension module compatibility policy)
- portable API (no ABI stability guarantees, normal deprecation policy)
- public CPython API (no cross-implementation portability guarantees)
- internal-only CPython core API (arbitrary changes, no deprecation warnings)
While Stinner's motivation may be different from others', it would seem that there is broad agreement that API rationalization is needed. How it all might look at a high level is also fairly non-controversial. An actual PEP that focuses strictly on the API clarification would seem to be the next step. Once that happens, assuming that it does, Stinner and others can start working on ways to make the portable API even more portable in support of various performance optimization experiments.
