A new major version of NumPy

By Daroc Alden
July 19, 2024

The NumPy project released version 2.0.0 on June 16, the first major release of the widely used Python-based numeric-computing library since 2006. The release has been planned for some time, as an opportunity to clean up NumPy's API. As with most NumPy updates, there are performance improvements to several individual functions. There are only a few new features, but several backward-incompatible changes, including a change to NumPy's numeric-promotion rules. The Python API changes require only relatively minor updates to Python code that uses the library, but the C API changes may be more difficult to adapt to. In both cases, the official migration guide describes what needs to be adapted for the new version.

The 2.0.0 release was in development for 11 months, much longer than the typical NumPy release. In that time, 212 contributors sent 1078 pull requests — nearly a tenth of all pull requests made to date against the project's GitHub repository. The NumPy 2.0.0 transition manages to address many small problems across the API, while not creating too much work for users of the library.

Python

The most pervasive change that Python users of the library will need to be aware of is a change to the numeric-promotion rules, which NumPy uses when an operation combines numeric values of different types. This change has been under consideration since 2021, and serves to make NumPy's automatic numeric promotions more predictable. Prior to the change, NumPy would consider the value of a number when deciding whether to do a promotion. For example, in NumPy 1.24.0 (a release from the 1.x series, which ended with 1.26), adding a Python integer to an array produces an array of whatever type will fit the result:

    >>> import numpy as np   # NumPy is traditionally imported under the np alias
    >>> np.array([1, 2, 3], dtype=np.int8) + 1
    array([2, 3, 4], dtype=int8)
    >>> np.array([1, 2, 3], dtype=np.int8) + 256
    array([257, 258, 259], dtype=int16)

In the above example, the array has been promoted to hold int16 values in order to avoid overflow. For the most part, this kind of value-dependent promotion does not cause too many problems, since the various algorithms NumPy provides are mostly implemented for all of the supported types. Still, there are some sharp edges, such as unexpected variations in performance or memory use depending on the values being processed. To help make the type of resulting values more predictable, NumPy 2.0.0 instead uses a system where the type of an output depends only on the types of the inputs. In particular, scalar Python values are now considered "weakly typed", and no longer influence the type of a result (other than potentially raising an error if the value would not fit):

    >>> np.array([1, 2, 3], dtype=np.int8) + 1
    array([2, 3, 4], dtype=int8)
    >>> np.array([1, 2, 3], dtype=np.int8) + 256
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    OverflowError: Python integer 256 out of bounds for int8
    >>> np.array([1, 2, 3], dtype=np.int8) + 127
    array([-128, -127, -126], dtype=int8)

Even when an error is not raised, adding Python values to NumPy arrays can now cause integer overflows. This change is obviously inconvenient for existing programs, which could end up with unexpected exceptions or numerically incorrect results. Luckily, there is a way to test whether programs are affected before upgrading. NumPy versions 1.24.0 and later read the NPY_PROMOTION_STATE environment variable to decide whether to use the new behavior or not. A value of legacy preserves the old behavior. Setting the variable to weak selects the new behavior, while weak_and_warn issues a warning whenever the old and new behaviors would differ.
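One way to make affected code independent of the promotion rules is to widen the array explicitly before combining it with a large Python integer. The following is a minimal sketch of that idea, not a recommendation taken from the migration guide; the variable names are purely illustrative:

```python
import numpy as np

# int8 can only hold values from -128 to 127.
small = np.array([1, 2, 3], dtype=np.int8)

# Widening the array first makes the result type explicit: an int16 array
# plus a Python integer that fits in int16 stays int16 under both the old
# and the new promotion rules.
widened = small.astype(np.int16) + 256
print(widened, widened.dtype)
```

Because the cast is spelled out, the result no longer depends on which promotion state is in effect.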

This change brings NumPy in line with other Python libraries that generally use a similar integer-promotion system. There is an unofficial API standard for Python array libraries that specifies as much.

In NumPy 2.0.0, when two NumPy arrays with different data types are used, the output type is the larger of the two input types. In this diagram that accompanies the promotion proposal, that means the type that is further up or to the right:

[Diagram: type-promotion lattice from the promotion proposal]
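The array-to-array rule can be explored with np.result_type(), which reports the data type NumPy would pick for a combination of inputs. This is a small sketch; note that "larger" sometimes means a third type that can hold both inputs, as in the signed/unsigned case below:

```python
import numpy as np

# np.result_type reports the dtype NumPy would use when combining its arguments.
same_kind = np.result_type(np.int8, np.int16)          # the wider integer type
mixed_sign = np.result_type(np.int8, np.uint8)         # a type that holds both ranges
int_and_float = np.result_type(np.int32, np.float32)   # float32 cannot hold every int32
print(same_kind, mixed_sign, int_and_float)

# The same rule applies to an actual operation on two arrays.
total = np.array([1, 2], dtype=np.int8) + np.array([1000, 2000], dtype=np.int16)
print(total, total.dtype)
```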

Numeric promotion is not the only backward-incompatible change in the NumPy Python API, however. The project also took advantage of the major release to remove a number of deprecated aliases for various functions, and generally clean up the namespace of the library. The proposal justifies the removals by pointing out that having so many functions with similar names is a hurdle both for new users learning the library for the first time, and for the many projects such as compilers or specialized numeric libraries that need to be compatible with NumPy's API. The removed names had all been deprecated for some time, and the cleanup has the appealing property that code written against the NumPy 2.0.0 API will still run on NumPy 1.2x.x.
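As an illustration of what the cleanup looks like in practice, the sketch below assumes that np.float_, np.Inf, and np.NaN are among the removed aliases (the migration guide has the authoritative list), and uses the canonical spellings that work on both sides of the transition:

```python
import numpy as np

# Canonical spellings that exist in both the 1.x series and 2.0.0.
values = np.array([1.5, np.inf, np.nan], dtype=np.float64)

# Aliases assumed to have been removed in 2.0.0; hasattr() returns False
# for a name once it is gone from the namespace.
old_aliases = ("float_", "Inf", "NaN")
still_present = [name for name in old_aliases if hasattr(np, name)]
print(values.dtype, still_present)
```

On a 1.x installation the list should still contain the old names; on 2.0.0 it should be empty.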

NumPy 2.0.0 also changes the pickle serialization format for NumPy objects. This change is technically backward-compatible, in that objects serialized by NumPy 2.0.0 are understood by NumPy 1.24.0 and vice versa, but any attempt to read them with a NumPy version older than 1.24.0 will raise an error. Still, the new format allows for objects larger than 4GB to be saved and results in a 5% performance improvement for large objects.

There have also been performance improvements to NumPy's sorting functions, although they are architecture-specific, and may not be useful to everyone. The release also adds support for macOS Accelerate, which significantly improves linear-algebra performance on macOS. With so many architecture-specific performance improvements, the project has also added a function that shows what hardware features NumPy has detected and can take advantage of, to better debug why an architecture-specific optimization does or does not apply.

The main completely new feature in the Python API is the addition of a new data type for handling variable-length strings in a NumPy array, along with a numpy.strings module containing functions for working with them efficiently.
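A brief sketch of the new data type follows. It assumes the type is exposed as numpy.dtypes.StringDType and uses only the str_len() and add() functions from numpy.strings; the version guard is there because none of this exists before 2.0.0:

```python
import numpy as np

# StringDType and numpy.strings only exist in NumPy 2.0.0 and later,
# so this sketch does nothing on older versions.
HAS_NUMPY_2 = np.lib.NumpyVersion(np.__version__) >= "2.0.0"
lengths = None
doubled = None

if HAS_NUMPY_2:
    from numpy.dtypes import StringDType

    # Elements may have different lengths without padding to a fixed width.
    words = np.array(["numpy", "a much longer string"], dtype=StringDType())
    lengths = np.strings.str_len(words)     # element-wise length in characters
    doubled = np.strings.add(words, words)  # element-wise concatenation
    print(lengths, doubled)
```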

Generally, code written for version 1.x may need updating, but code written for version 2.0.0 will work with older versions. In the rare case that a program needs to implement different code to work with different NumPy versions, the migration guide recommends using np.lib.NumpyVersion(np.__version__) to compare the current NumPy version to an appropriate constant.
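The version check might look something like the following sketch; the api_generation variable is a hypothetical stand-in for whatever version-specific code a program needs:

```python
import numpy as np

# NumpyVersion understands pre-release suffixes such as "rc1", which a
# plain string comparison would order incorrectly.
current = np.lib.NumpyVersion(np.__version__)

if current >= "2.0.0":
    api_generation = "2.x"   # code path written for the 2.0.0 API
else:
    api_generation = "1.x"   # fallback for older releases
print(np.__version__, api_generation)
```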

C API

C code that builds against NumPy 1.24.0 will not dynamically link with NumPy 2.0.0, but the new version contains a compatibility header file that allows code built against NumPy 2.0.0 to link with NumPy 1.24.0. C code that needs to detect the NumPy version at run-time can use:

    if (PyArray_RUNTIME_VERSION >= NPY_2_0_API_VERSION) { ... }

The most pervasive change is to the definition of the PyArray_Descr structure that is used for interacting with NumPy arrays. In order to allow the library to be developed more quickly without breaking the C API, the project has made most of the structure off-limits, requiring the use of a set of accessor functions. Some fields, such as the type number (which encodes the data type of the array, one of the most common things to look up in the structure), remain directly accessible.

The project has also changed the type of complex numbers. In previous versions, complex numbers were represented with a structure. Now, they use the standard C99 _Complex types. Another difference is that NumPy has increased the maximum number of dimensions that can be present in an array from 32 to 64, and changed how it represents axis=None arguments in the C API. When a Python caller did not specify the axis for an operation, the C API used to represent that as 32; now it is represented by -2147483648, the minimum value of a 32-bit integer.

In all, existing users of the NumPy library have some adapting to do. Hopefully the simplified API, more consistent type promotion rules, and new pickle format make the tradeoff worthwhile.

Ruff support for the transition

Posted Jul 21, 2024 21:49 UTC (Sun) by xyz (subscriber, #504)

One of the interesting aspects of this major version change is that, for some time now, the project has offered a migration rule that is included in ruff:

https://numpy.org/devdocs/numpy_2_0_migration_guide.html#...

Yes, I am aware this only applies to the Python code but in any case it was nice to see this concern with users and with tools to help the transition.