
LWN.net Weekly Edition for February 25, 2021

Welcome to the LWN.net Weekly Edition for February 25, 2021

This edition contains the following feature content:

  • A pair of Python vulnerabilities: the flaws behind the latest round of Python releases.
  • How useful should copy_file_range() be?: which files this system call should be expected to copy.
  • An introduction to lockless algorithms: the first in a series on lockless programming in the kernel.
  • 5.12 Merge window, part 1: the first set of changes merged for the next kernel release.
  • NumPy 1.20 has been released: what NumPy is and what is new in this large release.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

A pair of Python vulnerabilities

By Jake Edge
February 24, 2021

Two separate vulnerabilities led to the fast-tracked release of Python 3.9.2 and 3.8.8 on February 19, though source-only releases of 3.7.10 and 3.6.13 came a few days earlier. The vulnerabilities may be problematic for some Python users and workloads; one could potentially lead to remote code execution. The other is, arguably, not exactly a flaw in the Python standard library—it simply also follows an older standard—but it can lead to web cache poisoning attacks.

Overflowing

The first vulnerability addressed by the updates is CVE-2021-3177, which is a buffer overflow that was reported in a bug filed on January 16. The problem occurs when the ctypes module, which provides C-compatible data types for Python, is used to convert a floating-point number into a string with a non-exponential approximation of its value. A sufficiently large number will crash the interpreter, as this proof of concept (adapted from the one in the bug report) shows:

    >>> from ctypes import *
    >>> c_double.from_param(1e30)
    <cparam 'd' (1000000000000000019884624838656.000000)>
    >>> c_double.from_param(1e300)
    *** buffer overflow detected ***: terminated
    Aborted

As the error message indicates, the problem is a buffer overflow, specifically because the sprintf() C library function, which "prints" a formatted string to a buffer, assumes that the buffer provided is large enough—with predictable results. Problems with sprintf() overflowing buffers probably go back over 50 years at this point, but there are safer alternatives. Bug reporter Jordy Zomer suggested using snprintf(), which takes a size parameter; it will write at most size-1 characters to the string, followed by the terminating NUL.

The offending code was also part of the report:

    case 'd':
        sprintf(buffer, "<cparam '%c' (%f)>",
            self->tag, self->value.d);
        break;

The "%f" format specifier requests the string form of the float value, so passing a really large number (1e300, say) will create a huge string that overflows the 256-byte, stack-based buffer. If a program was using that conversion on user-controlled input, code execution could be the result, since an attacker has some control of what the overflow writes to the stack. If that user-controlled input comes from across the network, of course, remote code execution is possible. However, on the python-dev mailing list, Stephen J. Turnbull expressed some doubts that the bug was actually all that exploitable; he thought it was probably not really necessary to rush out releases to address it.

Once the bug was reported, Benjamin Peterson fixed the problem in multiple Python branches in short order. Instead of using snprintf(), though, he used a combination of PyFloat_FromDouble() and PyUnicode_FromFormat() for the specific problem reported. As can be seen in the January 18 patch, he also changed other calls to sprintf() in the ctypes module to use PyUnicode_FromFormat() as well. Instead of taking a buffer, PyUnicode_FromFormat() calculates the size of the resulting string, allocates it, and returns it.

As the Python security bug-tracking page shows, the problem was fixed in two days. Roughly a month later, all four currently supported versions of Python were released. While Python 2.7 is no longer supported by the core developers, some distributions are still updating it for security problems; in this case the problem is only found in Python 3.x, so backporting a fix was not necessary. [Update: As pointed out in an email from Moritz Muehlenhoff, Python 2.7 actually is affected by this bug. He notes that python2 on Debian 10 ("Buster") is affected and has been updated. Also, Fedora has a fix in progress for its python2.7 package.]

Poisoning the web cache

The second flaw addressed in the releases is CVE-2021-23336, which is a poisoning attack against web caches due to a difference in interpretation of the separator for URL query parameters. The older HTML4 spec allowed both ";" and "&" to separate query parameters, but HTML5 only allows the latter. The urllib.parse module in the standard library followed the older spec, but that could lead to problems with web caches.

According to the bug tracking page, the problem was reported to the Python security response team (PSRT) in October 2020, but it became public in a Snyk blog post on January 18 and a bug report on January 19, both by Adam Goldschmidt.

The blog post is a wealth of information about the problem of web cache poisoning and how some Python web frameworks can fall prey to the problem. The basic idea is that web queries are often cached for performance reasons using tools like Squid. The key for the cache entry is typically part of the URL; if another request matches that key, the cached version is served rather than requesting it from the host. That can reduce the response time both because the remote host is not contacted, reducing the network latency, and because whatever processing is required on that host is not done, eliminating the application latency.

There are different kinds of web cache poisoning, but the one that affects Python revolves around the two servers, cache and web, interpreting the URL differently. Python-based web applications that use urllib.parse.parse_qsl() would see both the semicolons and ampersands as separating query parameters, but the cache servers typically do not. In addition, there is a set of query parameters that are usually not used as part of the cache key (e.g. utm_content and other utm_* parameters). That allows attackers to make a query of this sort:

    https://example.com/search/?q=common_term&utm_content=s;q=malicious_term

The web cache sees that query as having two parameters, one of which is not used in the key, but the Python application running on the web server sees it as three parameters, with the second q overriding the value of the first. Now the web cache has an entry for "common_term" that leads to cached results for "malicious_term". Until the cache entry expires, users are not getting what they expect at all—and "hilarity" ensues. In truth, of course, it can result in a wide variety of unpleasant, confusing, and potentially dangerous attacks.

The obvious solution is for web applications to stop processing the semicolon as a separator, but up until the changes merged February 14 (and released shortly thereafter), there was no way to force that behavior in urllib.parse. That patch changed the library to default to only allowing the ampersand to separate query parameters. It also added a new, optional separator argument to parse_qsl() (and the related parse_qs()) that allows developers to choose a different separator—the semicolon, perhaps. It should be noted, though, that there is no facility for having more than one separator, so restoring the previous behavior of accepting both query separators is not possible.
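As a quick illustration (mine, not from the patch), here is how a fixed parse_qsl() treats a query string like the one above, and how the new separator argument is used:

    >>> from urllib.parse import parse_qsl
    >>> query = "q=common_term&utm_content=s;q=malicious_term"

    >>> parse_qsl(query)                     # fixed releases: "&" is the only separator
    [('q', 'common_term'), ('utm_content', 's;q=malicious_term')]

    >>> parse_qsl("a=1;b=2", separator=";")  # opting in to ";" instead
    [('a', '1'), ('b', '2')]

An unpatched interpreter, treating both characters as separators, would instead return three tuples for the first query, with two entries for q.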

As might be guessed, the fix itself is pretty straightforward. The bulk of the patch consists of documentation and test changes (here too). Unlike the buffer overflow, though, this bug also affects Python 2.7, as noted on the Red Hat tracking page.

It could be argued that Python is not entirely at fault here. As mentioned in the bug report, the W3C did recommend the use of the semicolon in the HTML4 spec, though only in an appendix rather than in the main body of the spec. There were also a few people who thought that changing the default (and removing the ability to handle both separators) was not the right path forward, but urllib maintainer Senthil Kumaran decided that it was best to make the fix in urllib and to do so in a way to avoid confusing web proxies.

There is nothing especially noteworthy about these two bugs, but they do provide a bit of a look into the Python security process. The buffer overflow was addressed rather quickly, but the web cache poisoning problem took rather longer to resolve. It is not entirely clear what happened in the three months between the PSRT report and the public disclosure of the bug; after that, though, the developers moved fairly rapidly. Beyond that, it is always useful to take a peek at the kinds of vulnerabilities that are arising; if nothing else, it can help keep folks on the lookout for other, similar problem areas.

Comments (7 posted)

How useful should copy_file_range() be?

By Jonathan Corbet
February 18, 2021
The copy_file_range() system call looks like a relatively straightforward feature; it allows user space to ask the kernel to copy a range of data from one file to another, hopefully applying some optimizations along the way. In truth, this call has never been as generic as it seems, though some changes made during 5.3 helped in that regard. When the developers of the Go language ran into problems with copy_file_range(), there ensued a lengthy discussion on how this system call should work and whether the kernel needs to do more to make it useful.

The definition of copy_file_range() is:

    ssize_t copy_file_range(int fd_in, loff_t *off_in,
                            int fd_out, loff_t *off_out,
                            size_t len, unsigned int flags);

Its job is to copy len bytes of data from the file represented by fd_in to fd_out, observing the requested offsets at both ends. The flags argument must be zero. This call first appeared in the 4.5 release. Over time it turned out to have a number of unpleasant bugs, leading to a long series of fixes and some significant grumbling along the way.

In 2019 Amir Goldstein fixed more issues and, in the process, removed a significant limitation: until then, copy_file_range() refused to copy between files that were not located on the same filesystem. After this patch was merged (for 5.3), it could copy between any two files, falling back on splice() for the cross-filesystem case. It appeared that copy_file_range() was finally settling into a solid and useful system call.

Indeed, it seemed useful enough that the Go developers decided to use it for the io.Copy() function in their standard library. Then they ran into a problem: copy_file_range() will, when given a kernel-generated file as input, copy zero bytes of data and claim success. These files, which include files in /proc, tracefs, and a large range of other virtual filesystems, generally indicate a length of zero when queried with a system call like stat(). copy_file_range(), seeing that zero length, concludes that there is no data to copy and the job is already done; it then returns success.

But there is actually data to be read from this kind of file; it just doesn't show up in the advertised length, and the real length often cannot be known before the file is actually read. Before 5.3, the prohibition on cross-filesystem copies would have caused most such attempts to return an error code; afterward, they fail but appear to work. The kernel is happy, but some users can be surprisingly stubborn about actually wanting to copy the data they asked to be copied; they were rather less happy.
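The effect is easy to demonstrate from user space. Python's os.copy_file_range(), available since Python 3.8, wraps the system call; the sketch below (the file names are arbitrary) shows both the silent zero-byte copy and the fallback that callers are left to implement:

    import errno
    import os

    # /proc/version advertises a length of zero, so on affected kernels
    # copy_file_range() reports success after copying nothing; on other
    # kernels the cross-filesystem copy may simply fail with EXDEV.
    src = os.open("/proc/version", os.O_RDONLY)
    dst = os.open("/tmp/version-copy", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)

    try:
        copied = os.copy_file_range(src, dst, 65536)
        print("copy_file_range() copied", copied, "bytes")
    except OSError as e:
        copied = 0
        print("copy_file_range() failed:", errno.errorcode[e.errno])

    if copied == 0:
        # Fall back to an ordinary read/write loop, much as io.Copy() must.
        while chunk := os.read(src, 65536):
            os.write(dst, chunk)

    os.close(src)
    os.close(dst)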

Marking virtual filesystems

Nicolas Boichat tried to mollify those users with this patch set to copy_file_range(). It added a flag (FS_GENERATED_CONTENT) to the file_system_type structure for virtual filesystems where the length of the files cannot be known in advance. copy_file_range() would then look for that flag and return an error code when it was found; the error return would cause io.Copy() to fall back to a manual copy operation. This change appeared to solve the immediate problem, but it is not destined to be merged into the mainline.

There were a few objections, starting with the fact that it requires all virtual filesystems to be specially marked. The patch set did not mark them all, and this mechanism would require that developers be sure to mark all future filesystems as they were added. As Greg Kroah-Hartman put it: "That way lies madness and constant auditing that I do not see anyone signing up for for the next 20 years".

The bigger question, though, was whether this behavior should be seen as a bug at all. Boichat described it as a regression; code that would fall back to a normal copy before 5.3 would silently fail to copy data thereafter. Kroah-Hartman was unsure, though; he continued:

Why are people trying to use copy_file_range on simple /proc and /sys files in the first place? They can not seek (well most can not), so that feels like a "oh look, a new syscall, let's use it everywhere!" problem that userspace should not do.

Dave Chinner was, if anything, less sympathetic:

It is a targeted solution for *regular files only* on filesystems that store persistent data and can accelerate the data copy in some way (e.g. clone, server side offload, hardware offload, etc). It is not intended as a copy mechanism for copying data from one random file descriptor to another.

The use of it as a general file copy mechanism in the Go system library is incorrect and wrong. It is a userspace bug. Userspace has done the wrong thing, userspace needs to be fixed.

The problem with this attitude, as described by Go developer Ian Lance Taylor, is that figuring out when copy_file_range() can be used is not easy; he pointed out that these limitations are not mentioned in the copy_file_range() man page, and argued that this behavior reduces the utility of the system call considerably:

From my perspective, as a kernel user rather than a kernel developer, a system call that silently fails for certain files and that provides no way to determine either 1) ahead of time that the system call will fail, or 2) after the call that the system call did fail, is a useless system call. I can never use that system call, because I don't know whether or not it will work.

Chinner said that the test is whether it is possible to tell whether a file has data in it without calling read() on it. But Darrick Wong, hardly a filesystem amateur, replied: "I don't know how to do that, Dave. :)" There is another fun twist, as Boichat pointed out: files in sysfs, rather than indicating a zero length, claim to be 4,096 bytes long — regardless of their true length, which may be larger or smaller than that. Chinner's test will fail on those files, even if it can be reliably carried out.

Toward a real fix

Wong went on to express agreement with the Go developers: copy_file_range() should either work as expected or return an error so that user space can know to fall back to copying the old-fashioned way. He also suggested a couple of ways to possibly fix the problem, the first of which was to go back to the previous state of affairs, where cross-filesystem copies were explicitly disallowed. Failing that, one could restrict such copies to a single filesystem type that has explicit support for them. Luis Henriques implemented a variant of that idea, where copies across filesystems would still be allowed if the two filesystems were of the same type, and if the filesystem involved explicitly implements the copy_file_range() operation.

That patch was stopped in its tracks, though, after Trond Myklebust pointed out that the kernel's NFS daemon uses the copy_file_range() mechanism to copy files between filesystems of different types. Blocking that would break some important functionality; this usage pattern exists in other filesystems, such as Ceph and FUSE, as well. In response to that, Henriques added a new flag (COPY_FILE_SPLICE) that could be used within the kernel to indicate that a cross-filesystem-type copy should be performed. There was some question of whether this flag should be made available to user space for cases when it somehow knows that the operation would succeed, but it seems that will not happen.

A final version of this patch has not been posted as of this writing, but the eventual shape of the fix seems clear. When called from user space, copy_file_range() will only try to copy a file across filesystems if the two are of the same type, and if that filesystem has explicit support for the system call (and, thus, is presumably written with all of the possible cases in mind). Otherwise, the call will fail with an explicit error, so user space will know that it must copy the data some other way. So copy_file_range() will never be a generic file-copy mechanism, but it will at least be possible to use robustly in code that is prepared for it to fail.

There is still one more trap lurking within copy_file_range(), though. Like most I/O-related system calls, copy_file_range() can copy fewer bytes than requested; user space needs to check the return value to see how much work was actually done. There is currently no way to distinguish between copies that were cut short on the read side (by hitting the end of the file, perhaps) and those that were stopped on the write side (which may well indicate a write error). Nobody has come up with a real solution to that problem yet.

All of this goes to show how a seemingly simple interface can quickly become complex. copy_file_range() has revealed a number of sharp edges over its relatively short existence; there may well be more yet to be found. It is thus perhaps unsurprising that the kernel developers, having been burned more than once, feel a strong desire to keep its implementation as simple as possible.

Comments (26 posted)

An introduction to lockless algorithms

February 19, 2021

This article was contributed by Paolo Bonzini

Lockless algorithms are of interest for the Linux kernel when traditional locking primitives either cannot be used or are not performant enough. For this reason they come up every now and then on LWN; one of the last mentions, which prompted me to write this article series, was last July. Topics that arise even more frequently are read-copy-update (RCU — these articles from 2007 are still highly relevant), reference counting, and ways of wrapping lockless primitives into higher-level, more easily understood APIs. These articles will delve into the concepts behind lockless algorithms and how they are used in the kernel.

Low-level knowledge of the memory model is universally recognized as advanced material that can scare even the most seasoned kernel hackers; our editor wrote (in the July article) that "it takes a special kind of mind to really understand the memory model". It's been said that the Linux kernel memory model (and in particular Documentation/memory-barriers.txt) can be used to frighten small children, and the same is probably true of just the words "acquire" and "release".

At the same time, mechanisms like RCU and seqlocks are in such widespread use in the kernel that almost every developer will sooner or later encounter fundamentally lockless programming interfaces. For this reason, it is a good idea to equip yourself with at least a basic understanding of lockless primitives. Throughout this series I will describe what acquire and release semantics are really about, and present five relatively simple patterns that alone can cover most uses of the primitives.

Acquire, release, and "happens before"

In order to make the big step from the (relative) comfort of synchronization primitives to lockless programming, we shall first take a look at why locks work in the first place. Usually, this is taught in terms of mutual exclusion: locks prevent multiple threads of execution from reading or writing the same data concurrently. But what does "concurrently" really mean? And what happens when thread T is done with that data and thread U starts using it?

In order to answer these questions, we can turn to a theoretical framework that Leslie Lamport established in his 1978 paper "Time, Clocks and the Ordering of Events in a Distributed System". According to the paper, the events in a distributed system can be ordered according to whether an event P happens before another event Q:

  • The ordering of events is total within a single thread of execution. In layman's terms, for any two events from the same thread you can always say which came first and which came second.
  • If two events do not happen within a single thread of execution, event P happens before event Q if event P is a message send and event Q is the corresponding message receive.
  • The relation is transitive. Therefore, if event P happens before event Q and event Q happens before event R, then event P happens before R.

The "happens before" relation is a partial ordering: it is possible to have two events P and Q such that neither happens before the other. When this happens, the two events are concurrent. Remember how a lock prevents concurrent accesses to the same data structure? That is because, when you protect a data structure with a lock, all accesses to the data structure form a total ordering, as if they came from a single thread. Lamport's framework also provides a basic idea of what happens when a lock is handed off from a thread to another: some kind of "message passing" ensures that the unlock operation of thread T "happens before" the lock operation of thread U.

As it turns out, this is not just theory: in order to ensure that their caches are coherent, CPUs exchange messages over buses such as Intel's QPI or AMD's HyperTransport. However, this level of detail is definitely too much for our purposes. Instead, we will generalize the "happens before" definition to cover any kind of synchronization primitive.

Lamport's fundamental insight was that synchronization happens when two threads operate with symmetric operations on the same data structure. In our generalization, we will list pairs of operations (such as sending and receiving a message through the same queue) that synchronize one thread with another. Furthermore, we will classify the operations in each pair as either release or acquire; equivalently we can say that they "have release (or acquire) semantics".

Within a pair, the release operation synchronizes with the corresponding acquire operation. "Synchronizing" means that, whenever one thread performs a release operation and another thread performs the corresponding acquire operation, the "happens before" relation grows edges from the releasing thread to the acquiring thread. "Happens before" remains a partial order in the general case, but it now spans two threads, or more than two, thanks to transitivity. More formally:

  • The ordering of operations is total within a single thread.
  • If an operation P with release semantics synchronizes with an operation Q with acquire semantics, then operation P happens before operation Q, even if they happen in different threads.
  • Just like before, the relation is transitive and defines a partial ordering of operations.

The old definition follows by giving release semantics to message send and acquire semantics to message receive. A message send synchronizes the sending thread with the thread (or threads) that receives the message. We can also rephrase our previous findings according to this new definition: for locks to work, unlocking must have release semantics and must synchronize with locking—which in turn has acquire semantics. No matter if the lock is contended or not, the resulting "happens before" edges ensure a smooth hand-off from one thread to the other.

Acquire and release semantics may seem like an abstract concept, but they truly provide simple explanations of many common multithreaded programming practices. For example, consider two user-space threads accessing a global variable s:

    thread 1                              thread 2
    --------------------------------      ------------------------
    s = "hello";
    pthread_create(&t, NULL, t2, NULL);
                                          puts(s);
                                          s = "world";
    pthread_join(t, NULL);
    puts(s);

Are the accesses to the variable safe? Can thread 2 assume that s will read as "hello", and likewise can thread 1 assume that s will be "world" after pthread_join()? The answer is affirmative, and we can explain why in terms of acquire and release semantics:

  • pthread_create() has release semantics and synchronizes with the start of thread 2 (which has acquire semantics). Therefore, anything written before a thread is created can be safely accessed from the thread.
  • Exiting thread 2 has release semantics and synchronizes with pthread_join() (which has acquire semantics). Therefore, anything the thread writes before exiting can be safely accessed after pthread_join().

Note that data is flowing from one thread to the other with no lock involved: congratulations, you have made it through the first example of lockless programming. To sum up:

  • If the programmer wants thread 2 to "see the effects of" everything that happened so far in thread 1, the two threads need to synchronize with each other: this is done with a release operation in thread 1 and an acquire operation in thread 2.
  • Knowing which APIs provide acquire/release semantics lets you write code that relies on the ordering provided by those APIs.

Having understood how release and acquire semantics work for high-level synchronization primitives, we can now consider them in the context of individual memory accesses.

The message-passing pattern

In the previous paragraph we have seen how the acquire and release semantics of pthread_create() and pthread_join() allow the creator of a thread to exchange information with that thread and vice versa. We will now see how this kind of communication can happen in a lockless manner while threads run.

If the message is a simple scalar value, for example a boolean, it could be read and written directly to a memory location. However, consider what happens if the message is a pointer, as in the following example:

    thread 1                            thread 2
    --------------------------------    ------------------------
    a.x = 1;
    message = &a;                       datum = message;
                                        if (datum != NULL)
                                          printk("%d\n", datum->x);

If message is initially NULL, thread 2 will read either NULL or &a, we don't know which. The problem is that, even if

    datum = message;

were to read &a, that assignment is still not synchronized against the assignment in thread 1:

    message = &a;

Therefore, there is no edge in the happens before relation connecting the two threads:

    a.x = 1;                            datum = message;
       |                                    |
       |   happens before                   |
       v                                    v
    message = &a;                       datum->x

Because the two threads of execution are disconnected, there is still no guarantee that datum->x will read as 1; we don't know whether the assignment to a.x happens before that read. For this to happen, the store and load must be endowed with release and acquire semantics respectively.

To that end, we have the "store-release" and "load-acquire" operations. A store-release operation P, in addition to writing to a memory location, synchronizes with a load-acquire operation Q if Q reads the value that was written by P. Here is a fixed version of the above code, using Linux's smp_store_release() and smp_load_acquire():

    thread 1                                  thread 2
    --------------------------------          ------------------------
    a.x = 1;
    smp_store_release(&message, &a);          datum = smp_load_acquire(&message);
                                              if (datum != NULL)
                                                printk("%x\n", datum->x);

With this change, if datum is &a we can affirm that the store happened before the load. (I am assuming for simplicity that only one thread can write &a to message. Saying "thread 2 reads the value written by thread 1" does not refer to the specific bit pattern that goes in memory; it really means that thread 1's store is the last whose effect is visible to thread 2.) The relation looks like this now:

    a.x = 1;
       |
       v
    smp_store_release(&message, &a);  ----->  datum = smp_load_acquire(&message);
                                                  |
                                                  v
                                              datum->x

And everything works. Because of transitivity, whenever thread 2 reads the value written by thread 1, everything that thread 1 did up to the store-release will also be visible to thread 2 after the load-acquire. Note that, unlike the pthread_join() case, "synchronizes with" does not mean that thread 2 "waits for" thread 1 to do the write. The above drawing only applies if thread 2 happens to read the value that thread 1 has written.

In the Linux kernel, the above code will often be written in a slightly different manner:

    thread 1                              thread 2
    --------------------------------      ------------------------
    a.x = 1;
    smp_wmb();
    WRITE_ONCE(message, &a);              datum = READ_ONCE(message);
                                          smp_rmb();
                                          if (datum != NULL)
                                            printk("%x\n", datum->x);

In this case, the release and acquire semantics are provided by the memory barriers smp_wmb() and smp_rmb(). Memory barriers also have acquire and release semantics, but they are a bit more complicated to reason about than simple loads and stores. We will get back to them when we talk about seqlocks.

Regardless of whether one uses load-acquire/store-release or smp_rmb()/smp_wmb(), this is an extremely common pattern, and one that should be understood well. Among its uses we find:

  • All sorts of ring buffers. Each entry of the ring buffer often points to other data; usually, there are also head/tail locations that contain an index in the ring buffer. The producer side will use store-release operations, synchronizing with load-acquire operations in the consumer.
  • RCU. As far as the compiler is concerned, the familiar rcu_dereference() and rcu_assign_pointer() APIs are similar to load-acquire and store-release operations. Thanks to some assumptions that are true for all processors except the Alpha, rcu_dereference() can be compiled to a regular load; still, rcu_assign_pointer() synchronizes with rcu_dereference() as if it were a load-acquire operation.
  • Publishing pointers into an array. In this (modified) KVM snippet, if kvm_get_vcpu() sees the incremented kvm->online_vcpus, the associated entry in the array will be valid:
        kvm_vm_ioctl_create_vcpu()                     kvm_get_vcpu()
        -----------------------------------------      -----------------------------------------------
        kvm->vcpus[kvm->online_vcpus] = vcpu;          if (idx < smp_load_acquire(&kvm->online_vcpus))
        smp_store_release(&kvm->online_vcpus,            return kvm->vcpus[idx];
                          kvm->online_vcpus + 1);      return NULL;
    

Apart from the mechanics of the load-acquire/store-release operations, there is another aspect of the message-passing pattern that you should ponder: it is a single producer algorithm. If there are multiple writers, they must be protected against each other by other means, for example with a mutex. Lockless algorithms do not exist in a void; they are but one part of the concurrent programming toolbox, and they work best when combined with other, more traditional tools.

This is just the beginning of an extended series on lockless algorithms. The next installment will look further at how atomic memory operations can be ordered, and look into how memory barriers are at the heart of both the "seqcounts" mechanism and the Linux scheduler.

Comments (116 posted)

5.12 Merge window, part 1

By Jonathan Corbet
February 22, 2021
The beginning of the 5.12 merge window was delayed as the result of severe weather in the US Pacific Northwest. Once Linus Torvalds got going, though, he wasted little time; as of this writing, just over 8,600 non-merge changesets have been pulled into the mainline repository for the 5.12 release — over a period of about two days. As one might imagine, that work contains a long list of significant changes.

The most interesting changes merged for 5.12 so far include:

Architecture-specific

  • A number of 32-bit Arm platforms (efm32, picoxcell, prima2, tango, u300, and zx) have been removed after they turned out to have no users, as has the (non-Arm) c6x architecture. A long list of associated device drivers has been removed as well.

Core kernel

  • The BPF instruction set has gained a set of atomic operations; see this documentation commit and this merge message for more information.
  • BPF programs can now be attached to "bare" tracepoints — those which have no associated trace event visible from user space. Bare tracepoints exist out of the fear that a visible event could eventually become part of the kernel ABI and thus difficult to change. The BPF patch includes a warning that bare tracepoints have no ABI guarantees, but what will actually happen if a bare-tracepoint change breaks user-space code is not entirely clear.
  • BPF programs can now access data on the stack via pointers with variable offsets; this removes an annoying limitation that developers have had to work around until now. Consider an array on the stack, for example; previously it was only possible to access that array with a constant index, while now a variable index may be used. The verifier has been extended to ensure that such accesses remain within bounds. This relaxation only applies to privileged programs, though, due to fears of speculative-execution exploits.
  • Support for the "oprofile" profiling subsystem has been removed. Oprofile has not been actively used for some time, having long been supplanted by perf events.
  • The io_uring subsystem is now integrated with memory control groups so that its memory use is properly accounted for and regulated.
  • It is now possible to choose between the various scheduler preemption modes (none, voluntary, or full) at boot time if the PREEMPT_DYNAMIC configuration option is selected. There is a knob under debugfs that can be used to switch at run time as well.

Filesystems and block I/O

  • The LOOKUP_CACHED patches have been merged. These allow user space (along with io_uring) to request that a pathname lookup either be done without blocking or not be done at all.
  • The Btrfs filesystem has gained support for zoned block devices — sort of. Zoned-device support is clearly a work in progress and is not ready for real use yet.
  • F2FS now allows the specification of the compression algorithm and level to use; the LZ4 "high compression" mode is now supported.
  • The new FS_IOC_READ_VERITY_METADATA ioctl() command can be used to read the metadata from files protected by fs-verity. See this commit for details.

Hardware support

  • Clock: Allwinner H616 clock control units, Qualcomm SDX66 APCS clock controllers, Qualcomm SC8180X, SC7280, and SM8350 global clock controllers, and Qualcomm SDM660 multimedia clock controllers.
  • Media: OmniVision OV5648 and OV8865 image sensors, MaxLinear MXL692 tuners, IMI RDACM21 GMSL cameras, and Sony IMX334 sensors.
  • Miscellaneous: Broadcom power-management buses, Yamaha YAS530 3-axis magnetometers, Analog Devices AD5766/AD5767 digital-to-analog converters, Nintendo 64 audio controllers, Ingenic JZ4760 codecs, Khadas TS050 TFT-LCD panels, Google cr50 I2C TPM interfaces, Intel Keem Bay OCS hashing-control units, Marvell OcteonTX2 cryptographic accelerators, Microsoft Surface system aggregator modules, Aosong AHT10 temperature and humidity sensors, Texas Instruments TPS23861 802.3at PoE PSE controllers, Intel Keem Bay SoC non-secure watchdog timers, NVIDIA Tegra QSPI controllers, Acer Iconia Tab A500 embedded controllers, Qualcomm ADC5 SPMI PMIC thermal monitors, Silvaco dual-role master I3C controllers, and Toshiba Visconti GPIO controllers.
  • Networking: Arrow SpeedChips XRS7003/7004 gigabit Ethernet switches, Broadcom BCM4908 internal MACs, MediaTek MT7921E wireless interfaces, and Toshiba Visconti MACs.
  • Power supply: TI BQ256XX battery chargers, Analog Devices LTC4162-L chargers, and Acer Iconia Tab A500 batteries.
  • Regulator: Richtek RT4831 DSV regulators, Actions Semi ATC260x PMIC regulators, MediaTek DVFSRC regulators, and MediaTek MT6315 PMIC regulators.
  • USB: Cadence dual-role USB controllers and MaxLinear/Exar USB to serial converters.
  • The kernel now supports dynamic thermal power management via a subsystem that allows the power usage of groups of devices to be capped to meet thermal constraints. See this documentation commit for more information.

Networking

  • The rapidly developing multipath TCP implementation has gained the ability to attach a priority to specific subflows; some can, for example, be marked as only being for backup use should the primary flow(s) fail.
  • The IGMPv3 subsystem has gained "explicit host tracking" support; see this merge message for a few details.
  • The threaded NAPI polling patches have been merged; this work can improve performance for some workloads. There is a new sysfs knob that can be used to control threaded polling; see this commit for details.
  • The netfilter packet-filtering mechanism now supports the idea of "ownership" of specific tables. This allows, for example, a firewall daemon to maintain exclusive control of the tables it manages.

Security-related

  • The integrity measurement architecture (IMA) subsystem can now "measure" various bits of data within the kernel itself to ensure that they have not been tampered with. It can, for example, check the current SELinux policy for changes.

Virtualization and containers

  • The KVM subsystem has gained the ability to intercept Xen hypercalls and pass them to user space for emulation.

If the normal schedule stays in place, the 5.12 merge window can be expected to close on February 28. It is not clear, at this point, whether the loss of nearly one week of testing time will lead to an extension of the merge window or not. Regardless, stay tuned for our summary of the remainder of the merge window, to be posted after the window closes, whenever that may be.

Comments (5 posted)

NumPy 1.20 has been released

February 23, 2021

This article was contributed by Lee Phillips

NumPy is a Python library that adds an array data type to the language, along with providing operators appropriate to working on arrays and matrices. By wrapping fast Fortran and C numerical routines, NumPy allows Python programmers to write performant code in what is normally a relatively slow language. NumPy 1.20.0 was announced on January 30, in what its developers describe as the largest release in the history of the project. That makes for a good opportunity to show a little bit about what NumPy is, how to use it, and to describe what's new in the release.

What is NumPy?

NumPy has been a big part of what makes scientific Python possible. It is the library at the core of SciPy and most other science or engineering ecosystems for Python. It allows researchers to use a scripting language with a friendly syntax for the analysis of data from simulations and experiments. It can even allow them to run many types of simulations in Python itself. NumPy is largely responsible for popularizing the idea that Python can be used for the kind of work that, previously, had been the almost exclusive domain of Fortran. This has had a cascading effect, leading to the wider appreciation of open-source tools and libraries by the scientific community, and to the emergence of new modes of sharing results; in particular, Jupyter notebooks.

NumPy is released under the BSD 3-Clause License and is developed on GitHub.

NumPy adds a new data type to Python: the multidimensional ndarray. This is a container, like a Python list, but with some crucial differences. A NumPy array is usually homogeneous; while the elements of a list can be of various types, an ndarray will, typically, only contain a single, simple type, such as integers, strings, or floats. However, these arrays can instead contain arbitrary Python objects (i.e. descendants of object). This means that the elements will, for simple data types, all occupy the same amount of space in memory. The elements of an ndarray are laid out contiguously in memory, whereas there is no such guarantee for a list. In this way, they are similar to Fortran arrays. These properties of NumPy arrays are essential for efficiency because the location of each element can be directly calculated.

Beyond just adding efficient arrays, NumPy also overloads arithmetic operators to act element-wise on the arrays. This allows the Python programmer to express computations concisely, operating on arrays as units, in many cases avoiding the need to use loops. This does not turn Python into a full-blown array language such as APL, but adds to it a syntax similar to that incorporated into Fortran 90 for array operations.

Before delving into more detailed examples in the next section, let's take a brief look at how this works. Suppose we have a Python program with two lists that we want to add together, element by element, to make a new list. Simple code to do so might look something like this:

    >>> a = [1, 2, 3]
    >>> b = [4, 5, 6]
    >>> c = []

    >>> for i in range(len(a)):
    ...     c.append(a[i] + b[i])

After this code is run, c will be [5, 7, 9].

The equivalent code using NumPy will look like this:

    >>> import numpy as np     # the conventional NumPy import
    
    >>> a = np.array([1, 2, 3])
    >>> b = np.array([4, 5, 6])

    >>> c = a + b

The library overloads the "+" operator to act element-wise on the NumPy array type, so no looping is required. The result, c, will not be a list but another ndarray:

    >>> c
    array([5, 7, 9])

    >>> type(c)
    numpy.ndarray

The same syntax can be used for arrays of any dimension. Element-wise multiplication of two matrices is accomplished in the obvious way:

    >>> a = np.array([[1, 2], [3, 4]])
    >>> b = np.array([[2, 2], [2, 2]])  

    >>> a * b

    array([[2, 4],
           [6, 8]])

But what if we want matrix multiplication, rather than element-wise multiplication? NumPy also implements the binary operator "@" for that:

    >>> a @ b

    array([[ 6,  6],
           [14, 14]])

    >>> b @ a

    array([[ 8, 12],
           [ 8, 12]])

The above examples show how NumPy extends Python arithmetic to operate on entire arrays, but the library offers far more than this. There is a complete set of facilities for creating and manipulating arrays, including transposing, rotating, inverting, joining, and selecting sub-arrays; a sub-library for Fourier transforms; a set of linear algebra routines including such things as computing eigenvalues, determinants, and solving systems of linear equations; and more. NumPy also includes a subsystem called F2PY for creating interfaces between Python scripts and Fortran subroutines.

Much of this capability, especially in the realm of linear algebra, is provided by the widely-used BLAS and LAPACK libraries, for which NumPy acts as a convenient wrapper for the Python programmer. The user has the choice of using the versions supplied with NumPy or installing [147-page PDF] a hardware-specialized version for maximum efficiency.

NumPy performs admirably when used with standard array calculations that fit the patterns that it is designed to optimize, such as array multiplication or element-wise arithmetic. Its advantages are limited to such calculations, however, and this can be considered as the boundary that delimits the practical usefulness of Python for scientific computing.

More complex algorithms that cannot be expressed as a sequence of simple array operations, and therefore require inner loops and conditionals, will run at the speed of the Python interpreter, and not scale to large problems. The user then has several choices: to turn to a language designed for high-performance computing, such as Fortran or Julia; to resort to alternative implementations of Python, such as Numba, PyPy, or Cython; or to develop in two languages, using Python as glue and a fast language for the numerically intensive parts. Here is a comparison of some of these approaches, which also serves as a brief tutorial, with examples.

Examples

In order to get the recent release of NumPy, the easiest route is probably to use pip to get it from the Python Package Index (PyPI). As usual with Python, the standard recommendation is to do this installation, and the installation of anything else to be used with numpy, such as IPython, in a virtual environment.

In my recent article about SciPy, I demonstrated the use of its image-processing routines to show some of what it can do. Under the hood, all of these image transformations are accomplished by converting the image to a matrix of numbers and using NumPy to transform the matrix in various ways. It will be instructive to carry out some similar image transformations now, not by using the image-processing library, but by using NumPy array manipulations directly. This will demonstrate some of the things that can be done with NumPy and its arrays in a way that makes the operations more intuitive because of the direct visual results.

First we need to import numpy and the plotting library (needed to display the results as images) as well as read in an image file and store it as a NumPy array. All the examples in this article use NumPy 1.20.0 and Python 3.9.1.

    >>> import numpy as np
    >>> import matplotlib.pyplot as plt

    >>> img = plt.imread("bacall.jpg")

Now img will be a NumPy array. We can check this, and also get its dimensions:

    >>> type(img)
    numpy.ndarray

    >>> np.shape(img)
    (550, 440)

The image has dimensions 550 by 440. We can prepare this image using the following code:

    >>> plt.imshow(img, cmap='gray')

That tells plt that we would like the intensities mapped to a grayscale palette. Now we must either call plt.savefig() to save the image in a file or plt.show() to display it on the screen. We'll omit these calls, from here on, for brevity, just showing the matrix transformations.

Looking at the effect on images is a good way to check whether our mental model of certain matrix operations is correct. The image below shows a collection of operations done on the original image. NumPy has a built-in transpose() operator that switches axes; "Transposed" is the result of np.transpose(img). Certain operations are easier to show than to explain in words; see the "Rolled" image for the result of np.roll(img, 200). Combining flip(), which does what it sounds like, with hstack(), which concatenates arrays horizontally, as np.hstack((img, np.flip(img))) does, generates the "Flipped" image; note that hstack() takes a tuple argument, thus the "extra" parentheses.
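Collected in one place (and reusing the img array read in above), those operations look like this:

    >>> transposed = np.transpose(img)            # swap the two axes
    >>> rolled = np.roll(img, 200)                # shift the flattened pixels by 200
    >>> flipped = np.hstack((img, np.flip(img)))  # img next to a copy reversed along both axes

    >>> plt.imshow(flipped, cmap='gray')
    >>> plt.show()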

[Lauren Bacall transformations]

We could implement more sophisticated image processing, such as blurring, filtering, edge detection, and so on, by combinations of Fourier transforms, numerical derivatives, and other straightforward operations on the image matrix using NumPy.

The new release

There are several aspects to the new release of NumPy that will be significant for current users; some of the changes also give indications about NumPy's intentions for the future. First of all, NumPy 1.20 requires at least Python 3.7; no earlier version of Python will work. Beyond that, removing compatibility with Python 2 allowed additional code cleanup and elimination of what the release notes called "technical debt".

One class of techniques that NumPy uses to speed up Python code is to perform certain calculations in parallel, when the opportunity is afforded by the hardware that the code is running on. There are many forms of parallel processing. The general category that NumPy implements is single instruction, multiple data (SIMD) parallelism. There are two kinds of SIMD calculation, and both involve performing arithmetic on different elements of an array simultaneously. The first kind is vectorization, where a single CPU (or core) carries out an arithmetic operation on a small number of elements, typically four or eight, in parallel, continuing until the entire array is processed. The second kind of SIMD parallelism divides the array among a set of processors, from the small handful found on a laptop to the thousands of processors on a supercomputer. This is the form of parallelism provided by GPU array processors.

NumPy takes advantage of both types of SIMD parallelism when possible. Multicore, distributed parallel computation is provided by the BLAS routines wrapped by NumPy; whenever a NumPy operation calls a routine from this library, it takes over the parallelization of the calculation, and will utilize all of the processing units available, transparently to the user.

For example, matrix multiplication using the @ operator calls, on my machine, an OpenBLAS routine that uses all the cores on my laptop. If I had installed NumPy using Anaconda, rather than pip, the same Python code would call out to the Intel MKL library, which also implements distributed SIMD parallelism.
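Which library a given installation actually wound up with can be checked from the interpreter; np.show_config() prints the build-time BLAS/LAPACK configuration (the output, not reproduced here, varies from machine to machine):

    >>> np.__version__
    '1.20.0'
    >>> np.show_config()    # lists the BLAS/LAPACK libraries this build links against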

The typical user need not be aware of any of this, but sophisticated users have always been able to choose the best fit for their target hardware from among several available linear algebra libraries, and compile a custom version of NumPy that links to it. The new release of NumPy extends similar flexibility and control into the realm of vectorization. The main advance is, again, something that most people need not be aware of, and the only visible change to most users will be a possible increase in performance. This increased performance is due to the new ability of NumPy to detect, at runtime, which vectorization instructions are provided by the CPU, and choose a code path that takes advantage of them. The same line of Python will now execute different code, depending on what NumPy can detect about the hardware. One side effect of this is that NumPy is a bit bigger, because the compiled library now contains code paths for different CPU architectures.

As part of this initiative, there are several new build arguments for use in compiling custom versions of NumPy. These allow the user to select from among various vectorization optimizations, to help in tuning NumPy to a particular architecture, or even to a particular computation pattern. To aid this even further, there is now a #define called NPY_DISABLE_OPTIMIZATION that can be used to turn off optimizations for particular sections of the NumPy source code.

Another significant new feature in NumPy 1.20 is the type annotations. Optional static type annotations were added to Python with PEP 484, an addition to the language that was not without controversy. These type hints can be used by linters, type checkers, and development environments, especially to assist with code completion and refactoring.

NumPy now embraces PEP 484 type annotations in two ways. The NumPy library code is now annotated with type information. This had been discussed before, but dropping Python 2 support in the latest release of NumPy cleared the way for adding type annotations without the awkwardness of supporting Python versions where they can't be used. An examination of the files in the NumPy package that I installed shows that annotations have been implemented as stub files.

The second new feature related to type annotations is the addition of new types, defined and implemented in a new NumPy submodule called numpy.typing. The two new types mentioned in the documentation are ArrayLike and DTypeLike. The rationale is to enable the use of type checkers to help the user avoid patterns that, while legal in NumPy, would be inefficient. While using arbitrary objects in an ndarray is legal, it does not generally work with BLAS and other libraries that expect simple values. The new static types can be used to warn the user if they create such data structures.
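A minimal sketch (my own, hypothetical function) of how these types are meant to be used:

    import numpy as np
    import numpy.typing as npt

    # Accept anything NumPy can coerce into an array of floats, and promise
    # to return a concrete ndarray; a type checker such as mypy can then
    # flag callers that pass unsuitable arguments.
    def normalize(data: npt.ArrayLike) -> np.ndarray:
        arr = np.asarray(data, dtype=np.float64)
        return (arr - arr.mean()) / arr.std()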

In addition to the two new types mentioned in the documentation, the typing module contains definitions for a handful of others, such as FloatLike, ComplexLike, and ShapeLike. NumPy type annotations are still at an early phase and users can expect them to evolve significantly.

Another addition is a new keyword argument, where, in the numpy.all() and numpy.any() functions. These functions return True or False according to whether all, or any, elements of an array satisfy a condition. For example, suppose we define an array containing some random integers this way:

    >>> ri = np.random.randint(0, 20, 8)
    >>> ri
    array([ 9,  7,  6,  2, 14, 13, 19, 16])

The random.randint() function takes a minimum, maximum, and an output shape, and returns an array of uniformly distributed random integers in the given range. We can ask if any or all of the integers are less than three:

    >>> np.any(ri < 3)
    True
    >>> np.all(ri < 3)
    False

Using the new argument, we can limit which elements are looked at in the test. The where argument must be set to an array of Boolean values, which is used as a mask to pick out the elements to be tested. NumPy makes it easy to create such logical masks with a simple Boolean statement. Here we create a mask that maps which elements are even, using the ndarray modulus operator:

    >>> ri%2 == 0
    array([False, False, True, True, True, False, False, True])

Now we can ask "are any of the even (or odd) numbers in the array less than three?":

    >>> np.any(ri < 3, where=ri%2 == 0)
    True
    >>> np.any(ri < 3, where=ri%2 == 1)
    False

This new argument leads to more succinct code, and more opportunities for parallelization in certain circumstances. It can also be applied to three statistical functions: mean(), std(), and var(), for the mean, standard deviation, and variance, as, for example, a concise way to eliminate outliers from the computation of these distribution parameters.
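For instance, here is a rough sketch of that outlier-trimming idea (the data and the threshold are made up for illustration):

    >>> data = np.array([9.8, 10.1, 9.9, 10.2, 250.0])

    >>> np.mean(data)                      # the outlier dominates
    58.0
    >>> np.mean(data, where=data < 100)    # mean of the four sensible values
    10.0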

The new function sliding_window_view() provides a convenient way to generate a moving window over an array for doing such things as creating moving averages. This is quite convenient, since the alternatives using index arithmetic can lead to messy code. A simple example will serve better to show how this works than an attempt to explain it in words:

    >>> a = np.array([1, 2, 3, 4, 5, 6, 7])
    >>> np.lib.stride_tricks.sliding_window_view(a, 3)
    array([[1, 2, 3],
           [2, 3, 4],
           [3, 4, 5],
           [4, 5, 6],
           [5, 6, 7]])

As the example shows, this lives in the stride_tricks module inside the NumPy lib module. The documentation cautions that this function can be quite slow; it is intended for prototyping or for small arrays.
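Continuing the example above, a three-point moving average is just a mean over the second axis of the view:

    >>> windows = np.lib.stride_tricks.sliding_window_view(a, 3)
    >>> windows.mean(axis=1)
    array([2., 3., 4., 5., 6.])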

Many of the changes in the new release are deprecations: functions have moved, either from the top-level numpy namespace into submodules or up from submodules into the top level, and the spelling of some data types has changed. Examples are the financial functions (present and future value, rates of return, etc.), now moved into their own numpy-financial package; the functions in numpy.dual have now moved into the top level.

The spellings of some data types have also been changed: type codes such as Uint32, that begin with upper-case letters, are now spelled uint32; the former style is deprecated. Certain data types that were denoted as, for example, numpy.int are now to be called simply int; see this table for a complete list. There were some possibly confusing notations for complex number data types, previously deprecated, whose deprecation periods have expired with the new release. These were data types such as Complex64, where the "64" referred to the size of the real or imaginary part separately. Users must now use complex128 instead of Complex64.

Historically, deprecations in NumPy have had long sentences in purgatory, at least one that lasted 14 years. Nevertheless, the reorganization that is part of this large release might be a good opportunity for cleaning up deprecated references in existing code bases.

The scale of the work devoted to the new release of NumPy, especially in the area of SIMD optimization, shows that its large community of developers see a continuing future for Python as a partner in scientific research and engineering calculation. There are many excellent alternatives for high-performance scientific numerics, from Julia to Fortran. Part of what Python brings to the table is the network effect of its large user base, and the existence of many solid scientific libraries. The consequence is that there is a good chance that a computational scientist today will wind up programming using NumPy. It's clear that this infrastructure is on solid ground, has a lot of talent behind it, and will only continue to improve.

Comments (18 posted)

Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

  • Briefs: Google memory-safety mitigation effort; Hibernation under kernel lockdown; Kodi 19; Firefox 86; Quotes ...
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Copyright © 2021, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds