
LWN.net Weekly Edition for May 13, 2021

Welcome to the LWN.net Weekly Edition for May 13, 2021

This edition contains the following feature content:

  • Holes in the WiFi: the FragAttacks vulnerabilities in WiFi fragmentation and aggregation.
  • Pyodide: Python for the browser: running the CPython interpreter in the browser via WebAssembly.
  • A pair of memory-allocation improvements in 5.13: bulk page allocation and huge-page vmalloc() mappings.
  • Noncoherent DMA mappings: a new API for managing noncoherent DMA buffers.
  • The second half of the 5.13 merge window: the rest of the changes merged for 5.13.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Holes in the WiFi

By Jake Edge
May 12, 2021

The discoverer of the KRACK attacks against WPA2 encryption in WiFi is back with a new set of flaws in the wireless-networking protocols. FragAttacks is a sizable group of WiFi vulnerabilities that (ab)use the fragmentation and aggregation (thus "Frag") features of the standard. The fixes have been coordinated over a nine-month period, which has allowed security researcher Mathy Vanhoef time to create multiple papers, some slide decks, a demo video, patches, and, of course, a web site and logo for the vulnerabilities.

Three of the vulnerabilities are design flaws in the WiFi standards, so they are likely present in all implementations, while the other nine are various implementation-specific problems. The design flaws may be more widespread, but they are much harder to exploit "because doing so requires user interaction or is only possible when using uncommon network settings". That means the real danger from FragAttacks lies in the programming errors in various WiFi implementations. "Experiments indicate that every Wi-Fi product is affected by at least one vulnerability and that most products are affected by several vulnerabilities."

In fact, in the FAQ section of the web site, Vanhoef offers to list any products that he can verify as not having been affected by the flaws described on the site. He also notes that even though the design flaws are difficult to exploit on their own, they can be combined with the other flaws found to make for a much more serious problem. "In other words, for some devices the impact is minor, while for others it's disastrous."

Fragging

As the names would imply, fragmentation and aggregation refer to ways that wireless "frames" can be split apart or coalesced depending on various constraints; large frames can be fragmented for reliability purposes, while smaller frames can be aggregated for better network throughput. Fragmentation problems that Vanhoef noticed in Linux while working on the KRACK attacks in 2017 first drew his attention to this area, but he put off looking into them more closely until 2020.

Fast-forward three years later, and after gaining some additional ideas to investigate, closer inspection confirmed some of my hunches and also revealed that these issues were more widespread than I initially assumed. And with some extra insights I also discovered all the other vulnerabilities.

Aggregation is indicated in a frame with an "is aggregated" flag, but that flag is not protected ("authenticated" is the term Vanhoef uses) with the rest of the header, so an adversary can change its value without invalidating the frame. If the attacker can trick the victim into connecting to a dodgy server, they can cause the victim to process the encrypted data in an unintended way. That can lead to injecting arbitrary network packets into the victim's system by setting the aggregation flag for carefully selected frames. In the demo, that flaw is used to cause the victim to use a malicious DNS server.

The fix for that problem is obvious: add the flag to the protected portion of the frame. Ironically, the standard already has a way to do so, but it is not implemented by devices. Herein lies a lesson for those implementing "secure" systems:

Unfortunately, many products already implemented a draft of the 802.11n amendment, meaning this problem had to be addressed in a backwards-compatible manner. The decision was made that devices would advertise whether they are capable of authenticating the "is aggregated" flag. Only when devices implement and advertise this capability is the "is aggregated" flag protected. Unfortunately, in 2020 not a single tested device supported this capability, likely because it was considered hard to exploit. To quote a remark made back in 2007: "While it is hard to see how this can be exploited, it is clearly a flaw that is capable of being fixed."

In other words, people did notice this vulnerability and a defense was standardized, but in practice the defense was never adopted. This is a good example that security defenses must be adopted before attacks become practical.

Fragmentation is rarely enabled by devices, so the two design flaws found there have even less impact. Each fragment that belongs to the same frame is encrypted using the same key, but receivers are not required to ensure that is the case and will reassemble frames from fragments encrypted with different keys. "Under rare conditions this can be abused to exfiltrate data."

In addition, WiFi devices are not required to flush fragments that they have received—but not yet reassembled while waiting for additional fragments—from memory when a client disconnects from the network. An attacker can "preload" the device with some fragments and disconnect in anticipation of the victim connecting. If the victim uses fragmentation, "which appears uncommon in practice", the flaw can be used to exfiltrate data as well.

In both of these fragmentation cases, the fix is for devices to be more proactive than the standard requires. The device should ensure that all fragments are encrypted with the same key before allowing them to be reassembled and processed further. Likewise, fragments for incomplete frames should be flushed from memory when the client disconnects. Both seem like prudent "defensive programming" measures, at least in hindsight.

More flaws

The overview of the rest of the flaws shows how the different pieces can come together and lead to further mayhem:

Some routers will forward handshake frames to another client even when the sender hasn't authenticated yet. This vulnerability allows an adversary to perform the aggregation attack, and inject arbitrary frames, without user interaction.

Another extremely common implementation flaw is that receivers do not check whether all fragments belong to the same frame, meaning an adversary can trivially forge frames by mixing the fragments of two different frames.

Additionally, against several implementations it is possible to mix encrypted and plaintext fragments.

Finally, some devices don't support fragmentation or aggregation, but are still vulnerable to attacks because they process fragmented frames as full frames. Under the right circumstances this can be abused to inject packets.

Home networks are particularly vulnerable to the flaws and, given the spotty record of updates for many home-network devices, these kinds of problems may sadly persist for years to come. The demo (YouTube video) shows three examples of how the flaws can be exploited in that kind of environment:

First, the aggregation design flaw is abused to intercept sensitive information (e.g. the victim's username and password). Second, it's shown how an adversary can exploit insecure internet-of-things devices by remotely turning on and off a smart power socket. Finally, it's demonstrated how the vulnerabilities can be abused as a stepping stone to launch advanced attacks. In particular, the video shows how an adversary can take over an outdated Windows 7 machine inside a local network.

In all, 12 separate CVEs were issued for the flaws: three for the design flaws, four for vulnerabilities that "allow the trivial injection of plaintext frames in a protected Wi-Fi network", and five for other implementation bugs. The response to the flaws, which was coordinated by the Wi-Fi Alliance and the Industry Consortium for Advancement of Security on the Internet (ICASI), followed a somewhat different strategy in assigning the CVE numbers:

Although each affected codebase normally receives a unique CVE, the agreement between affected vendors was that, in this specific case, using the same CVE across different codebases would make communication easier. For instance, by tying one CVE to each vulnerability, a customer can now ask a vendor whether their product is affected by a specific CVE. Please note that this deviates from normal MITRE guidelines, and that this decision was made by affected vendors independently of MITRE, and that this in no way reflects any changes in how MITRE assigns CVEs.

Reading between the lines might indicate that MITRE and/or the CVE board were less than entirely pleased by that approach. Of late, the board has been rather protective of the CVE-issuance process. Balancing the needs of all of the disparate CVE users and consumers has been an ongoing problem, part of which we looked at in early April.

Meanwhile, the Linux networking developers, including Vanhoef, have come up with a patch set to address the vulnerabilities in the kernel. Some are being fixed in the mac80211 core, while others are being handled in the drivers. More fixes may be coming for other drivers and, potentially, the core as well. Beyond that, firmware updates are needed for some hardware; in some cases that firmware has already been updated to patch the vulnerabilities (silently, at least for Intel firmware).

FragAttacks is a whole passel of vulnerabilities, for sure, but it is a little unclear how serious a problem they will pose in the real world. That should not lead one to neglect updating devices, however. Unfortunately, WiFi implementations are often deployed in equipment that sees little or no maintenance—if it can be maintained at all. That reason alone should lead to more scrutiny and testing of the security of both the standards and the implementations. But it seems likely we will see another batch or three of WiFi holes as time goes on.

Comments (24 posted)

Pyodide: Python for the browser

By Jake Edge
May 11, 2021

Python in the browser has long been an item on the wish list of many in the Python community. At this point, though, JavaScript has well-cemented its role as the language embedded into the web and its browsers. The Pyodide project provides a way to run Python in the browser by compiling the existing CPython interpreter to WebAssembly and running that binary within the browser's JavaScript environment. Pyodide came about as part of Mozilla's Iodide project, which has fallen by the wayside, but Pyodide is now being spun out as a community-driven project.

History

Iodide, introduced in 2019, was an effort to create an in-browser notebook for scientific exploration and visualization, akin to Jupyter and JupyterLab. In that introductory post, the mismatch between JavaScript and scientific computing was noted—most of the existing ecosystem is Python-based—which is where the idea for Pyodide came from:

When we started thinking about making the web better for scientists, we focused on ways that we could make working with Javascript better, like compiling existing scientific libraries to WebAssembly and wrapping them in easy to use JS APIs. When we proposed this to Mozilla’s WebAssembly wizards, they offered a more ambitious idea: if many scientists prefer Python, meet them where they are by compiling the Python science stack to run in WebAssembly.

Getting that working seemed like it would be a big project, but it took only two weeks for Mike Droettboom to get Python running in an Iodide notebook. Over the following months, he and others added support for the most popular Python scientific packages (many of which, like the CPython interpreter, are implemented in C), such as NumPy, Matplotlib, pandas, SciPy, and scikit-learn. There were concerns about the performance of Pyodide, but performance turned out not to be a barrier for the use case:

Running the Python interpreter inside a Javascript virtual machine adds a performance penalty, but that penalty turns out to be surprisingly small — in our benchmarks, around 1x-12x slower than native on Firefox and 1x-16x slower on Chrome. Experience shows that this is very usable for interactive exploration.

A month after the introduction of Iodide, Droettboom posted more details about Pyodide. It is built using Emscripten, which provides a way to compile C and C++ to WebAssembly, along with "a compatibility layer that makes the browser feel like a native computing environment". That layer is necessary for a tool like Python:

If you were to just take this WebAssembly and load it in the browser, things would look very different to the Python interpreter than they do when running directly on top of your operating system. For example, web browsers don’t have a file system (a place to load and save files). Fortunately, emscripten provides a virtual file system, written in JavaScript, that the Python interpreter can use. By default, these virtual “files” reside in volatile memory in the browser tab, and they disappear when you navigate away from the page. (emscripten also provides a way for the file system to store things in the browser’s persistent local storage, but Pyodide doesn’t use it.)

By emulating the file system and other features of a standard computing environment, emscripten makes moving existing projects to the web browser possible with surprisingly few changes.

In order to do useful work within the browser environment, programs need access to the Document Object Model (DOM) of the browser, which is provided by the JavaScript APIs available there. That means Python and JavaScript code need to work together in various ways. Most of the basic types (e.g. numbers, arrays, strings) can be relatively easily mapped between the languages, but Python treats the object and dict types as distinct, while JavaScript conflates them to a certain extent: all objects can be treated as dicts. To handle that difference, object types in both languages and dict types in Python are represented by proxies in the other language; that allows each language full access to the other's data types.

As an example of these proxies, the post talks about accessing the DOM object from Python as follows:

    from js import document
    x = document.getElementById("myElement")

The document object is a JsProxy type that handles the dispatch to the proper API call without Pyodide being directly involved:

All of this happens through proxies that look up what the document object can do on-the-fly. Pyodide doesn’t need to include a comprehensive list of all of the Web APIs the browser has.

Given the focus on large data sets in the kinds of processing often done by NumPy and the like, there is a need to provide an efficient multi-dimensional array type that can largely be shared between JavaScript and Python. Copying huge arrays between the two would be a major performance hit, and would also likely exceed the amount of memory available to the browser. So an array type that is stored on a heap shared between the two was created; the small, language-specific dab of metadata describing the array is all that needs to be copied.

Present

Fast-forward two years, and Pyodide has made a good deal of progress; it released version 0.17 on April 22. At the same time, it announced that the project is becoming independent, complete with a GitHub repository, a governance system that draws from that of CPython, and a code of conduct based on Rust's and incorporating pieces from others. Meanwhile, the Iodide project has been discontinued, but Pyodide is more than simply a part of Iodide at this point: "Pyodide has attracted a large amount of interest from the community, remains actively developed, and is used in many projects outside of Mozilla."

The 0.17 release has a number of interesting features. Support for Python asyncio has been added, so that Python coroutines can run in the browser event loop; a JavaScript Promise can be awaited in Python and vice versa with Python awaitables. Error handling has also been upgraded so that exceptions generated by Python can be caught in JavaScript; that can be done in the other direction, as well.

A new version of Emscripten has been adopted, which has helped shrink the size of the binaries needed ("the SciPy package shrank dramatically from 92 MB to 15 MB"). It has also helped on the performance side of things. Using the latest toolchain (including the LLVM backend) results in a 25-30% improvement in run times. Overall, performance since the 2019 announcement has improved greatly: "Performance ranges between near native to up to 3 to 5 times slower, depending on the benchmark."

As with most (all?) projects, Pyodide is looking for people interested in contributing in various ways. "There are many ways to contribute, including code contributions, documentation improvements, adding packages, and using Pyodide for your applications and providing feedback." There is a roadmap with some ideas of plans for the future.

As Mozilla's Dan Callahan said in a PyCon 2018 keynote, Python may be getting left behind because the platforms that people are using are changing. While laptops, desktops, and servers support Python just fine, many people are only using phones and tablets where Python is not really present. But there is a unifying platform across all of those devices (and, presumably, others that come down the road): the web. If Python wants to stay relevant, it needs a reasonable browser story; Pyodide may be part of the right path toward that end.

Comments (13 posted)

A pair of memory-allocation improvements in 5.13

By Jonathan Corbet
May 6, 2021

Among the many changes merged for 5.13 can be found performance improvements throughout the kernel. This work does not always stand out the way that new features do, but it is vitally important for the future of the kernel overall. In the memory-management area, a couple of long-running patch sets have finally made it into the mainline; these provide a bulk page-allocation interface and huge-page mappings in the vmalloc() area. Both of these changes should make things faster, at least for some workloads.

Batch page allocation

The kernel's memory-allocation functions have long been optimized for performance and scalability, but there are situations where that work still has not achieved the desired results. One of those is high-speed networking. Back in 2016, networking developer Jesper Dangaard Brouer described the challenges that come with the fastest network links; when the system is processing tens of millions of packets per second, the time available to deal with any given packet is highly constrained. The kernel may only have a few hundred CPU cycles available to process each packet, and obtaining a page from the memory allocator may, by itself, require more than that. Using the entire CPU-time budget to allocate memory is not the way to get the best network performance.

At the time, Brouer asked for an API that would allow numerous pages to be allocated with a single call, hopefully with a much lower per-page cost. The networking code could then grab a pile of memory, and quickly hand out pages as needed. Nobody objected to the request at the time; it is well understood that batching operations can increase throughput in situations like this. But it took some time for that interface to come around.

Mel Gorman took on that task and put together a patch series, the sixth version of which was posted and taken into the -mm tree in March. It adds two new interfaces for the allocation of single ("order-0") pages, starting with:

    unsigned long alloc_pages_bulk(gfp_t gfp, unsigned long nr_pages,
    				   struct list_head *list);

The allocation flags to use are stored in gfp, nr_pages is the number of pages the caller would like to allocate, and list is a list onto which the allocated pages are to be put. The return value will be the number of pages actually allocated, which could be less than nr_pages for any of a number of reasons. The page structures for the allocated pages are assembled into a list (using the lru entry) and attached to the provided list.

Returning the pages in a linked list may seem a little strange, especially since "linked lists" and "scalability" tend not to go together well. The advantage of this approach is that it does not require allocating any memory to track the allocated pages. Since the list is unlikely to be traversed (there is never a need to walk through the list as a whole), the scalability issues do not apply here. Still, this interface may seem awkward to some. For those who would rather supply an array to be filled with pointers, a different interface is available:

    unsigned long alloc_pages_bulk_array(gfp_t gfp, unsigned long nr_pages,
    					 struct page **page_array);

This function will store pointers to the page structures for the allocated pages into page_array, which should really be at least nr_pages elements long or unpleasant side effects may appear. Interestingly, pages will only be allocated for NULL entries in page_array, so alloc_pages_bulk_array() can be used to refill a partially emptied array of pages. This array, thus, must be zeroed before the first call to alloc_pages_bulk_array().

For users needing more control, the function under the hood that does the work of both alloc_pages_bulk() and alloc_pages_bulk_array() is:

    unsigned int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
				    nodemask_t *nodemask, int nr_pages,
				    struct list_head *page_list,
				    struct page **page_array);

The additional parameters control the location of the allocated pages on a NUMA system; preferred_nid is the node to be used if possible, while nodemask, if present, indicates the allowable set of nodes. Exactly one of page_list and page_array should be non-NULL and will be used to return the allocated pages. If both are supplied, page_array will be used and page_list will be ignored.

Benchmarks included with the patch set show a nearly 13% speed increase for the high-speed networking case, and something closer to 500% for a Sun RPC test case. Gorman noted, though, that: "Both potential users in this series are corner cases (NFS and high-speed networks) so it is unlikely that most users will see any benefit in the short term." The Sun RPC and networking uses have gone directly into 5.13; others are likely to follow.

Huge-page vmalloc()

Most kernel memory-allocation functions return pointers to either pages or addresses in the kernel's address map; either way, the addresses correspond to the physical address of the memory that has been allocated. That works well for small allocations (one page or below), but physical memory allocations become harder to satisfy as the size of the allocation increases due to the fragmentation of memory over time. For this reason, much work has been done over the years to avoid the need for multi-page allocations whenever possible.

Sometimes, though, only a large, contiguous region will do; the vmalloc() interface exists to serve that need. The pages allocated by vmalloc() will (probably) be scattered around physical memory, but they will be made virtually contiguous by mapping them into a special part of the kernel's address space. Traditionally, excessive use of vmalloc() was discouraged due to the costs of setting up the mappings and the small size of the dedicated address space on 32-bit systems. The address-space limitation is not a problem on 64-bit systems, though, and use of vmalloc() has been growing over time.

Addresses in the vmalloc() range are slower to use than addresses in the kernel's direct mapping, though, because the latter are mapped using huge pages whenever possible. That reduces pressure on the CPU's translation lookaside buffer (TLB), which is used to avoid resolving virtual addresses through the page tables. Mappings in the vmalloc() range use small ("base") pages, which are harder on the TLB.

As of 5.13, though, vmalloc() can use huge pages for suitably large allocations thanks to this patch from Nicholas Piggin. For vmalloc() allocations that are larger than the smallest huge-page size, an attempt will be made to use huge pages rather than base pages. That can improve performance significantly for some kernel data structures, as Piggin described:

Several of the most [used] structures in the kernel (e.g., vfs and network hash tables) are allocated with vmalloc on NUMA machines, in order to distribute access bandwidth over the machine. Mapping these with larger pages can improve TLB usage significantly, for example this reduces TLB misses by nearly 30x on a `git diff` workload on a 2-node POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%, due to vfs hashes being allocated with 2MB pages.

There are some potential disadvantages, including wasting larger amounts of memory due to internal fragmentation; a 3MB allocation may be placed into two 2MB huge pages, for example, leaving 1MB of unused memory at the end. It is also possible that the distribution of memory across NUMA systems may be less balanced when larger pages are used. Some vmalloc() callers may be unprepared for huge-page allocations, so they are not done everywhere; in particular, the module loader, which uses vmalloc() and could probably benefit from huge pages, does not currently use them.

Still, the advantages of using huge pages for vmalloc() would appear to outweigh the disadvantages, at least in the testing that has been done so far. There is a new command-line parameter, nohugevmalloc=, which can be used to disable this behavior if need be.

Most users are unlikely to notice any amazing speed improvements resulting from these changes. But they are an important part of the ongoing effort to optimize the kernel's behavior wherever possible; a long list of changes like this is the reason why Linux performs as well as it does.

Comments (14 posted)

Noncoherent DMA mappings

By Jonathan Corbet
May 7, 2021

While it is sometimes possible to perform I/O by moving data through the CPU, the only way to get the required level of performance is usually for devices to move data directly to and from memory. Direct memory access (DMA) I/O has been well supported in the Linux kernel since the early days, but there are always ways in which that support can be improved, especially when hardware adds some challenges of its own. The somewhat confusingly named "non-contiguous" DMA API that was added for 5.13 shows the kinds of things that have to be done to get the best performance on current systems.

DMA, of course, presents a number of interesting race conditions that can arise in the absence of an agreement between the CPU and the device over who controls a range of memory at any given time. But there is another problem that comes up as well. CPUs aggressively cache memory contents to avoid the considerable expense of actually going to memory for every reference. But if a CPU has cached data that is subsequently overwritten by DMA, the CPU could end up reading incorrect data from the cache, resulting in data corruption. Similarly, if the cache contains data written by the CPU that has not yet made it to memory, that data really needs to be flushed out before the device accesses that memory or bad things are likely to happen.

The x86 architecture makes life relatively easy (in this regard, at least) for kernel developers by providing cache snooping; CPU caches will be invalidated if a device is seen to be writing to a range of memory, for example. This "cache-coherent" behavior means that developers need not worry about cache contents corrupting their data. Other architectures are not so forgiving. The Arm architecture, among others, will happily retain cache contents that no longer match the memory they are allegedly caching. On such systems, developers must take care to manage the cache properly as control of a DMA buffer is passed between the device and the CPU.

There are a number of ways to handle this task, but life gets harder if a DMA buffer also requires extensive access by either the kernel or user space. One approach that is taken at times is to make that range of memory uncached. A nonexistent cache cannot corrupt data, but it can make it clear why caches exist in the first place; accessing uncached memory can be extremely slow. If at all possible, it is better to avoid the uncached mode.

The new API is a way to do that for some sorts of devices. A driver can allocate a DMA buffer using:

    struct sg_table *dma_alloc_noncontiguous(struct device *dev, size_t size,
					     enum dma_data_direction direction,
					     gfp_t gfp, unsigned long attrs);

This function will attempt to allocate size bytes of memory for DMA by dev in the given direction using the given gfp flags. That buffer may not be physically contiguous in system memory, but the returned scatter/gather table will be set up with a single, contiguous range for the DMA device. An I/O memory-management unit (IOMMU) is clearly required for the system to be able to arrange that; it's an important feature, though, since some devices cannot do scatter/gather I/O without IOMMU assistance. The only accepted value for attrs is DMA_ATTR_ALLOC_SINGLE_PAGES, which is a hint that it's not worthwhile for the DMA-mapping code to try to use huge pages for this buffer.

This buffer may well not be cache-coherent. As with other noncoherent mappings, cache management must be done by hand. So, for example, a call to dma_sync_sgtable_for_device() must be done before handing the memory over to the device for I/O; it will make sure that any dirty cache lines will be flushed out to the memory, among other things. To take control back from the device, dma_sync_sgtable_for_cpu() must be called.

The buffer can be freed with:

    void dma_free_noncontiguous(struct device *dev, size_t size,
    				struct sg_table *sgtable,
				enum dma_data_direction dir);

The parameters must match those used when the buffer was allocated.

This buffer is not directly accessible by the CPU when returned. If the kernel needs a mapping into kernel space, that can be managed with:

    void *dma_vmap_noncontiguous(struct device *dev, size_t size,
    				 struct sg_table *sgtable);
    void dma_vunmap_noncontiguous(struct device *dev, void *vaddr);

The existence of a kernel mapping does not make cache-coherency issues go away, though. If the kernel may have written to this buffer, flush_kernel_vmap_range() must be called to ensure any cached data makes it to memory before handing that memory to a device. Similarly, invalidate_kernel_vmap_range() must be called to remove any cached data for memory that may have been written by the device.

Finally, it is possible to map the buffer into user space with a call to:

    int dma_mmap_noncontiguous(struct device *dev, struct vm_area_struct *vma,
			       size_t size, struct sg_table *sgt);

This will normally be done in response to an mmap() call by the application, which can munmap() the memory when it's no longer needed. Needless to say, if user space accesses this buffer when it is in the device's hands, the results may be less than optimal. In cases where the ownership of the buffer is managed explicitly in user space (such as with the Video4Linux2 API, for example), access at the wrong time should not be a problem.

Also merged in 5.13 was a patch to the uvcvideo driver to switch it from using coherent mappings to the new API. According to the changelog, this change can, on non-cache-coherent systems, improve performance by a factor of 20, which seems worth the effort. Chances are that other drivers will make the switch at some point. It's the kind of change that is not immediately evident to users, but which makes the system perform much better in the end.

Comments (8 posted)

The second half of the 5.13 merge window

By Jonathan Corbet
May 10, 2021

By the time the last pull request was acted on and 5.13-rc1 was released, a total of 14,231 non-merge commits had found their way into the mainline. That makes the 5.13 merge window larger than the entire 5.12 development cycle (13,015 commits) and just short of all of 5.11 (14,340). In other words, 5.13 looks like one of the busier development cycles we have seen for a little while. About 6,400 of these commits came in after the first-half summary was written, and they include a number of significant new features.

Changes merged in the second half of the 5.13 merge window include:

Architecture-specific

  • The arm64 architecture has settled on SPARSEMEM_VMEMMAP as the only supported memory-management model.
  • 32-bit PowerPC systems now have support for extended BPF and the KFENCE debugging system.
  • PowerPC systems now support time namespaces.
  • The RISC-V architecture has gained support for kexec, crash dumps via kexec, execute-in-place, and kprobes.
  • s390 systems can now do stack-offset randomization in system-call handlers.

Core kernel

  • BPF tracing programs may now make use of task-local storage, which provides a number of performance benefits over using maps.
  • There is a new mechanism by which BPF programs can call kernel functions directly; its initial use is in the implementation of TCP congestion-control algorithms. Functions must be explicitly whitelisted to be made available for calling from BPF. Some information can be found in this commit.
  • The function tracer (ftrace) has a new func-no-repeats option that causes multiple, consecutive calls to a function to be coalesced into a simple count in the output.
  • User-space page-fault handling with userfaultfd() can now manage minor faults (those where a valid page exists but a valid page-table entry does not). See this commit for information on this feature and how it is meant to be used.
  • The old (but dangerous) /dev/kmem special file, which provided access to the kernel's address space, has been removed at last.

Filesystems and block I/O

  • The exFAT filesystem has gained support for the FITRIM ioctl() command, which is used to inform the storage device about blocks that are no longer used.
  • The XFS filesystem can now remove space from the last allocation group in the filesystem; this is a first step toward the ability to shrink XFS filesystems in general.
  • There is a new quota-related system call:

        int quotactl_path(unsigned int cmd, const char *mountpoint, qid_t id,
                          void *addr);
    
    Its behavior is similar to quotactl(), except that it expects the path to the mount point of a filesystem rather than the block special device holding that filesystem.
  • The fanotify mechanism has always restricted a number of features to privileged users, but some of those restrictions have been removed for 5.13. See this commit for a cryptic description of what's allowed and what is not.
  • The ext4 filesystem will now overwrite directory entries when files are deleted. Ext4 can now also handle filesystems that use both case folding and encryption.

Hardware support

  • Human-interface devices: FTDI FT260 USB HID to I2C host bridges, Microsoft Surface system aggregator module HID transports, Azoteq IQS626A capacitive touch controllers, MStar msg2638 touchscreens, Ilitek I2C 213X/23XX/25XX/Lego series touch controllers, and Hycon hy46xx touchscreens.
  • Miscellaneous: Silicon Labs CP2615 USB I2C adapters, HiSilicon I2C controllers, Unisoc IOMMUs, Intel Data Accelerators performance monitors, Toshiba Visconti pulse-width modulators, MediaTek Gen3 PCIe controllers, and SiFive FU740 PCIe host controllers.
  • Networking: Marvell 88X2222 PHYs, Broadcom BCM6368 MDIO bus multiplexers, Actions Semi Owl Ethernet MACs, Microsoft Azure network adapters, NXP C45 TJA11XX PHYs, and Microchip KSZ8863 and KSZ8873 switches.
  • Pin control: Broadcom BCM63xx GPIO controllers, Mediatek MT8195 pin controllers, Xilinx ZynqMP pin controllers, and Realtek Otto GPIO controllers.
  • Sound: Realtek RT1316 codecs, Realtek RT711 and RT715 SDCA codecs, Realtek RT1019 mono class-D audio amplifiers, and MediaTek MT6359 ACCDET jack controllers.
  • Virtio: new virtio drivers have been added for Bluetooth controllers and sound devices.

Security-related

  • After years of work and 34 revisions, the Landlock security module has finally been merged.

    The goal of Landlock is to enable to restrict ambient rights (e.g. global filesystem access) for a set of processes. Because Landlock is a stackable LSM, it makes possible to create safe security sandboxes as new security layers in addition to the existing system-wide access-controls.

    See landlock.io for more information.

Internal kernel changes

  • vmalloc() can now create huge-page mappings.
  • There are some new functions for allocating batches of pages in a single call: alloc_pages_bulk() and alloc_pages_bulk_array(). Documentation is scarce, but some information can be found in this commit and this one.
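  	A caller-side sketch of the array variant, based on the 5.13 interface (which fills the NULL slots of a caller-supplied array and returns the number of populated entries), might look like:

        /* Sketch: allocate up to 16 pages in a single call.  The
         * function may return fewer pages than were requested. */
        struct page *pages[16] = { NULL };
        unsigned long nr;

        nr = alloc_pages_bulk_array(GFP_KERNEL, ARRAY_SIZE(pages), pages);
        /* pages[0..nr-1] are now valid and must eventually be freed. */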
  • See also this article for more information on both of the above changes.

One feature that failed to make it this time around is the proposed memfd_secret() system call, which creates areas of memory that are hidden from the rest of the system (including the kernel). Andrew Morton expressed doubts about the utility of the feature, which brought out a few potential users saying that they would like to have it. Morton now appears to be convinced but, by the time that happened, it was too late for 5.13. So memfd_secret() looks set to make its appearance in 5.14 instead.

If the 5.13 kernel is released after seven -rc cycles, it will come out on June 27; if a -rc8 is required, 5.13 will supply an added cause for fireworks and celebration in the US on July 4. There will need to be a lot of testing and fixing between now and then and, if past experience holds, approximately 2,000 more commits. The merge window is done, but there is still a fair amount of work to be done to get the next kernel release out.

Comments (4 posted)

Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

  • Briefs: Linux 5.13-rc1; DragonFly BSD 6.0; coreboot 4.14; eBPF for Windows; IEEE on UMN; Quotes; ...
  • Announcements: Newsletters; conferences; security updates; kernel patches; ...

Copyright © 2021, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds