|
|

Kernel development

Brief items

Kernel release status

The current development kernel is 4.11-rc2, released on March 12. Linus said: "I think we're in fine shape for this stage in the development kernel, it shouldn't be particularly scary to just say 'I'll be a bit adventurous and test an rc2 kernel'. Yes, it's early rc time still, but go on, help us make sure we're doing ok."

The March 14 4.11 regression report shows nine known problems.

Stable updates: 4.10.2, 4.9.14, and 4.4.53 were released on March 12, followed by 4.10.3, 4.9.15, and 4.4.54 on March 15.

Comments (none posted)

Quotes of the week

Mis-spelling someone else's email can be cut and paste; mis-spelling your own might be the early indications of an identity crisis.
James Bottomley

This is why you _must_ get anything you're doing discussed in upstream first. Your internal teams simply do not have design authority on stuff like that.
Daniel Vetter (also available in T-shirt form).

Comments (none posted)

Kernel podcast

The March 13 kernel podcast from Jon Masters is out. "In this week’s kernel podcast: Linus Torvalds announces Linux 4.11-rc2 (including pre-enablement for Intel 5-level paging), VMA based swap readahead, and ongoing development ahead of the next cycle."

Comments (none posted)

Kernel development news

Five-level page tables

By Jonathan Corbet
March 15, 2017
Near the beginning of 2005, the merging of the four-level page tables patches for 2.6.10 was an early test of the (then) new kernel development model. It demonstrated that the community could indeed merge fundamental changes and get them out quickly to users — a far cry from the multi-year release cycles that prevailed before the 2.6.0 release. The merging of five-level page tables (outside of the merge window) for 4.11-rc2, instead, barely raised an eyebrow. It is, however, a significant change that is indicative of where the computing industry is going.

A page table, of course, maps a virtual memory address to the physical address where the data is actually stored. It is conceptually a linear array indexed by the virtual address (or, at least, by the page-frame-number portion of that address) and yielding the page-frame number of the associated physical page. Such an array would be large, though, and it would be hugely wasteful. Most processes don't use the full available virtual address space even on 32-bit systems, and they don't use even a tiny fraction of it on 64-bit systems, so the address space tends to be sparsely populated and, as a result, much of that array would go unused.
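
For a sense of scale: even the 48-bit virtual address space actually used on x86-64 (described below) contains 2^36 4KB pages, so a flat array of eight-byte entries covering it would occupy 512GB per address space, nearly all of it describing addresses that are never mapped.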

The solution to this problem, as implemented in the hardware for decades, is to turn the linear array into a sparse tree representing the address space. The result is something that looks like this:

[Four-level page tables]

The row of boxes across the top represents the bits of a 64-bit virtual address. To translate that address, the hardware splits the address into several bit fields. Note that, in the scheme shown here (corresponding to how the x86-64 architecture uses addresses), the uppermost 16 bits are discarded; only the lower 48 bits of the virtual address are used. Of the bits that are used, the nine most significant (bits 39-47) are used to index into the page global directory (PGD); a single page for each address space. The value read there is the address of the page upper directory (PUD); bits 30-38 of the virtual address are used to index into the indicated PUD page to get the address of the page middle directory (PMD). With bits 21-29, the PMD can be indexed to get the lowest level page table, just called the PTE. Finally, bits 12-20 of the virtual address will, when used to index into the PTE, yield the physical address of the actual page containing the data. The lowest twelve bits of the virtual address are the offset into the page itself.
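
As a rough illustration of the arithmetic (this is standalone user-space code, not the kernel's pgd_index()/pud_index() macros, which do the equivalent masking and shifting), the decomposition of a 48-bit x86-64 virtual address looks like this:

    /* Illustration only: split an x86-64 virtual address into its
     * four-level paging indices.  Each level is indexed by nine bits,
     * and the final twelve bits are the offset within the 4KB page. */
    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12
    #define PTRS_PER_LEVEL 512            /* 2^9 entries per table */

    int main(void)
    {
        uint64_t vaddr = 0x00007f1234567abcULL;   /* arbitrary address */

        unsigned int pgd_index = (vaddr >> 39) & (PTRS_PER_LEVEL - 1);
        unsigned int pud_index = (vaddr >> 30) & (PTRS_PER_LEVEL - 1);
        unsigned int pmd_index = (vaddr >> 21) & (PTRS_PER_LEVEL - 1);
        unsigned int pte_index = (vaddr >> PAGE_SHIFT) & (PTRS_PER_LEVEL - 1);
        uint64_t     offset    = vaddr & ((1ULL << PAGE_SHIFT) - 1);

        printf("PGD %u, PUD %u, PMD %u, PTE %u, offset 0x%llx\n",
               pgd_index, pud_index, pmd_index, pte_index,
               (unsigned long long)offset);
        return 0;
    }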

At any level of the page table, the pointer to the next level can be null, indicating that there are no valid virtual addresses in that range. This scheme thus allows large subtrees to be missing, corresponding to ranges of the address space that have no mapping. The middle levels can also have special entries indicating that they point directly to a (large) physical page rather than to a lower-level page table; that is how huge pages are implemented. A 2MB huge page would be found directly at the PMD level, with no intervening PTE page, for example.
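
That is where the huge-page sizes come from: a single PMD entry covers 9 + 12 = 21 bits of address space, or 2MB, while a single PUD entry covers 2^30 bytes, giving the 1GB huge pages one level further up.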

One can quickly see that the process of translating a virtual address is going to be expensive, requiring several fetches from main memory. That is why the translation lookaside buffer (TLB) is so important for the performance of the system as a whole, and why huge pages, which require fewer lookups, also help.

It is worth noting that not all systems run with four levels of page tables. 32-bit systems use three or even two levels, for example. The memory-management code is written as if all four levels were always present; some careful coding ensures that, in kernels configured to use fewer levels, the code managing the unused levels is transparently left out.
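
The folding trick is, in essence, to make the helpers for a configured-out level collapse into identity operations, so that generic code can still "walk" every level. A minimal sketch of the idea follows; it is illustrative only, not the kernel's actual asm-generic/pgtable-nopmd.h header:

    /* Sketch: with the PMD level folded away, each PUD entry stands in
     * for the single PMD "table" below it.  pmd_offset() becomes a
     * cast, and the compiler optimizes the extra level out entirely. */
    typedef struct { unsigned long pud; } pud_t;
    typedef struct { pud_t pud; } pmd_t;      /* a folded pmd is a pud */

    static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
    {
        (void)address;        /* only one "PMD" per PUD when folded */
        return (pmd_t *)pud;
    }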

Back when four-level page tables were merged, your editor wrote: "Now x86-64 users can have a virtual address space covering 128TB of memory, which really should last them for a little while." The value of "a little while" can now be quantified: it would appear to be about 12 years. Though, in truth, the real constraint appears to be the 64TB of physical memory that current x86-64 processors can address; as Kirill Shutemov noted in the x86 five-level page-table patches, there are already vendors shipping systems with that much memory installed.
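
The numbers are straightforward: the 48-bit virtual address space is split evenly between user space and the kernel, so user space gets 2^47 bytes, or 128TB, while the 64TB physical limit corresponds to the 46 bits of physical address implemented by current processors.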

As is so often the case in this field, the solution is to add another level of indirection in the form of a fifth level of page tables. The new level, called the "P4D", is inserted between the PGD and the PUD. The patches adding this level were merged for 4.11-rc2, even though there is, at this point, no support for five-level paging on any hardware. While the addition of four-level page tables caused a bit of nervousness, the five-level patches merged were described as "low risk". At this point, the memory-management developers have a pretty good handle on the changes that need to be made to add another level.

The patches adding five-level support for upcoming Intel processors are currently slated for 4.12. Systems running with five-level paging will support 52-bit physical addresses and 57-bit virtual addresses. Or, as Shutemov put it: "It bumps the limits to 128 PiB of virtual address space and 4 PiB of physical address space. This 'ought to be enough for anybody'." The new level also allows the creation of 512GB huge pages.
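
Again, the arithmetic is simple: 57 bits of virtual address give 2^57 bytes (128 PiB), and 52 bits of physical address give 2^52 bytes (4 PiB). The 512GB huge pages are 2^39 bytes each, the next nine-bit step up from today's 1GB huge pages.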

The current patches have a couple of loose ends to take care of. One of those is that Xen will not work on systems with five-level page tables enabled; it will continue to work on four-level systems. There is also a need for a boot-time flag to allow switching between four-level and five-level paging so that distributors don't have to ship two different kernel binaries.

Another interesting problem is described at the end of the patch series. It would appear that there are programs out there that "know" that only the bottom 48 bits of a virtual address are valid. They take advantage of that knowledge by encoding other information in the uppermost bits. Those programs will clearly break if those bits suddenly become part of the address itself. To avoid such problems, the x86 patches in their current form will not allocate memory in the new address space by default. An application that needs that much memory, and which does not play games with virtual addresses, can provide an address hint above the boundary in a call to mmap(), at which point the kernel will understand that mappings in the upper range are accessible.
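
What that opt-in might look like, in a minimal sketch: the hint address used here is arbitrary, and the only important property (per the patch series) is that it lies above the old 47-bit boundary. On a four-level kernel, the hint simply cannot be satisfied and the mapping lands in the lower range as usual.

    /* Sketch: opt in to the larger address space by passing a hint
     * above the old 128TB boundary (assumes a 64-bit build).  Without
     * MAP_FIXED, the hint is only a suggestion, so the same call works
     * unchanged on four-level hardware. */
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 1UL << 30;                 /* 1GB, arbitrary */
        void *hint = (void *)(1UL << 48);       /* above the 47-bit limit */

        void *p = mmap(hint, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            perror("mmap");
        else
            printf("mapped at %p\n", p);
        return 0;
    }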

Anybody wanting to play with the new mode can do so now with QEMU, which understands five-level page tables. Otherwise it will be a matter of waiting for the processors to come out — and the funds to buy a machine with that much memory in it. When the hardware is available, the kernel should be ready for it.

Comments (13 posted)

A deadline scheduler update

By Jonathan Corbet
March 14, 2017

Linaro Connect
The deadline CPU scheduler has come a long way, Juri Lelli said in his 2017 Linaro Connect session, but there is still quite a bit of work to be done. While this scheduler was originally intended for realtime workloads, there is reason to believe that it is well suited for other settings, including the embedded and mobile world. In this talk, he gave a summary of what the deadline scheduler provides now and the changes that are envisioned for the near (and not-so-near) future.

The deadline scheduler was merged in the 3.14 development cycle. It adds a realtime scheduling policy that, in many ways, is more powerful than traditional, priority-based scheduling. It allows for the specification of explicit latency constraints and avoids starvation of processes by design. The scheduler has better information about the constraints of the workload it is running and can thus make better decisions.
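
Concretely, a task asks for a deadline reservation with the sched_setattr() system call, specifying its worst-case runtime, relative deadline, and period in nanoseconds. A minimal sketch follows; there is no glibc wrapper, so the raw syscall is used (assuming <sys/syscall.h> defines SYS_sched_setattr), the sched_attr layout follows the kernel's UAPI definition, and the numbers are arbitrary:

    /* Minimal sketch: request a SCHED_DEADLINE reservation of 10ms of
     * CPU time, to be available within a 30ms deadline, every 100ms. */
    #define _GNU_SOURCE
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef SCHED_DEADLINE
    #define SCHED_DEADLINE 6          /* from include/uapi/linux/sched.h */
    #endif

    struct sched_attr {               /* layout per the kernel's UAPI */
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;       /* all three in nanoseconds */
        uint64_t sched_deadline;
        uint64_t sched_period;
    };

    int main(void)
    {
        struct sched_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size           = sizeof(attr);
        attr.sched_policy   = SCHED_DEADLINE;
        attr.sched_runtime  =  10 * 1000 * 1000;
        attr.sched_deadline =  30 * 1000 * 1000;
        attr.sched_period   = 100 * 1000 * 1000;

        if (syscall(SYS_sched_setattr, 0, &attr, 0))
            perror("sched_setattr");  /* needs appropriate privileges */

        /* periodic realtime work would run here */
        return 0;
    }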

The deadline scheduler is based on the earliest deadline first (EDF) algorithm, under which the process with the first-expiring deadline is the one that is chosen to run. EDF is enhanced with the constant bandwidth server (CBS) algorithm (described in detail in this article), which prevents a process that is unable to run for much of its period from interfering with others. Essentially, CBS says that a deadline process must use its CPU-time reservation over the course of its scheduling period, rather than procrastinating and expecting the full reservation to be available right before the deadline. The result is a scheduler that provides strong temporal isolation for tasks, where no process can prevent another from satisfying its deadlines.

At the moment, the mobile and embedded development community is putting a lot of effort into energy-aware scheduling. This work has a valuable goal — making scheduling decisions that minimize a system's energy use — but it has proved hard to get upstream, though it has been merged into the Android common kernel. It may be, Lelli said, that deadline scheduling will turn out to be a better fit for energy-conscious workloads in the end.

A new feature under development is bandwidth reclaiming. One of the core features of deadline scheduling is that, when a process exceeds its CPU-time reservation (its CPU "bandwidth"), the scheduler will simply throttle it until its next scheduling period. This throttling is needed to ensure that the process cannot interfere with other processes on the system, but it can be a problem if a process occasionally has a legitimate need for more than its allotted time. Bandwidth reclaiming just does the obvious thing: it gives that process more time if it's not needed by other processes in the system.

What is perhaps less obvious is the determination of how much CPU time is not actually needed. This is done with the GRUB ("greedy utilization of unused bandwidth") algorithm described in this paper [PDF]. In short, GRUB tracks how much of the available CPU time is actually being used by the set of running deadline tasks and, from that, forms an estimate of how much CPU time will go unused. With that estimate in place, handing out some of the spare time to a deadline process that finds itself in need is a relatively straightforward business.
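
In its original form, the GRUB rule is simple: while a deadline task runs, its remaining budget is decremented not at the wall-clock rate (dq = -dt) but in proportion to the total utilization of the currently active deadline tasks (dq = -Uact dt). When Uact is well below one, budgets drain more slowly and a task can run past its nominal reservation; when the system is fully committed, the accounting degenerates to the usual CBS behavior.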

CPU-frequency scaling is an important tool in the power-management portfolio, but it has traditionally not played well with realtime scheduling algorithms, including deadline scheduling. In current kernels, it is assumed that realtime tasks need the full power of the CPU, so the presence of such tasks will cause the CPU to run at full speed. That may be wasteful, though, if the deadline processes running do not need that much CPU time.

Fixing that problem requires a number of changes, starting with the fact that the deadline scheduler itself assumes that the CPU will be running at full speed. The scheduler needs to be fixed so that it can scale reservation times to match the current speed of the processor. This awareness needs to be expanded to heterogeneous multiprocessing systems (such as big.LITTLE, where not all processors in the system are equally fast) as well.
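
A concrete example: a task that needs 10ms of CPU time in every 100ms period when the processor runs at full speed needs a 20ms reservation when the processor runs at half speed. Without that scaling, the task would be throttled before its work was done.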

Once that is done, it would be beneficial to be able to drive CPU-frequency selection from the deadline scheduler as well. The CFS scheduler used for normal tasks uses the per-entity load tracking mechanism to help with frequency selection, but the deadline scheduler currently just pushes the frequency to the maximum. Once the bandwidth reclaiming code is in, it will become possible to measure and use the actual load added by deadline tasks. At that point, a CPU frequency that efficiently gets all of the necessary work done can be chosen.

Of course, there are always a number of fiddly details to take care of. For example, on ARM systems, CPU-frequency changes are done in a separate worker thread. For CPU scaling and deadline scheduling to work together, a way for that thread to preempt deadline tasks (which are normally not preemptable) will need to be found.

The deadline scheduler currently works at the level of individual processes; it does not work with control groups. But there are times when it might make sense to give a deadline reservation to a group of processes. Lelli cited virtual-machine threads and rendering pipelines as a couple of candidate workloads for group deadline scheduling. The implementation envisioned here would be a sort of two-level hybrid hierarchy. At the top level, the EDF algorithm would be used to pick the next group to execute; within that group, though, normal realtime scheduling (FIFO or round-robin) would be used instead. Once this feature works, he said, it could supplant the longstanding realtime throttling mechanism.

Looking further ahead, Lelli said there is a scheme to extend the bandwidth reclaiming mechanism to allow priority demotion. Once a process exceeds its reservation, it will continue to run, but as a normal process without realtime priority. That priority will be restored once the next scheduling period starts. There is also a strong desire to have fully energy-aware scheduling in the deadline scheduler.

A more distant wishlist item is support for single-CPU affinity. The priority inheritance mechanism could also stand some improvements. Currently, a task that blocks a deadline task will inherit that task's deadline. Replacing that with an algorithm like the multiprocessor bandwidth inheritance protocol [PDF] would be desirable. There is also a wish for a dynamic feedback mechanism that could adjust a process's reservation based on its observed needs. But, for the time being, he said, nobody is actually working on these items.

The video of this session is available.

[Thanks to Linaro and the Linux Foundation for funding your editor's travel to Connect.]

Comments (2 posted)

Patches and updates

Kernel trees

Linus Torvalds Linux 4.11-rc2 Mar 12
Greg KH Linux 4.10.3 Mar 15
Greg KH Linux 4.10.2 Mar 12
Greg KH Linux 4.9.15 Mar 15
Greg KH Linux 4.9.14 Mar 12
Greg KH Linux 4.4.54 Mar 15
Greg KH Linux 4.4.53 Mar 12
Steven Rostedt 4.4.50-rt63 Mar 10
Julia Cartwright 4.1.38-rt46 Mar 10
Steven Rostedt 3.18.48-rt53 Mar 08
Steven Rostedt 3.18.48-rt54 Mar 10
Jiri Slaby Linux 3.12.71 Mar 10
Steven Rostedt 3.12.70-rt95 Mar 10
Steven Rostedt 3.10.105-rt119 Mar 08
Steven Rostedt 3.10.105-rt120 Mar 10
Steven Rostedt 3.2.86-rt124 Mar 08

Architecture-specific

Core kernel code

Development tools

Device drivers

Elaine Zhang rk808: Add RK805 support Mar 09
Steve Longerbeam i.MX Media Driver Mar 09
Jan Glauber Cavium MMC driver Mar 10
Geert Uytterhoeven Add HD44780 Character LCD support Mar 10
Jaghathiswari Rankappagounder Natarajan Support for ASPEED AST2400/AST2500 PWM and Fan Tach driver Mar 10
zhichang.yuan LPC: legacy ISA I/O support Mar 13
M'boumba Cedric Madianga Add STM32 DMAMUX support Mar 13
M'boumba Cedric Madianga Add STM32 MDMA driver Mar 13
Fabien Dessenne STM32 CRC crypto driver Mar 14
Yannick Fertre STM32 LCD-TFT display controller Mar 15
sean.wang@mediatek.com net-next: dsa: add Mediatek MT7530 support Mar 14
Bartosz Golaszewski ata: ahci-dm816: new driver Mar 13
Stanimir Varbanov Qualcomm video decoder/encoder driver Mar 13
Andrey Smirnov GPCv2 power gating driver Mar 14

Device driver infrastructure

Kishon Vijay Abraham I PCI: Support for configurable PCI endpoint Mar 09
Jens Wiklander generic TEE subsystem Mar 10

Documentation

Michael Kerrisk (man-pages) man-pages-4.10 is released Mar 15

Filesystems and block I/O

Memory management

Networking

Security-related

Virtualization and containers

Stefano Stabellini Xen transport for 9pfs frontend driver Mar 08
James Hogan KVM: MIPS: Add VZ support Mar 14

Miscellaneous

Matthew Wilcox memset_l and memfill Mar 08
Jozsef Kadlecsik ipset 6.32 released Mar 12

Page editor: Jonathan Corbet


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds