Kernel development [LWN.net]

Kernel release status

The current development kernel is 4.9-rc2, released on October 23. Linus is asking for people to test one feature in particular: "My favorite new feature that I called out in the rc1 announcement (the virtually mapped stacks) is possibly implicated in some crashes that Dave Jones has been trying to figure out, so if you want to be helpful and try to see if you can give more data, please make sure to enable CONFIG_VMAP_STACK."

The latest 4.9 regression report shows 14 known problems.

Stable updates: 4.8.3, 4.7.9, and 4.4.26, containing the "Dirty COW" fix, were released on October 20. 4.8.4, 4.7.10, and 4.4.27 followed on October 22. Note that 4.7.10 is the end of the 4.7.x series.

The 4.8.5 (140 changes) and 4.4.28 (112 changes) updates are in the review process as of this writing; they can be expected on or after October 28.

Quotes of the week

Being a Linux kernel maintainer has been my proudest professional accomplishment, spanning the last 19 years. But now we have a surfeit of excellent hackers, and I can hand this over without regret.

— Rusty Russell moves on

We're rapidly moving away from the world where a page cache is needed to give applications decent performance. DAX doesn't have a page cache, applications wanting to use high IOPS (hundreds of thousands to millions) storage are using direct IO, because the page cache just introduces latency, memory usage issues and non-deterministic IO behaviour.

If we try to make the page cache the "one true IO optimisation source" then we're screwing ourselves because the incoming IO technologies simply don't require it anymore.

— Dave Chinner

Oh, and the patch is obviously entirely untested. I wouldn't want to ruin my reputation by *testing* the patches I send out. What would be the fun in that?

— Linus Torvalds, who subsequently tested the patch anyway (thanks to Borislav Petkov)

The initial bus1 patch posting

The bus1 message-passing mechanism is the successor to the "kdbus" project; it was covered here in August. The patches have now been posted for review. "While bus1 emerged out of the kdbus project, bus1 was started from scratch and the concepts have little in common. In a nutshell, bus1 provides a capability-based IPC system, similar in nature to Android Binder, Cap'n Proto, and seL4."

Making swapping scalable

By Jonathan Corbet
October 26, 2016

The swap subsystem is where anonymous pages (those containing program data not backed by files in the filesystem) go when memory pressure forces them out of RAM. A widely held view says that swapping is almost always bad news; by the time a Linux system gets to the point where it is swapping out anonymous pages, the performance battle has already been lost. So it is not at all uncommon to see Linux systems configured with no swap space at all. Whether the relatively poor performance of swapping is a cause or an effect of that attitude is a matter for debate. What is becoming clearer, though, is that the case for using swapping is getting stronger, so there is value in making swapping faster.

Swapping is becoming more attractive as the performance of storage devices — solid-state storage devices (SSDs) in particular — increases. Not too long ago, moving a page to or from a storage device was an incredibly slow operation, taking several orders of magnitude more time than a direct memory access. The advent of persistent-memory devices has changed that ratio, to the point where storage speeds are approaching main-memory speeds. At the same time, the growth of cloud computing gives providers a stronger incentive to overcommit the main memory on their systems. If swapping can be made fast enough, the performance penalty for overcommitting memory becomes insignificant, leading to better utilization of the system as a whole.

As Tim Chen noted in a recently posted patch set, the kernel currently imposes a significant overhead on page faults that must retrieve a page from swap. The patch set addresses that problem by increasing the scalability of the swap subsystem in a few ways.

In current kernels, a swap device (a dedicated partition or a special file within a filesystem) is represented by a swap_info_struct structure. Among the many fields of that structure is swap_map, a pointer to a byte array, where each byte contains the reference count for a page stored on the swap device.
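The layout just described can be modeled roughly as follows. This is a toy Python sketch, not the kernel's C definition: the class name `SwapInfo`, the `nr_slots` parameter, and the `get`/`put`/`is_free` helpers are all illustrative assumptions; only the idea of a byte array holding one reference count per swapped page comes from the text.

```python
from dataclasses import dataclass, field


@dataclass
class SwapInfo:
    """Toy stand-in for swap_info_struct (NOT the real kernel layout)."""
    nr_slots: int                              # page-sized slots on the device
    swap_map: bytearray = field(init=False)    # one reference-count byte per slot

    def __post_init__(self):
        # Every slot starts out unused, i.e. with a reference count of zero.
        self.swap_map = bytearray(self.nr_slots)

    def get(self, slot: int) -> None:
        """Take a reference on the page stored in this slot."""
        self.swap_map[slot] += 1

    def put(self, slot: int) -> None:
        """Drop a reference; the slot is free again once the count is zero."""
        if self.swap_map[slot] == 0:
            raise ValueError("slot is already free")
        self.swap_map[slot] -= 1

    def is_free(self, slot: int) -> bool:
        return self.swap_map[slot] == 0
```

A byte per slot keeps the map compact: in this model, a device with a million slots needs about a megabyte of reference counts.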

Some of the swap code is quite old; a fair amount dates back to the beginning of the Git era. In the early days, the kernel would attempt to concentrate swap-file usage toward the beginning of the device, which corresponds to the low-index end of the swap_map array. When one is swapping to rotating storage, this approach makes sense; keeping data in the swap device together should minimize the amount of seeking required to access it. It works rather less well on solid-state devices, for a couple of reasons: (1) there is no seek delay on such devices, and (2) the wear-leveling requirements of SSDs are better met by spreading the traffic across the device.

In an attempt to perform better on SSDs, the swap code was changed in 2013 for the 3.12 release. When the swap subsystem knows that it is working with an SSD, it divides the device into clusters, each described by an entry in the cluster_info array.

The percpu_cluster pointer points to a different cluster for each CPU on the system. With this arrangement, each CPU can allocate pages from the swap device from within its own cluster, with the result that those allocations are spread across the device. In theory, this approach is also more scalable, except that, in current kernels, much of the scalability potential has not yet been achieved.
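The per-CPU arrangement can be sketched as below. This is a simplified Python model under stated assumptions: the class `ClusteredSwap`, the constant `SLOTS_PER_CLUSTER` (and its value), and the "take the next free cluster when the current one fills up" policy are illustrative choices, not a description of the kernel's actual allocator.

```python
SLOTS_PER_CLUSTER = 256  # assumed cluster size, for illustration only


class ClusteredSwap:
    """Toy model of a swap device divided into clusters."""

    def __init__(self, nr_slots, nr_cpus):
        self.swap_map = bytearray(nr_slots)
        self.nr_clusters = nr_slots // SLOTS_PER_CLUSTER
        self.free_clusters = list(range(self.nr_clusters))
        # Per-CPU state: [current cluster, next offset within it], or None.
        self.percpu_cluster = [None] * nr_cpus

    def alloc(self, cpu):
        """Allocate one swap slot from the given CPU's current cluster."""
        state = self.percpu_cluster[cpu]
        if state is None or state[1] == SLOTS_PER_CLUSTER:
            # No cluster yet, or the current one is full: claim a new one.
            if not self.free_clusters:
                raise MemoryError("no free clusters left")
            state = [self.free_clusters.pop(0), 0]
            self.percpu_cluster[cpu] = state
        cluster, offset = state
        state[1] += 1
        slot = cluster * SLOTS_PER_CLUSTER + offset
        self.swap_map[slot] = 1
        return slot
```

With two CPUs, the first allocations land in different clusters (slots 0 and 256 in this model), which is how the traffic ends up spread across the device.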

The problem, as is so often the case, has to do with locking. CPUs do not have exclusive access to any given cluster (even the one indicated by percpu_cluster), so they must acquire the spinlock in the swap_info_struct structure (a field named lock) before any changes can be made. There are typically not many swap devices on any given system (there is often only one), so, when swapping is heavy, that spinlock is heavily contended.

Spinlock contention is not the path to high scalability; in this case, that contention is not even necessary. Each cluster is independent and can be allocated from without touching the others, so there is no real need to wait on a single global lock. The first order of business in the patch set is thus to add a new lock to each entry in the cluster_info array; a single-bit lock is used to minimize the added memory consumption. Now, any given CPU can allocate pages from (or free pages into) its cluster without contending with the others.
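A minimal sketch of the per-cluster locking idea follows, using Python threads to stand in for CPUs. The names `Cluster` and `run_workers` are hypothetical, `threading.Lock` merely stands in for the single-bit lock mentioned above, and the cluster size is an assumed value.

```python
import threading

SLOTS_PER_CLUSTER = 256  # assumed cluster size, for illustration only


class Cluster:
    """One cluster with its own lock instead of a device-wide one."""

    def __init__(self, base):
        self.base = base
        self.lock = threading.Lock()  # stand-in for the single-bit lock
        self.next_free = 0

    def alloc(self):
        # Only this cluster's lock is taken; other clusters are untouched.
        with self.lock:
            if self.next_free == SLOTS_PER_CLUSTER:
                return None
            slot = self.base + self.next_free
            self.next_free += 1
            return slot


def run_workers(nr_cpus, allocs_per_cpu):
    """Let each simulated CPU allocate from its own cluster in parallel."""
    clusters = [Cluster(i * SLOTS_PER_CLUSTER) for i in range(nr_cpus)]
    results = [[] for _ in range(nr_cpus)]

    def worker(cpu):
        for _ in range(allocs_per_cpu):
            results[cpu].append(clusters[cpu].alloc())

    threads = [threading.Thread(target=worker, args=(cpu,))
               for cpu in range(nr_cpus)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Since each worker only ever touches the lock in its own cluster, no worker waits on another; in the single-lock design every one of these allocations would have serialized on the same lock.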

Even so, there is overhead in taking the lock, and there can be cache-line contention when accessing the lock in other CPUs' clusters (as can often happen when pages are freed, since nothing forces them to be conveniently within the freeing CPU's current cluster). To minimize that cost, the patch set adds new interfaces to allocate and free swap pages in batches. Once a CPU has allocated a batch of swap pages, it can use them without even taking the local cluster lock. Freed swap pages are accumulated in a separate cache and returned in batches. Interestingly, freed pages are not reused by the freeing CPU in the hope that freeing them all will help minimize fragmentation of the swap space.
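The batching scheme can be illustrated with a small model. Everything here is an assumption made for the sketch: the `SwapDevice` and `CpuSwapCache` classes, the batch size of 64, and the use of one device-wide lock in place of the per-cluster locks. The point is only that the lock is taken once per batch rather than once per page, and that freed slots go to a separate cache instead of being handed back to the allocating side.

```python
import threading


class SwapDevice:
    """Toy device whose lock is only taken for whole batches."""

    def __init__(self, nr_slots):
        self.lock = threading.Lock()
        self.free_slots = list(range(nr_slots))
        self.lock_acquisitions = 0  # counted to make the saving visible

    def alloc_batch(self, n):
        with self.lock:
            self.lock_acquisitions += 1
            batch, self.free_slots = self.free_slots[:n], self.free_slots[n:]
            return batch

    def free_batch(self, slots):
        with self.lock:
            self.lock_acquisitions += 1
            self.free_slots.extend(slots)


class CpuSwapCache:
    """Per-CPU caches of allocated and freed slots."""
    BATCH = 64  # hypothetical batch size

    def __init__(self, device):
        self.device = device
        self.alloc_cache = []
        self.free_cache = []

    def alloc(self):
        if not self.alloc_cache:
            self.alloc_cache = self.device.alloc_batch(self.BATCH)
            if not self.alloc_cache:
                raise MemoryError("swap device is full")
        return self.alloc_cache.pop(0)  # no lock needed here

    def free(self, slot):
        # Freed slots are deliberately not reused locally; they are
        # collected and returned to the device a batch at a time.
        self.free_cache.append(slot)
        if len(self.free_cache) >= self.BATCH:
            self.device.free_batch(self.free_cache)
            self.free_cache = []
```

In this model, allocating 128 slots costs two lock acquisitions rather than 128.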

There is one other contention point that needs to be addressed. Alongside the swap_info_struct structure, the swap subsystem maintains an address_space structure for each swap device. This structure contains the mapping between pages in memory and their corresponding backing store on the swap device. Changes in swap allocation require updating the radix tree in the address_space structure, and that radix tree is protected by another lock. Since, once again, there is typically only one swap device in the system, that is another global lock for all CPUs to contend for.

The solution in this case is a variant on the clustering approach. The address_space structure is replicated into many structures, one for each 64MB of swap space. If the swap area is sized at (say) 10GB, the single address_space will be split 160 ways, each of which has its own lock. That clearly reduces the scope for contention for any individual lock. The patch also takes care to ensure that the initial allocation of swap clusters puts each CPU into a separate address_space, guaranteeing that there will be no contention at the outset (though, once the system has been operating for a while, the swap patterns will become effectively random).
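The arithmetic behind that split is easy to check. The sketch below assumes 4KB pages and uses hypothetical helper names (`nr_address_spaces`, `address_space_index`); only the 64MB granularity and the 10GB example come from the text.

```python
PAGE_SIZE = 4096                             # assumed page size in bytes
SPACE_BYTES = 64 * 1024 * 1024               # 64MB of swap per address_space
SLOTS_PER_SPACE = SPACE_BYTES // PAGE_SIZE   # swap slots covered by each one


def nr_address_spaces(swap_bytes):
    """Number of address_space structures needed, rounding up."""
    return -(-swap_bytes // SPACE_BYTES)


def address_space_index(slot):
    """Which address_space (and thus which lock) covers a given slot."""
    return slot // SLOTS_PER_SPACE
```

For a 10GB swap area, `nr_address_spaces(10 * 1024**3)` gives 160, matching the figure above; two slots contend for the same lock only when they fall within the same 64MB region.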

According to Chen, current kernels add about 15µs of overhead to every page fault that is satisfied by a read from a solid-state swap device. That, he says, is comparable to the amount of time it takes to actually read the data from the device. With the patches applied, that overhead drops to 4µs, a significant improvement. There have been no definitive comments on the patch set as of this writing, but it seems like the sort of improvement that the swap subsystem needs to work well with contemporary storage devices.

A report from the documentation maintainer

By Jonathan Corbet
October 26, 2016

It is now almost exactly two years since my ill-advised decision to accept the role of the maintainer of the kernel's documentation collection. After a bit of a slow start, things have started to happen in the documentation area. As part of the preparation exercise for an upcoming Kernel Summit session on documentation, here is a report on where things stand and where they are going.

The biggest overall change, of course, is the transition away from a homebrew DocBook-based toolchain to a formatted documentation setup based on the Sphinx system, as was described in this article last July. The transition made some waves when it hit; in the 4.8-rc1 announcement, Linus noted that a full 20% of the patch set was documentation updates. It is fair to say that kernel developers do not ordinarily put that much effort into documentation. Much of the credit for this work goes to Daniel Vetter and Mauro Carvalho Chehab, who worked hard to transition the GPU and media subsystem documentation, respectively, along with Jani Nikula and Markus Heiser, who made the Sphinx-based plumbing work.

Perhaps unsurprisingly, there have been places where Sphinx has not worked out quite as well as desired. Perhaps the biggest initial disappointment was PDF output. The original plan was to use rst2pdf, a relatively simple tool that offered the possibility of creating PDF files without a heavy toolchain. It does indeed create pretty output for simple input files, but it falls over completely with more complex documents; after a while, it became clear that it was not going to meet the kernel community's needs.

That means falling back to LaTeX in 4.9; LaTeX works, but is not without its drawbacks. LaTeX is not a small system; the basic install on my openSUSE Tumbleweed system was over 1,700 packages. The base Fedora installation is much smaller, but that is not necessarily better. There, getting the documentation built requires executing a seemingly endless loop of "which .sty file is missing now, and which package provides it?" work. Part of the idea behind switching to Sphinx was to make setting up the toolchain easier; that goal has still been met for those who are happy with HTML or EPUB output, but remains elusive for PDF output.

After 4.8

The 4.7 kernel contains 34 "template" files that are processed by the DocBook-based toolchain; that number is down to 30 in the 4.9-rc kernels. The conversion of the remaining template files continues; eventually they will all be done and the DocBook dependency can be removed. The conversion is generally easy to do (there is a script included in the kernel source that helps), but making it all look nice can take a little longer. And updating some of the kernel's ancient documentation to match current reality may take longer yet.

A few dozen template files are one thing, but what about the various plain-text files scattered around the documentation directory? There are over 2,000 of these (not counting the device-tree files), some rather more helpful than others. Very little organizational thought has been applied to this directory. As former documentation maintainer Rob Landley put it in 2007, "Documentation/* is a gigantic mess, currently organized based on where random passers-by put things down last". It has improved little since then.

Now we are trying to improve it by applying some structure to the directory and by bringing the plain-text files into the growing body of Sphinx-based documentation. The latter task is easy — most of the plain-text files are almost in the reStructuredText format used by Sphinx already, so only minor tweaks are required. The organizational task is harder.

The 4.9 kernel will contain a couple of new sub-manuals in the Sphinx-based documentation. One of them, called dev-tools, is a collection of the plain-text documents about tools that can be used in kernel development. The other, driver-api, gathers information of interest to device-driver developers. Both of these books are works in progress; they exist in their current form mostly to show the way forward.

In 4.10, the chances are good that three more major sub-manuals will put in an appearance. One of them, tentatively called core-api, will be a collecting point for documentation about the core-kernel interfaces. That information is currently widely distributed among plain-text files and kerneldoc comments within the source itself; it will be good to have it together in one place — sometime well in the future, when the process of creating this manual has run its course.

Next, the process book will hold our (fairly extensive) documentation on how to work with the kernel development community. It includes the often-cited SubmittingPatches document (now process/submitting-patches.rst), along with information on coding style, email client configuration, and more. This work (done by Mauro) was ready in time for 4.9, but I put the brakes on it out of fear that moving files like SubmittingPatches would leave a lot of dangling links in the brains of the development community. Various discussions over the past month have failed to turn up even a single developer who was unhappy about it, though, so the current plan is for this work to proceed for 4.10.

The last proposed book recognizes that there are multiple audiences for the kernel's documentation; it will (probably) be called admin-guide and will be aimed at system administrators, users, and others who are trying to figure out how to get the kernel to do what they want. Much of our documentation covers module parameters, tuning knobs, and user-space APIs; collecting and organizing it should make it more accessible for our users.

Open issues

As this work proceeds, a number of issues have come up that are still in need of resolution; many of them come down to a tradeoff between simplicity and functionality. On the simplicity side, it is desirable to keep the documentation toolchain as simple and easy to set up as possible so that anybody can build the docs. On the other hand, making use of more functionality (and thus adding to the toolchain's dependencies) enables the creation of more expressive documentation.

One such issue is the use of the Sphinx math extension, which supports the formatting of mathematical expressions using the LaTeX syntax. As of 4.9, the media documentation is using this extension, but there is a cost: it forces the use of LaTeX even to build the HTML documentation. The hope is to find an easy way to fall back gracefully when LaTeX is unavailable in order to soften this dependency.
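One possible shape for such a fallback, sketched for a Sphinx conf.py, is shown below. It is an illustration rather than what the kernel's build does: the helper `choose_math_extension` is hypothetical, and the choice of `sphinx.ext.imgmath` (which runs LaTeX and, by default, dvipng) with `sphinx.ext.mathjax` (which leaves rendering to the browser) as the fallback is an assumption about which extensions might be used.

```python
import shutil


def choose_math_extension(which=shutil.which):
    """Pick a Sphinx math extension based on the tools that are installed.

    `which` is injectable so that the decision can be exercised without
    changing the machine; it defaults to shutil.which.
    """
    if which("latex") and which("dvipng"):
        # LaTeX toolchain present: render formulas to images at build time.
        return "sphinx.ext.imgmath"
    # No LaTeX: fall back to in-browser rendering for HTML output.
    return "sphinx.ext.mathjax"


# In conf.py, the result would simply join the list of extensions.
extensions = [choose_math_extension()]
```

The tradeoff is that the fallback path produces HTML that depends on JavaScript for its formulas, which may or may not be acceptable for the kernel's documentation.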

A deeper question has to do with the automatic generation of reStructuredText documentation from other files in the kernel tree. That is already done with the in-source kerneldoc comments, of course, but there is interest in pulling in a number of other types of information as well. That extends as far as reformatting the MAINTAINERS file as part of the documentation build process. There are patches circulating to allow, to a varying extent, the running of arbitrary programs during the documentation build to do this generation; these patches run into concerns about security and maintainability. The form of the solution to this problem is not yet entirely clear.
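To give a sense of what such generation involves, here is a small sketch that turns MAINTAINERS-style entries into reStructuredText. It is not one of the circulating patches; the functions `parse_entries` and `to_rst` are hypothetical, only four of the file's field tags are handled, and the parsing assumes the simple "title line followed by tagged lines" layout.

```python
FIELD_NAMES = {"M": "Maintainer", "L": "Mailing list",
               "S": "Status", "F": "Files"}


def parse_entries(text):
    """Split MAINTAINERS-style text into entries with (tag, value) fields."""
    entries = []
    current = None
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            current = None          # a blank line ends the current entry
            continue
        if current is not None and len(line) > 2 and line[1] == ":":
            current["fields"].append((line[0], line[2:].strip()))
        else:
            current = {"title": line, "fields": []}
            entries.append(current)
    return entries


def to_rst(entries):
    """Render each entry as a section heading followed by a field list."""
    out = []
    for entry in entries:
        out.append(entry["title"])
        out.append("-" * len(entry["title"]))
        out.append("")
        for tag, value in entry["fields"]:
            # Values are emitted as inline literals so that characters such
            # as '*' in file patterns are not treated as reST markup.
            out.append(":%s: ``%s``" % (FIELD_NAMES.get(tag, tag), value))
        out.append("")
    return "\n".join(out)
```

A pure text transformation like this one runs no external programs; the security and maintainability concerns arise once the build is allowed to execute arbitrary generators.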

Interestingly, there is significant disagreement over the removal of ancient, obsolete documentation. Do we really need, say, documentation from 1996 describing how to manually bisect bugs in the 1.3.x kernel? Resistance to removing such cruft usually comes in the form of "but it might be useful to somebody someday." But we do not retain unused code on that basis; we recognize that there is a cost to carrying such code in the kernel. There is, likewise, a cost to carrying old, obsolete documentation, paid by both the documentation maintainers and the users the documentation is meant to help. In my opinion, some spring cleaning is in order, even if spring is a distant prospect in the northern hemisphere.

One other possibly contentious change has been suggested by a few people now. Documentation/ is a long name, and is the only top-level directory in the kernel starting with a capital letter. One can joke that this distinction highlights the importance of documentation, but it's also a lot for people to type. So I've been asked a few times if it could be renamed to something like "docs". That, I think, is a question for the Kernel Summit.

Finally, it should be said that much of the above consists of a rearrangement of a bunch of kernel documentation that is of varying quality and is not all current. It makes the documentation prettier and, hopefully, easier to find, but does not yet turn it into a coherent body of accessible and useful material. There is a good case for doing the organizational work first, as long as we don't forget that there is a lot more to be done.

Despite the disagreements over how to proceed in some of these areas, and despite the magnitude of the task, there is a broad consensus that the time has come to improve the kernel's documentation. More effort is going into this part of the kernel than has been seen for some years. With any luck, kernel developers, distributors, and users will all be the beneficiaries of this work. For anybody who is looking for a way to help with kernel development, there are plenty of opportunities in the documentation area; we would be happy to hear from you. The linux-doc list at vger.kernel.org is a relatively calm place to work on documentation without subjecting oneself to the linux-kernel firehose. We look forward to your patches.

Linus Torvalds Linux 4.9-rc2 Oct 23

Greg KH Linux 4.8.4 Oct 22

Greg KH Linux 4.8.3 Oct 20

Sebastian Andrzej Siewior 4.8.2-rt3 Oct 24

Greg KH Linux 4.7.10 Oct 22

Greg KH Linux 4.7.9 Oct 20

Greg KH Linux 4.4.27 Oct 22

Greg KH Linux 4.4.26 Oct 20

Steven Rostedt 4.4.25-rt35 Oct 20

Steven Rostedt 4.1.34-rt39 Oct 20

Steven Rostedt 3.18.43-rt46 Oct 20

Ben Hutchings Linux 3.16.38 Oct 21

Jiri Slaby Linux 3.12.66 Oct 21

Ben Hutchings Linux 3.2.83 Oct 21

Yury Norov ILP32 for ARM64 Oct 21

Mark Rutland arm64: move thread_info off of the task stack Oct 19

Thiago Jung Bauermann kexec_file_load implementation for PowerPC Oct 21

Tim Chen Support Intel Turbo Boost Max Technology 3.0 Oct 20

Fenghua Yu Intel Cache Allocation Technology Oct 22

Nicolas Pitre make POSIX timers optional with some Kconfig help Oct 19

Luca Abeni CPU reclaiming for SCHED_DEADLINE Oct 24

Martijn Coenen android: binder: support for domains and scatter-gather. Oct 24

Daniel Mack Add eBPF hooks for cgroups Oct 25

David Herrmann Bus1 Kernel Message Bus Oct 26

Jan Glauber Cavium ThunderX uncore PMU support Oct 20

Imran Khan soc: qcom: Add SoC info driver Oct 20

Erin Lo Add clock support for Mediatek MT2701 Oct 21

Yang Ling [PATCH v2.1 1/2] watchdog: loongson1: Add Loongson1 SoC watchdog driver Oct 21

Neil Armstrong net: stmmac: Add OXNAS Glue Driver Oct 21

gabriel.fernandez@st.com STM32F4 Add RTC & QSPI clocks Oct 21

Fabrice Gasnier Add support for STM32 ADC Oct 25

Hardik Shah SoundWire bus driver Oct 21

Lubomir Rintel char/pcmcia: add scr24x_cs chip card interface driver Oct 20

Neil Armstrong pinctrl: Add SX150X GPIO Extender Pinctrl Driver Oct 21

Michael Scott pinctrl: qcom: Add msm8994 pinctrl driver Oct 21

Srinivas Kandagatla ASoC: Add support to Qualcomm msm8916-wcd multi codec Oct 20

Pavel Machek media: Driver for Toshiba et8ek8 5MP sensor Oct 23

Laurent Pinchart v4l: platform: Add Renesas R-Car FDP1 Driver Oct 24

Bartosz Golaszewski da850: DDR2/mDDR memory controller driver Oct 24

Steve Twiss da9061: DA9061 driver submission Oct 26

Jonathan Richardson Add support for Broadcom OTP controller Oct 24

Benjamin Gaignard add ION driver for STIh4xx SoC Oct 26

Rongrong Zou Add DRM driver for Hisilicon Hibmc Oct 26

Bjorn Andersson rproc subdevice support Oct 19

Viresh Kumar PM / OPP: Multiple regulator support Oct 20

Gustavo Padovan drm: add explicit fencing Oct 20

Anshuman Khandual Define coherent device memory node Oct 24

Alexander Duyck [net-next PATCH RFC 00/26] Add support for DMA writable pages being writable by the network stack Oct 24

Brian Starkey Introduce writeback connectors Oct 26

Antoine Tenart Add an overlay manager to handle board capes Oct 26

Jonathan Corbet Organize and clean up the admin and process guides Oct 26

Richard Weinberger UBIFS File Encryption Oct 21

Kirill A. Shutemov ext4: support of huge pages Oct 25

Miklos Szeredi overlayfs: allow moving directory trees Oct 25

Paolo Valente introduce the BFQ-v0 I/O scheduler as an extra scheduler Oct 26

Jens Axboe block: buffered writeback throttling Oct 26

Tim Chen mm/swap: Regular page swap optimizations Oct 20

Tom Herbert udp: Flow dissection for tunnels Oct 19

Johannes Berg genetlink improvements Oct 24

Florian Westphal netfilter: add fib expression Oct 24

David Lebrun net: add support for IPv6 Segment Routing Oct 26

Thiago Jung Bauermann ima: carry the measurement list across kexec Oct 21

Tetsuo Handa CaitSith LSM module Oct 21

John Stultz cgroup: Use CAP_SYS_RESOURCE to allow a process to migrate other tasks between cgroups Oct 20

Stephan Mueller /dev/random - a new approach code for 4.9-rc1 Oct 23

Mat Martineau Make keyring link restrictions accessible from userspace Oct 20

Elena Reshetova HARDENED_ATOMIC Oct 20

Mickaël Salaün Landlock LSM: Unprivileged sandboxing Oct 26

Liang Li Extend virtio-balloon for fast (de)inflating & fast live migration Oct 21

Punit Agrawal Add support for monitoring guest TLB operations Oct 26

Arnaldo Carvalho de Melo New Tool: perf c2c Oct 20
