Kernel development [LWN.net]

Kernel release status

The 4.5 merge window is open, following the 4.4 release on January 10. See the separate article below for a summary of what has been merged thus far.

Stable updates: none have been released since December 14.

The 4.5 merge window opens

As of this writing, just over 3,100 non-merge changesets have been pulled into the mainline repository for the 4.5 development cycle. As one would expect three days into the merge window, things are just getting started. Nonetheless, a number of significant changes have already been pulled. Some of the more interesting of those are:

- The device mapper's dm-verity subsystem, which is charged with validating the integrity of data on the underlying storage device, has gained the ability to perform forward error correction. This allows for the recovery of data from a device where "several consecutive corrupted blocks" exist. The first consumer for this appears to be Android, which uses dm-verity already.
- As usual, there is a long list of improvements to the perf events subsystem; see this merge commit for a detailed summary.
- Mandatory file locking is now optional at configuration time. This is a first step toward the removal (sometime in the distant future) of this unloved and little-used feature.
- The copy_file_range() system call has been merged. It allows for the quick copying of a portion of a file, with the operation possibly optimized by the underlying filesystem. The support code for copy_file_range() has also enabled an easy implementation of the NFSv4.2 CLONE operation.
- The User-Mode Linux port now supports the seccomp() system call.
- The SOCK_DESTROY operation, allowing a system administrator to shut down an open network connection, is now supported.
- The "clsact" network queueing discipline module has been added; see this commit changelog for details and usage information.
- The "version 2" control-group interface is now considered official and non-experimental; it can be mounted with the cgroup2 filesystem type. Not all controllers support this interface yet, though. See Documentation/cgroup-v2.txt for details on the new interface.
- New hardware support includes:
  - Cryptographic: Rockchip cryptographic engines and Intel C3xxx, C3xxxvf, C62x, and C62xvf cryptographic accelerators.
  - Miscellaneous: HiSilicon MBIGEN interrupt controllers, Technologic TS-4800 interrupt controllers, and Cirrus Logic CS3308 audio analog-to-digital converters.
  - Networking: Netronome NFP4000/NFP6000 VF interfaces, Analog Devices ADF7242 SPI 802.15.4 wireless controllers, Freescale data-path acceleration architecture frame manager devices, IBM VNIC virtual interfaces, and STMicroelectronics ST95HF NFC transceivers.
  - Pin control: Qualcomm MSM8996 pin controllers, Marvell PXA27x pin controllers, Broadcom NSP GPIO controllers, and Allwinner H3 pin controllers.
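
As an illustration of the copy_file_range() call mentioned in the list above, here is a minimal user-space sketch. It assumes a C library that provides a copy_file_range() wrapper (glibc has done so since 2.27); the helper names copy_bytes(), copy_fallback(), and demo_copy() are invented for this example, and the read()/write() fallback is simply one plausible way of coping with kernels or filesystems that refuse the call.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Plain read()/write() loop, used when copy_file_range() is unavailable. */
static ssize_t copy_fallback(int fd_in, int fd_out, size_t len)
{
	char buf[4096];
	size_t done = 0;

	while (done < len) {
		size_t want = len - done < sizeof(buf) ? len - done : sizeof(buf);
		ssize_t n = read(fd_in, buf, want);

		if (n < 0)
			return -1;
		if (n == 0)
			break;			/* end of input */
		if (write(fd_out, buf, (size_t)n) != n)
			return -1;
		done += (size_t)n;
	}
	return (ssize_t)done;
}

/*
 * Copy up to len bytes from fd_in to fd_out, starting at the current
 * file offsets.  Returns the number of bytes copied, or -1 on error.
 */
static ssize_t copy_bytes(int fd_in, int fd_out, size_t len)
{
	size_t done = 0;

	assert(fd_in >= 0 && fd_out >= 0);
	while (done < len) {
		/* NULL offsets ask the kernel to use (and update) the file offsets */
		ssize_t n = copy_file_range(fd_in, NULL, fd_out, NULL, len - done, 0);

		if (n < 0) {
			if (done == 0 && (errno == ENOSYS || errno == EXDEV ||
					  errno == EINVAL || errno == EOPNOTSUPP))
				return copy_fallback(fd_in, fd_out, len);
			return -1;
		}
		if (n == 0)
			break;			/* end of input */
		done += (size_t)n;
	}
	return (ssize_t)done;
}

/* Copy a short message between two temporary files; returns 0 on success. */
static int demo_copy(void)
{
	char src_path[] = "/tmp/cfr-src-XXXXXX";
	char dst_path[] = "/tmp/cfr-dst-XXXXXX";
	const char msg[] = "hello, copy_file_range";
	char buf[sizeof(msg)] = { 0 };
	int ok = -1;
	int src = mkstemp(src_path);
	int dst = mkstemp(dst_path);

	if (src >= 0 && dst >= 0 &&
	    write(src, msg, sizeof(msg)) == (ssize_t)sizeof(msg) &&
	    lseek(src, 0, SEEK_SET) == 0 &&
	    copy_bytes(src, dst, sizeof(msg)) == (ssize_t)sizeof(msg) &&
	    pread(dst, buf, sizeof(buf), 0) == (ssize_t)sizeof(buf) &&
	    memcmp(buf, msg, sizeof(msg)) == 0)
		ok = 0;

	if (src >= 0) {
		close(src);
		unlink(src_path);
	}
	if (dst >= 0) {
		close(dst);
		unlink(dst_path);
	}
	return ok;
}
```

Calling demo_copy() from a test program should return zero whether the kernel performs the copy itself or the fallback loop is used; the call may copy fewer bytes than requested, which is why copy_bytes() loops.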

Changes visible to kernel developers include:

The follow_link() method in struct inode_operations has been replaced with:
```
    const char *(*get_link) (struct dentry *dentry, struct inode *inode,
    			     struct delayed_call *done);
```
It differs from follow_link() (which was described in this article) by separating the dentry and inode arguments and, most importantly, being callable in the RCU-walk mode. In that case, dentry will be null, and get_link() is not allowed to block.
Also added in the same patch set was a "poor man's closures" mechanism, represented by struct delayed_call:
```
    struct delayed_call {
	void (*fn)(void *);
	void *arg;
    };
```
See include/linux/delayed_call.h for the (tiny) full interface. A get_link() implementation that needs to clean up after itself should set done->fn to its destructor function (and done->arg to the argument that function expects); the destructor is probably the one that was previously made available as the (now removed) put_link() inode_operations method.
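
The pattern can be tried outside the kernel with a small user-space sketch. The structure below mirrors the one shown above; the set_delayed_call() and do_delayed_call() helpers are modeled on what the kernel header is understood to provide, but they are re-implemented here purely for illustration, and demo_get_link(), free_link(), and links_freed are hypothetical names with no kernel counterpart.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct delayed_call {
	void (*fn)(void *);
	void *arg;
};

/* Record a function to be called later, along with its argument. */
static void set_delayed_call(struct delayed_call *call,
			     void (*fn)(void *), void *arg)
{
	call->fn = fn;
	call->arg = arg;
}

/* Run the recorded function, if there is one. */
static void do_delayed_call(struct delayed_call *call)
{
	if (call->fn)
		call->fn(call->arg);
}

static int links_freed;		/* counts destructor invocations */

/* Destructor for the link body handed out by demo_get_link(). */
static void free_link(void *body)
{
	free(body);
	links_freed++;
}

/*
 * A get_link()-like producer: return a freshly allocated copy of
 * target and tell the caller, through done, how to dispose of it.
 */
static const char *demo_get_link(const char *target, struct delayed_call *done)
{
	size_t size;
	char *body;

	assert(target != NULL && done != NULL);
	size = strlen(target) + 1;
	body = malloc(size);
	if (!body)
		return NULL;
	memcpy(body, target, size);
	set_delayed_call(done, free_link, body);
	return body;
}
```

A caller would zero-initialize a struct delayed_call, pass it to demo_get_link(), use the returned string, and then invoke do_delayed_call() exactly once; the producer decides what cleanup is needed while the consumer decides when it happens.
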
There is a new memory-barrier primitive:
```
    void smp_cond_acquire(condition);
```
It will spin until condition evaluates to a non-zero value, then insert a read barrier.
There is a new stall detector for workqueues; if any workqueue fails to make progress for 30 seconds, the kernel will output a bunch of information that should help in debugging the problem.
There is a new helper function:
```
    void *memdup_user_nul(const void __user *src, size_t len);
```
It will copy len bytes from user space, starting at src, allocating memory for the result and adding a null-terminating byte. Over 50 call sites have already shown up in the kernel.
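
The semantics are easy to picture with a user-space stand-in. The memdup_nul() function below is a hypothetical illustration, not the kernel helper: it copies from ordinary memory rather than from user space and returns NULL on failure, whereas kernel code of this kind normally reports errors through an encoded error pointer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Duplicate len bytes starting at src into newly allocated memory and
 * append a terminating NUL byte.  The caller owns the result and must
 * free() it.  Returns NULL if the allocation fails.
 */
static char *memdup_nul(const void *src, size_t len)
{
	char *copy;

	assert(src != NULL || len == 0);
	if (len == SIZE_MAX)		/* len + 1 would overflow */
		return NULL;
	copy = malloc(len + 1);
	if (!copy)
		return NULL;
	if (len)
		memcpy(copy, src, len);
	copy[len] = '\0';
	return copy;
}
```

The typical consumer is code that receives a buffer and a length (from a write() to a control file, say) and wants to parse it with ordinary string functions without worrying about a missing terminator.
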
The configfs virtual filesystem now supports binary attributes; see the documentation changes at the beginning of this commit for details.
Changes to the networking core mean that NAPI network drivers get busy polling for free, without the need to add explicit support.
Patches moving toward the removal of protocol-specific checksumming from networking drivers (described in this article) have been merged. See this merge commit for more information.

The 4.5 merge window will probably stay open until January 24, so there is time for a lot more changes to find their way into the mainline. As usual, LWN will track those changes and summarize them in the coming weeks; stay tuned.

Fixing asynchronous I/O, again

By Jonathan Corbet
January 13, 2016

The process of adding asynchronous I/O (AIO) support to the kernel began with the 2.5.23 development kernel in June 2002. Sometimes it seems that the bulk of the time since then has been taken up by complaints about AIO in the kernel. That said, AIO meets a specific need and has users who depend on it. A current attempt to improve the AIO subsystem has brought out some of those old complaints along with some old ideas for improving the situation.

Linux AIO does suffer from a number of ailments. The subsystem is quite complex and requires explicit code in any I/O target for it to be supported. The API is not considered to be one of our best and is not exposed by the GNU C library; indeed, the POSIX AIO support in glibc is implemented in user space and doesn't use the kernel's AIO subsystem at all. For files, only direct I/O is supported; despite various attempts over the years, buffered I/O is not supported. Even direct I/O can block in some settings. Few operations beyond basic reads and writes are supported, and those that are (fsync(), for example) are incomplete at best. Many have wished for a better AIO subsystem over the years, but what we have now still looks a lot like what was merged in 2002.

Benjamin LaHaise, the original implementer of the kernel AIO subsystem, has recently returned to this area with this patch set. The core change here is to short out much of the kernel code dedicated to the tracking, restarting, and cancellation of AIO requests; instead, the AIO subsystem simply fires off a kernel thread to perform the requested operation. This approach is conceptually simpler; it also has the potential to perform better and, in many cases, makes cancellation more reliable.

With that core in place, Benjamin's patch set adds a number of new operations. It starts with fsync(), which, in current kernels, only works if the operation's target supports it explicitly. A quick grep shows that, in the 4.4 kernel, there is not a single aio_fsync() method defined, so asynchronous fsync() does not work at all. With AIO based on kernel threads, it is a simple matter to just call the regular fsync() method and instantly have working asynchronous fsync() for any I/O target supporting AIO in general (though, as Dave Chinner pointed out, Benjamin's current implementation does not yet solve the whole problem).

In theory, fsync() is supported by AIO now, even if it doesn't actually work. A number of other things are not. Benjamin's patch set addresses some of those gaps by adding new operations, including openat() (opens are usually blocking operations), renameat(), unlinkat(), and poll(). Finally, it adds an option to request reading pages from a file into the page cache (readahead) with the intent that later attempts to access those pages will not block.

For the most part, adding these features is easy once the thread mechanism is in place; there is no longer any need to track partially completed operations or perform restarts. The attempts to add buffered I/O support to AIO in the past were pulled down by their own complexity; adding that support with this mechanism (not done in the current patch set) would not require much more than an internal read() or write() call. The one exception is the openat() support, which requires the addition of proper credential handling to the kernel thread.

The end result would seem to be a significant improvement to the kernel's AIO subsystem, but Linus still didn't like it. He is happy with the desired result and with much of the implementation, but he would like to see the focus be on the targeted capabilities rather than improving an AIO subsystem that, in his mind, is not really fixable. As he put it:

If you want to do arbitrary asynchronous system calls, just *do* it. But do _that_, not "let's extend this horrible interface in arbitrary random ways one special system call at a time".

In other words, why is the interface not simply: "do arbitrary system call X with arguments A, B, C, D asynchronously using a kernel thread".

That's something that a lot of people might use. In fact, if they can avoid the nasty AIO interface, maybe they'll even use it for things like read() and write().

Linus suggested that the thread-based implementation in Benjamin's patch set could be adapted to this sort of use, but that the interface needs to change.

Thread-based asynchronous system calls are not a new idea, of course; the idea has come around a number of times in the past under names like fibrils, threadlets, syslets, and acall. Linus even once posted an asynchronous system call patch of his own as these discussions were happening. There are some challenges to making asynchronous system calls work properly; there would have to be, for example, a whitelist of the system calls that can be safely run in this mode. As Andy Lutomirski pointed out, "exit is bad". Linus also noted that many system calls and structures as presented by glibc differ considerably from what the kernel provides; it would be difficult to provide an asynchronous system call API that could preserve the interface as seen by programs now.
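
To make the "do system call X with arguments A, B, C, D in a thread" idea concrete, here is a user-space approximation built on POSIX threads and the raw syscall() interface. It is a sketch of the concept only, not anything taken from the patches under discussion: struct async_call, async_allowed(), async_submit(), and async_wait() are all hypothetical, the whitelist is an arbitrary sample, and a kernel implementation would use kernel threads rather than pthreads.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sys/syscall.h>
#include <unistd.h>

struct async_call {
	long nr;		/* system call number */
	long args[6];		/* raw system call arguments */
	long result;		/* return value, valid after async_wait() */
	int err;		/* errno from the call, valid after async_wait() */
	pthread_t thread;
};

/* A toy whitelist; calls like exit must never be run this way. */
static int async_allowed(long nr)
{
	switch (nr) {
	case SYS_getpid:
	case SYS_read:
	case SYS_write:
	case SYS_fsync:
		return 1;
	default:
		return 0;
	}
}

/* Thread body: perform the requested call and record its outcome. */
static void *async_worker(void *data)
{
	struct async_call *c = data;

	errno = 0;
	c->result = syscall(c->nr, c->args[0], c->args[1], c->args[2],
			    c->args[3], c->args[4], c->args[5]);
	c->err = errno;
	return NULL;
}

/* Start the call in its own thread; returns 0 or a negative errno. */
static int async_submit(struct async_call *c)
{
	assert(c != NULL);
	if (!async_allowed(c->nr))
		return -EPERM;
	return -pthread_create(&c->thread, NULL, async_worker, c);
}

/* Wait for a submitted call to finish and return its result. */
static long async_wait(struct async_call *c)
{
	pthread_join(c->thread, NULL);
	return c->result;
}
```

Even this toy version shows the glibc mismatch noted above: the caller must supply raw system call numbers and arguments rather than the function signatures that programs normally see.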

Those challenges are real, but they may not prevent developers from having another look at the old ideas. But, as Benjamin was quick to point out, none of those approaches ever got to the point where they were ready to be merged. He seemed to think that another attempt now might run into the same sorts of complexity issues; it is not hard to conclude that he would really rather continue with the approach he has taken thus far.

Chances are, though, that this kind of extension to the AIO API is unlikely to make it into the mainline until somebody shows that the more general asynchronous system call approach simply isn't workable. The advantages of the latter are significant enough — and dislike for AIO strong enough — to create a lot of pressure in that direction. Once the dust settles, we may finally see the merging of a feature that developers have been pondering for years.

The present and future of formatted kernel documentation

By Jonathan Corbet
January 13, 2016

The kernel source tree comes with a substantial amount of documentation, believe it or not. Much of that can be found in the Documentation tree as a large set of rather haphazardly organized plain-text files. But there is also quite a bit of documentation embedded within the source code itself that can be extracted and presented in a number of formats. There has been an effort afoot for the better part of a year to improve the capabilities of the kernel's formatted-documentation subsystem; it's a good time for a look at the current state of affairs and where things might go.

Anybody who has spent much time digging around in the kernel source will have run across the specially formatted comments used there to document functions, structures, and more. These "kerneldoc comments" tend to look like this:

    /**
     * list_add - add a new entry
     * @new: new entry to be added
     * @head: list head to add it after
     *
     * Insert a new entry after the specified head.
     * This is good for implementing stacks.
     */

This comment describes the list_add() function and its two parameters (new and head). It is introduced by the "/**" marker and follows a number of rules; see Documentation/kernel-doc-nano-HOWTO.txt for details. Normal practices suggest that these special comments should be provided for all functions meant to be used outside of the defining code (all functions that are exported to modules, for example); some subsystems also use kerneldoc comments for internal documentation.
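
The same conventions can be applied to structures as well as functions. The following self-contained sketch shows both; struct sample_range and clamp_to_range() are invented for this example, and the "Return:" section reflects common kernel practice as understood here rather than anything stated above.

```c
#include <assert.h>
#include <stddef.h>

/**
 * struct sample_range - an inclusive range of integers
 * @low: smallest value in the range
 * @high: largest value in the range
 */
struct sample_range {
	int low;
	int high;
};

/**
 * clamp_to_range - force a value into a range
 * @range: the range to clamp to
 * @value: the value to be clamped
 *
 * Values below the range are raised to its lower bound and values
 * above it are lowered to its upper bound.
 *
 * Return: the clamped value.
 */
static int clamp_to_range(const struct sample_range *range, int value)
{
	assert(range != NULL && range->low <= range->high);
	if (value < range->low)
		return range->low;
	if (value > range->high)
		return range->high;
	return value;
}
```

The first line of each comment names the item being documented and gives a one-line summary; each parameter or member then gets its own "@name:" line before the free-form description.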

The documentation subsystem is able to extract these comments and render them into documents in a number of formats, including plain text, man pages, HTML, and PDF files. This can be done in a kernel source tree with a command like "make mandocs" or "make pdfdocs". There is also a copy of the formatted documentation on kernel.org; the end result for the comment above can be found on this page, for example. The results are not going to win any prizes for beautiful design, but many developers find them helpful.

Inside kernel-doc

The process of creating formatted documents starts with one of a number of "template files," found in the Documentation/DocBook directory. These files (there are a few dozen of them) are marked up in the DocBook format; they also contain a set of specially formatted (non-DocBook) lines marking the places where documentation from the source should be stuffed into the template. Thus, for example, kernel-api.tmpl contains a line that reads:

    !Iinclude/linux/list.h

The !I directive asks for the documentation for all functions that are not exported to modules. It is used rather than !E (which grabs documentation for exported functions) because the functions, being defined in a header file, do not appear in an EXPORT_SYMBOL() directive.
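
Recognizing such a line is straightforward. The sketch below is not docproc's actual code; enum doc_directive and parse_directive() are hypothetical names, only the two directives discussed here are handled, and the directive is assumed to start in the first column of the line.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum doc_directive {
	DOC_NONE,	/* an ordinary DocBook line */
	DOC_INTERNAL,	/* !I: functions not exported to modules */
	DOC_EXPORTED	/* !E: exported functions */
};

/*
 * Classify one template line.  For a directive, the source-file name
 * is copied into path (without the trailing newline); otherwise path
 * is left empty.
 */
static enum doc_directive parse_directive(const char *line, char *path,
					  size_t path_len)
{
	enum doc_directive kind;
	size_t n;

	assert(line != NULL && path != NULL && path_len > 0);
	path[0] = '\0';
	if (line[0] != '!')
		return DOC_NONE;
	if (line[1] == 'I')
		kind = DOC_INTERNAL;
	else if (line[1] == 'E')
		kind = DOC_EXPORTED;
	else
		return DOC_NONE;

	n = strcspn(line + 2, "\r\n");	/* length of the file name */
	if (n == 0 || n >= path_len)
		return DOC_NONE;
	memcpy(path, line + 2, n);
	path[n] = '\0';
	return kind;
}
```

Feeding it the line quoted above yields DOC_INTERNAL with a path of include/linux/list.h; any line that does not begin with "!" is passed over as template text.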

Turning a template file into one or more formatted documents is a lengthy process that starts with a utility called docproc, found in the scripts directory. This program (written in C) reads the template file, finds the special directives, and, for each of those directives, it does the following:

- A pass through the named source file is made, and each of the EXPORT_SYMBOL() directives found therein is parsed and the named function added to the list of exported symbols.
- A call is made to scripts/kernel-doc (a 2,700-line Perl script) to locate all of the functions, structures, and more that are defined in the source file. kernel-doc tries to parse the C code well enough to recognize the definitions of interest; in the process, it attempts to deal with some of the kernel's macro trickery without actually running the source through the C preprocessor. It will output a list of the names it found.
- docproc calls kernel-doc again, causing it to parse the source file a second time; this time, though, the output is the actual documentation for the functions of interest, with some minimal DocBook formatting added.
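
The first of those steps, collecting exported symbols, might look something like the following. This is a simplified illustration rather than docproc's real implementation: find_exports() and MAX_NAME are invented names, only the plain EXPORT_SYMBOL() form is matched, and no attempt is made to skip comments or string literals.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_NAME 64	/* longest symbol name stored, including the NUL */

/*
 * Scan C source text for EXPORT_SYMBOL(name) and store up to max names.
 * Returns the number of names found.
 */
static size_t find_exports(const char *src, char names[][MAX_NAME], size_t max)
{
	static const char marker[] = "EXPORT_SYMBOL(";
	size_t count = 0;
	const char *p = src;

	assert(src != NULL && names != NULL);
	while (count < max && (p = strstr(p, marker)) != NULL) {
		size_t len = 0;

		p += sizeof(marker) - 1;	/* skip past the marker */
		while (isspace((unsigned char)*p))
			p++;
		/* an identifier: letters, digits, and underscores */
		while ((isalnum((unsigned char)p[len]) || p[len] == '_') &&
		       len < MAX_NAME - 1)
			len++;
		if (len > 0) {
			memcpy(names[count], p, len);
			names[count][len] = '\0';
			count++;
		}
		p += len;
	}
	return count;
}
```

The resulting list is what lets a tool decide, for each documented function, whether it belongs in the "exported" or "not exported" output.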

The formatted output is placed into the template file in the indicated spot. If the target format is HTML, the kernel-doc-xml-ref script is run to generate cross-reference links. This feature, only added in 4.3, can only generate links within one template file; cross-template links are not supported.

The final step is to run the documentation-formatting tool to actually create the files in the format of interest. Most of the time, the xmlto tool is used for this purpose, though there are some provisions in the makefile for using other tools.

In other words, this toolchain looks just like what one might expect from a documentation system written by kernel developers. It gets the basic job done but it is not particularly pretty or easy to use. It is somewhat brittle, making it easy for developers to break the documentation build without knowing it. Numerous developers have said that they have given up on trying to actually get formatted output from it; depending on one's distribution, getting all of the pieces in place is not always easy. And a lot of potentially desirable features, like cross-file links, indexing, or formatting within the in-source comments, are not present.

Formatted comments

The latter issue — adding formatting to the kerneldoc comments — has been the subject of some work in recent times. Daniel Vetter has a long-term goal of putting much more useful graphics-subsystem information into those comments, but has found the lack of formatting to be an impediment once one gets beyond documenting function prototypes. To fix that, Intel funded some work that, among other things, produced a patch set allowing markup in the comments. Nobody really wants to see XML markup in C source, though, so the patch took a different approach, allowing markup to be done using the Markdown language. Using Markdown allowed a fair amount of documentation to be moved to the source from the template file, shedding a bunch of ugly XML markup on the way.

This work has not yet been merged into the mainline. Daniel has his own hypothesis as to why:

Unfortunately it died in a bikeshed fest due to an alliance of people who think docs are useless and you should just read the code, and others who didn't even know how to convert the kerneldoc into something pretty.

Your editor (who happens to be the kernel documentation maintainer, incidentally) has a different hypothesis. Perhaps this work remains outside because: (1) it is a significant change affecting all kernel developers that shouldn't be rushed; (2) it used pandoc, requiring, on your editor's Fedora test box, the installation of 70 Haskell dependencies to run; (3) it had unresolved problems stemming from disagreements between pandoc and xmlto regarding things like XML entity escaping; and (4) there is a certain natural reluctance to add another step to the kernel documentation house of cards. All of these concerns led to a discussion at the 2015 Kernel Summit and a lack of enthusiasm for quick merging of this change.

All that notwithstanding, there is no doubt that there is interest in adding formatting to the kernel's documentation comments. Your editor thinks that there might be a better way to do so, perhaps involving the removal of xmlto (and DocBook) entirely in favor of a Markdown-only solution or a system like Sphinx. Unfortunately, your editor has proved to be thoroughly unable to find the time to actually demonstrate that such an approach might work, and nobody else seems ready to jump in and do it for him. Meanwhile, the Markdown patches have been reworked to use AsciiDoc (which can be thought of as a rough superset of Markdown) instead. That change gets rid of the Haskell dependency (replacing it with a Python dependency) and improves some formatting features at the cost of slowing the documentation build considerably. Even if it is arguably not the best solution, it is out there and working now.

As a result, these patches will probably be pulled into the documentation tree (and, thus, into linux-next) in the next few weeks, with an eye toward merging in 4.6 if all looks well. It has been said many times that a subsystem maintainer's first job is to say "no" to changes. Sometimes, though, the right thing is to say "yes," even if said maintainer thinks that a better solution might be possible. A good-enough solution that exists now should not be held up overly long in the hopes that vague ideas for something else might turn into real, working code.

Patches and updates

Linus Torvalds Linux 4.4
Alexandre Oliva GNU Linux-libre 4.4-gnu is now available
Kamal Mostafa Linux 4.2.8-ckt1
Steven Rostedt 3.18.25-rt23
Luis Henriques Linux 3.16.7-ckt22
Steven Rostedt 3.14.58-rt59
Jiri Slaby Linux 3.12.52
Steven Rostedt 3.12.52-rt70
Steven Rostedt 3.10.94-rt102
Steven Rostedt 3.2.75-rt108
Ard Biesheuvel arm64: implement support for KASLR
Jiancheng Xue ARM: hisi: Add initial support including clock driver for Hi3519 soc.
Yury Norov ILP32 for ARM64
Joshua Henderson Initial Microchip PIC32MZDA Support
Dave Hansen x86: Memory Protection Keys (v8)
Andy Lutomirski x86/mm: PCID and INVPCID
Tony Luck Machine check recovery when kernel accesses poison
Jan Kara SYNC_NOIDLE preemption for ancestor cgroups
Juri Lelli CPUs capacity information for heterogeneous systems
Jessica Yu (mostly) Arch-independent livepatch
Ming Lei bpf: support percpu ARRAY map
Benjamin LaHaise aio: thread (work queue) based aio and new aio functionality
Frederic Weisbecker sched: Improve cpu load accounting with nohz
Dmitry Vyukov kernel: add kcov code coverage
Yakir Yang Add RK3229 HDMI support
Benoit Parrot media: ti-vpe: Add CAL v4l2 camera capture driver
Wenyou Yang regulator: act8945a: add regulator driver for the sub-device of ACT8945A MFD
Laxman Dewangan Add support for MAXIM MAX77620/MAX20024 PMIC
John Garry HiSilicon SAS v2 hw support
Cyrille Pitchen mtd: spi-nor: add driver for Atmel QSPI controller
richard.dorsch@gmail.com Add Advantech iManager EC driver set
Edwin Velds HID: Force feedback support for the Logitech G920
Ricardo Ribalda Delgado iio: add ad5761 DAC driver
Wenyou Yang mfd: act8945a: add Active-semi ACT8945A PMIC MFD driver
Wenyou Yang power: act8945a: add charger driver for the sub-device of ACT8945A MFD
Philipp Zabel MT8173 DRM support
Keith Busch Driver for new "VMD" device
Mike Looijmans hwmon: Add LTC2990 sensor driver
Aleksey Makarov ACPI: amba bus probing support
Tomeu Vizoso Allow USB devices to remain runtime-suspended when sleeping
Andrew Lunn Support MDIO devices
Gregory CLEMENT Proposal for a API set for HW Buffer management
Deepa Dinamani Add 64 bit timestamp support
Dan Williams fs, block: handle end of life
Andreas Gruenbacher Richacls (Core and Ext4)
Eric Sandeen quota: add new quotactl Q_XGETQUOTA2
Qu Wenruo Btrfs: Add inband (write time) de-duplication framework
Mimi Zohar vfs: support for a common kernel file loader (step 1)
Bob Peterson vfs: Expand iomap interface to fiemap
Li Xi ext4: add project quota support
Ross Zwisler DAX fsync/msync support
Matthew Wilcox Support for transparent PUD pages for DAX files
Jesper Dangaard Brouer MM: More bulk API work
Edward Cree Local Checksum Offload
Daniel Borkmann net, sched: add clsact qdisc
Jarno Rajahalme openvswitch: NAT support.
David Howells KEYS: Restrict additions to 'trusted' keyrings
Jari Ruusu Announce loop-AES-v3.7g file/swap crypto package
Tejun Heo cgroup, perf_event: make perf_event work on v2 hierarchy
Stephen Hemminger iproute2 4.4.0
