Kernel development
Brief items
Kernel release status
The current development kernel is 4.4-rc8, released on January 3. "Normally, me doing an eighth release candidate means that there is some unresolved issue that still needs more time to get fixed. This time around, it just means that I want to make sure that everybody is back from the holidays and there isn't anything pending, and that people have time to get their merge window pull requests all lined up. No excuses about how you didn't have time to get things done by the time the merge window opened, now."
Previously, 4.4-rc7 came out on December 27.
Stable releases: none have been released since December 14.
Quotes of the week
Kernel development news
On moving on from being a maintainer
The maintainer of a Linux subsystem has a large, and largely thankless, job to do. While reviewing patches is clearly technical in nature, much of the rest of the work is almost clerical—and it takes enough time that there may be little or no time for programming or other actively technical tasks. Thus, it is not a surprise to see that maintainers burn out over time and start looking for other work (in the kernel or elsewhere) to do. In fact, it is surprising that it doesn't happen more often. However, there is no clear path for relinquishing the maintainer role—and generally no succession plan—which can make the transition kind of tricky.
That scenario is currently playing out for the md (software RAID) subsystem, where maintainer Neil Brown has announced that he intends to step down on February 1. Brown got "sucked in" to being the md maintainer in late 2001 because there was no one else doing it. Since there is no "obvious candidate for replacement maintainer - no one who has already been doing significant parts of the maintainer role", he intends to create a maintainership vacuum in the hopes that one or more folks step up to fill the role.
He laments that he has not been able to attract additional maintainers, though he noted that there are some folks in the md community who are certainly capable of doing the job. The question, in Brown's eyes, is whether or not they "care about the code and the subsystem", which is something that only individuals can determine for themselves. That means he doesn't feel in a position to appoint anyone to the role and would like to see folks volunteer. By stepping down, he hopes to create a little pressure for someone to step up.
As he noted, Linus Torvalds has expressed a preference for small maintainer teams, which might make sense for md. Another alternative might be for the device mapper (dm) maintainer team to take on md maintainer duties as well. Beyond just md, though, Brown is also relinquishing the maintainer role for the mdadm administrative tool. That could be handled by the new md maintainer or team, though he would prefer to see different people maintain md and mdadm. According to Brown (in response to an email query), there are two main reasons he favors that separation: the arrangement worked well when he handed off nfsd to Bruce Fields and nfs-utils to Steve Dickson, and "it encourages public accountability - it is too easy for me to make an API change to md, start using it in mdadm, and not have anyone review it".
Brown's announcement describes the responsibilities of a maintainer as he sees them:
- to gather and manage patches and outstanding issues,
- to review patches or get them reviewed,
- to follow up on bug reports and get them resolved,
- to feed patches upstream, maybe directly to Linus, maybe through some other maintainer, depending on what relationships already exist or can be formed,
- to guide the longer term direction (saying "no" is important sometimes),
- to care.
As can be seen, there is a great deal to do. He also noted that another job he had previously spent a lot of time on, following the linux-raid mailing list to provide support on md-related issues, has fallen by the wayside for him. But, in what might be a preview of what will happen with the maintainer role, others in the md community have stepped up. He is "absolutely thrilled that the gap has been more than filled by other very competent community members".
Though he soon won't be doing the maintainer's job, Brown is not disappearing from the md world. He has committed to continuing work on the raid5-journal and raid1-cluster projects. He would also be willing to mentor any volunteers and will still review some patches as well as comment on designs. He concluded his announcement with a call to action.
Certainly Brown is not the only maintainer to find that they have tired of doing that job. Back at the end of 2014, John Linville stepped down as the wireless network maintainer by "promoting" some of the subsystem maintainers and handing off the wireless driver patch handling to Kalle Valo. The mac80211, Bluetooth, and NFC maintainers were asked to start pushing their patches directly to network maintainer David Miller, rather than going through Linville's tree. It seems that Linville had been more successful in finding maintainers along the way—or in them finding him—which made the handoff simpler when he decided to work on other things. The wireless subsystem is rather larger than md, however, which tends to foster a bigger pool of potential maintainers.
As with other parts of the kernel development process, the maintainership role is a bit haphazard. Maintainers handle their duties as they see fit and focus their efforts in different ways. The main job is to get the right patches in a—hopefully—timely manner to Torvalds and into the mainline. Determining which patches are "right" is part of the job, too, of course, but some maintainers (including Torvalds) largely leave that job to their sub-maintainers, while others do not. Some of that can be seen in our article on how patches get to the mainline.
In most cases, the maintainer's style has likely come about organically over time—certain things seemed to work for them. But that style may impact how a transition out of the role will need to be handled. For md, there may be some folks interested in the maintainer job (or, more likely, team), who spoke up in the short thread. While it may seem a little crazy to those outside the kernel development community, creating a vacuum as an exit strategy may actually work better than other mechanisms—at least for some subsystems and maintainers.
Protecting private structure members
Most languages designed in the last few decades offer a way to restrict access to portions of a data structure, limiting their use to the code that directly implements the object that structure is meant to represent. The C language, initially designed in 1972, lacks any such feature. Most of the time, C (along with the projects using it) muddles along without this kind of protective feature. But that doesn't mean there would not be a use for it.
If one browses through the kernel code, it's easy to find comments warning of dire results should outside code touch certain structure fields. The definition of struct irq_desc takes things a bit further, with a field defined as:
unsigned int core_internal_state__do_not_mess_with_it;
Techniques like this work most of the time, but it would still be nice if the computer could catch accesses to structure members by code that should have no business touching them.
Adding that ability is the goal of this patch set from Boqun Feng. It takes advantage of the fact that the venerable sparse utility allows variables to be marked as "not to be referenced." That feature is used primarily to detect kernel code that directly dereferences user-space pointers, but it can also be used to catch code that is messing around with structure members that it has not been invited to touch. Not all developers routinely run sparse, but enough do that new warnings would eventually be noticed.
The patch set adds a new __private marker that can be used to mark private structure members. So the above declaration could become:
unsigned int __private core_internal_state__do_not_mess_with_it;
As far as the normal C compiler is concerned, __private maps to the empty string and changes nothing. But when sparse is run on the code, it notes that the annotated member is not meant to be accessed and will warn when anybody tries.
Of course, some code must be able to access that field, or there is little point in having it there. Doing so without generating a sparse warning requires first removing the __private annotation; that is done by using the ACCESS_PRIVATE() macro. So code that now looks like:
foo = s->private_field;
would have to become:
foo = ACCESS_PRIVATE(s, private_field);
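For the curious, here is a rough sketch of how the two annotations can be wired up on top of sparse's existing attribute support, following the pattern of existing markers like __user; the details of the actual patch may differ. The __CHECKER__ symbol is defined only when sparse is doing the checking:

    #ifdef __CHECKER__
    /* Under sparse: direct accesses to a __private member draw a warning. */
    # define __private    __attribute__((noderef))
    /* Allowed accesses strip the annotation with a __force cast. */
    # define ACCESS_PRIVATE(p, member) \
        (*((typeof((p)->member) __force *) &(p)->member))
    #else
    /* For the real compiler, the annotation vanishes and accesses are direct. */
    # define __private
    # define ACCESS_PRIVATE(p, member) ((p)->member)
    #endif

With definitions along these lines, a normal build is unchanged, while a "make C=1" run (which invokes sparse) will complain about any direct access to a member marked __private.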
This aspect of the patch could prove to be the sticking point: some code may require a large number of ACCESS_PRIVATE() casts. Whether they are added to the code directly or hidden behind helper functions, they could lead to a fair amount of code churn if this feature is to be widely used. Given that the honor system works most of the time and that problems from inappropriate accesses to private data are relatively rare, the development community may decide that the current system works well enough.
The return of preadv2()/pwritev2()
Back in 2014, Milosz Tanski tried to add a flags argument to the preadv() and pwritev() system calls. At the time, your editor suggested that the patch set might be ready for the 3.19 development cycle. It might have been, but things stalled and the work was never actually merged. Now this idea is back, but with a different end goal in mind.
The existing preadv() and pwritev() system calls (along with readv() and writev(), which are simpler versions of them) lack any means to pass operation-specific options into the kernel, so there is no way to change how a specific call works. Milosz's specific need was to be able to turn on non-blocking behavior for some, but not all, operations; this is a feature that appears to have a number of use cases. The idea was reasonably well received at the 2015 Linux Filesystem, Storage and Memory Management Summit (held in March). Even so, work on these patches seemed to come to a halt, with the last posted version showing up in March.
Now, however, preadv2() and pwritev2() are back, though with a different use case in mind. This patch set, posted by Christoph Hellwig, still introduces the new system calls, which still look like:
    int preadv2(unsigned long fd, struct iovec *vec, unsigned long vlen,
                unsigned long pos_l, unsigned long pos_h, int flags);

    int pwritev2(unsigned long fd, struct iovec *vec, unsigned long vlen,
                 unsigned long pos_l, unsigned long pos_h, int flags);
(Note that the system calls, as presented by the C library, would almost certainly be a little different, with pos_l and pos_h being combined into a single, 64-bit position value).
Unlike Milosz's patch set, though, Christoph's does not provide for non-blocking operations. Instead, it provides a different flag (RWF_HIPRI) allowing an application to indicate a high-priority operation. The block layer can then use that flag to decide whether it should use the new block-layer polling mechanism with that request or not. Polling can, for fast devices (non-volatile memory, for example), reduce latencies significantly. But polling has its costs as well; it probably is best used only when an application is concerned about cutting latency to the bare minimum. Without a flag like RWF_HIPRI, the kernel can't really know if the application cares about latency or not.
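As a concrete, if speculative, illustration, a high-priority read using the proposed interface might look something like the following sketch. Since nothing has been merged yet, there is no C-library wrapper; the syscall number and RWF_HIPRI value below are placeholders rather than real ABI constants, and "testfile" is just a hypothetical file to read from:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* Placeholder values for illustration only; the real numbers will be
       assigned if and when the patches are merged. */
    #ifndef __NR_preadv2
    #define __NR_preadv2  -1
    #endif
    #ifndef RWF_HIPRI
    #define RWF_HIPRI     0x00000001
    #endif

    int main(void)
    {
        char buf[4096];
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
        int fd = open("testfile", O_RDONLY);

        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Arguments follow the prototype above: fd, vec, vlen, pos_l, pos_h,
           flags.  The file offset is zero here, so both position words are
           zero; RWF_HIPRI asks the block layer to poll for completion. */
        ssize_t n = syscall(__NR_preadv2, fd, &iov, 1, 0UL, 0UL, RWF_HIPRI);
        if (n < 0)
            perror("preadv2");
        else
            printf("read %zd bytes\n", n);

        close(fd);
        return n < 0 ? 1 : 0;
    }

An application that does not care about latency would simply pass zero for the flags argument (or keep using preadv()), leaving the block layer free to use its normal interrupt-driven completion path.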
Christoph hasn't forgotten about the non-blocking read use case; he also mentions other possibilities like per-operation synchronous behavior or use of the DIF/DIX data integrity mechanism. But his first priority at this point is to get the new system calls in place, along with the polling feature. Once that has been done, there are plenty of other flags that can be added.
How 4.4's patches got to the mainline
The kernel development community is organized as a hierarchy, with developers submitting patches to maintainers who will, in turn, commit those patches to a repository and push them upstream to higher-level maintainers. This hierarchy logically looks a lot like the directory hierarchy of the kernel source itself; most maintainers look after one or more subtrees of the kernel source tree. But does that model really describe how patches make it into the mainline? The kernel's git repository, with the aid of some scripting, holds an answer to that question.
With one exception, the process of pulling patches from one repository to another leaves a sign in the form of a merge commit. Those merge commits stay with their associated patches as they are pulled into subsequent repositories, eventually leaving clues to the pull history in the mainline repository. By working through the history and finding the merge that pulled in each patch, one can work out one plausible path by which each patch got to the mainline. The process takes a while to run and tends to make one's laptop warm up, but it produces interesting results in the end.
(A note for the curious: the one exception mentioned above is "fast-forward" merges, where the destination repository has not changed since the source repository diverged from it. Some projects fear merge commits and insist that all merges be fast-forward merges, but that policy causes the loss of some useful information. In any case, a no-merges policy would be difficult to scale to a project the size of the kernel. Fast-forward merges are rare in the kernel community, and almost never happen for merges into the mainline.)
The result of running this analysis is the plot shown to the right; click on the image to see the plot in its full, 2.1MB glory.
An aphorism occasionally heard among kernel developers is "design in layers, implement flat." It reflects the learned wisdom that layering is a useful design and abstraction tool, but excessive layering in implementations tends to lead to overhead and poor performance. This plot suggests that the kernel development community itself grew as if it were designed with this same heuristic in mind. The kernel source tree is a multi-layer hierarchy, and the maintainers are theoretically organized in the same way, but, in the end, almost every maintainer pushes patches directly to Linus and, thus, directly into the mainline repository. Most of the time, there are no intermediaries between subsystem maintainers and Linus.
Why are things organized that way? One reason is clearly to minimize the latency built into the system; once changes are committed by a maintainer, they can get to the mainline quickly if need be. This organization breaks pull requests into (mostly) manageable pieces that Linus can look over directly, allowing him to maintain some idea of what is happening in all parts of the kernel. And, importantly, it reflects the fact that Linus feels he can trust a fairly large number of maintainers to not sneak questionable changes into a pull request. He relies heavily on subsystem maintainers to properly review changes from developers, but he does not need higher-level maintainers to review the work the subsystem maintainers are doing.
Clearly, such a system will only work if that trust is maintained. Equally important is Linus's ability to manage pull requests from that many maintainers. Those who have been watching the kernel community for a long time will remember the frightening process-scalability crises that occurred regularly prior to the introduction of BitKeeper (and the subsequent switch to Git). Over five years ago, when kernel development cycles still ran under 10,000 changes and involved a maximum of 1,200 developers, we asked whether Linus was reaching a scalability limit. At the beginning of 2016, cycles run more quickly, bring in 13,000 changes, and will soon involve 1,600 developers, yet there are no real signs of strain.
It is good to know, though, that the process would easily accommodate spreading out the top-level responsibility if need be — should Linus get overwhelmed or simply step aside in favor of somebody else. He has advocated in favor of maintainer groups for subsystems; at some point, perhaps we'll have a maintainer group for the top-level repository as well.
The two trees that feed the most patches to the mainline are interesting in that they show two different maintainer styles. The most active tree in 4.4 was, as it often is, the staging tree, run by Greg Kroah-Hartman. 2,454 changes went through the staging tree in this cycle, but only 122 of them were merged from another repository; Greg applied each of the other 2,332 patches himself. That works out to roughly 33 patches applied directly each day over the course of the entire 70-day development cycle. Like many subsystem maintainers, Greg prefers to see patches posted to (and applied from) public mailing lists rather than pulled directly from other repositories.
The other entry at the top is really a pair: David Miller's networking trees ("net" and "net-next"), which together sent 2,276 patches into the mainline. The networking developers use the deepest hierarchy of any kernel subsystem, with a large percentage of the patches moving into David's tree from some other subsystem tree. The style of this group is also to use separate repositories for development ("net-next," for example) and for fixes ("net"), while other subsystems tend to put more things into the same repository, using branches to organize them. Thus, for example, the "tip" repository (with x86 and core-kernel work) and the arm-soc repository (covering many ARM-architecture topics) each generated numerous large pull requests during this development cycle, but each shows up as a single tree in this plot. One could separate these streams by looking at the names of the branches pulled from, at the risk of adding a fair amount of noise to the plot.
Attentive readers may have wondered at the use of the term "one plausible path" in the description of the algorithm at the top of this article. Consider the small piece of the plot shown to the right; it shows a single commit flowing from Mark Brown's "regmap" repository toward net-next. That flow represents this merge commit, wherein David pulled a single change from the regmap repository. When Linus pulled net-next, he got that change along with the rest. But that same commit was also part of this merge by Linus, which brought the rest of the regmap work directly into the mainline repository. At this point, the repository history shows that fix as having come via the latter merge, but the former merge remains in the history as well. More complicated patterns can be found, especially when developers perform "back merges" of a higher-level tree into their own repositories. Such merges are discouraged unless there is a good reason, partly because they tend to obscure the commit history.
Doubtless there are other interesting things to be learned by watching how changes make their way through the kernel development community and its repositories. For those who are interested in looking further, the tools used to create this plot can be found in the gitdm repository: git://git.lwn.net/gitdm.git.
[Note that the plots have been updated to fix a mysterious but egregious error; see the comments for details.]
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Device driver infrastructure
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Miscellaneous
Page editor: Jonathan Corbet
