Kernel development
Brief items
Kernel release status
The current development kernel is 3.18-rc7, released on November 30. Linus seems happy enough, despite the persistent lockup problem that has defied all debugging attempts so far. "At the same time, with the holidays coming up, and the problem _not_ being a regression, I suspect that what will happen is that I'll release 3.18 on time in a week, because delaying it will either mess up the merge window and the holiday season, or I'd have to delay it a *lot*."
3.18-rc6 was released on November 23.
Stable updates: 3.10.61, 3.14.25, and 3.17.4 were released on November 21.
Quotes of the week
static inline void * someone_think_of_a_name_for_this(gfp_t gfp_mask, unsigned int order) { return (void *)__get_free_pages(gfp, order); }
McKenney: Stupid RCU Tricks: rcutorture Catches an RCU Bug
On his blog, Paul McKenney investigates a bug in read-copy update (RCU) in preparation for the 3.19 merge window. "Of course, we all have specific patches that we are suspicious of. So my next step was to revert suspect patches and to otherwise attempt to outguess the bug. Unfortunately, I quickly learned that the bug is difficult to reproduce, requiring something like 100 hours of focused rcutorture testing. Bisection based on 100-hour tests would have consumed the remainder of 2014 and a significant fraction of 2015, so something better was required. In fact, something way better was required because there was only a very small number of failures, which meant that the expected test time to reproduce the bug might well have been 200 hours or even 300 hours instead of my best guess of 100 hours."
Version 2 of the kdbus patches posted
The second version of the kdbus patch set has been posted to the Linux kernel mailing list by Greg Kroah-Hartman. The biggest change since the original patch set (which we looked at in early November) is that kdbus now provides a filesystem-based interface (kdbusfs) rather than the /dev/kdbus device-based interface. There are lots of other changes in response to v1 review comments as well. "kdbus is a kernel-level IPC implementation that aims for resemblance to [the] protocol layer with the existing userspace D-Bus daemon while enabling some features that couldn't be implemented before in userspace."
Kernel development news
ACCESS_ONCE() and compiler bugs
The ACCESS_ONCE() macro is used throughout the kernel to ensure that code generated by the compiler will access the indicated variable once (and only once); see this article for details on how it works and when its use is necessary. When that article was written (2012), there were 200 invocations of ACCESS_ONCE() in the kernel; now there are over 700 of them. Like many low-level techniques for concurrency management, ACCESS_ONCE() relies on trickery that is best hidden from view. And, like such techniques, it may break if the compiler changes behavior or, as has been seen recently, contains a bug.
Back in November, Christian Borntraeger posted a message regarding the interactions between ACCESS_ONCE() and an obscure GCC bug. To understand the problem, it is worth looking at the macro, which is defined simply in current kernels (in <linux/compiler.h>):
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
In short, ACCESS_ONCE() forces the variable to be treated as being a volatile type, even though it (like almost all variables in the kernel) is not declared that way. The problem reported by Christian is that GCC 4.6 and 4.7 will drop the volatile modifier if the variable passed into it is not of a scalar type. It works fine if x is an int, for example, but not if x has a more complicated type. For example, ACCESS_ONCE() is often used with page table entries, which are defined as having the pte_t type:
typedef struct { unsigned long pte; } pte_t;
In this case, the volatile semantics will be lost in buggy compilers, leading to buggy kernels. Christian started by looking for ways to work around the problem, only to be informed that normal kernel practice is to avoid working around compiler bugs whenever possible; instead, the buggy versions should simply be blacklisted in the kernel build system. But 4.6 and 4.7 are installed on a lot of systems; blacklisting them would inconvenience many users. And, as Linus put it, there can be reasons for approaches other than blacklisting:
One way of being less fragile would be to change the affected ACCESS_ONCE() calls to point to the scalar parts of the relevant non-scalar types. So, if code does something like:
pte_t p = ACCESS_ONCE(pte);
It could be changed to something like:
unsigned long p = ACCESS_ONCE(pte.pte);
This type of change requires auditing all ACCESS_ONCE() calls, though, to find the ones using non-scalar types; that would be a lengthy and error-prone process that would not prevent the addition of new bugs in the future.
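The scalar-member rewrite can be tried outside the kernel. What follows is a minimal user-space sketch, assuming GCC or Clang; the macro is the one quoted above (spelled with __typeof__ so that it also builds in strict ISO modes), pte_t is a stand-in for the architecture-specific kernel type, and read_pte_value() is a hypothetical helper that does not exist in the kernel. It does not reproduce the GCC bug; it only shows the shape of the workaround.

```c
/* The macro quoted above, with __typeof__ in place of typeof. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Stand-in for the kernel's pte_t; the real definition varies by architecture. */
typedef struct { unsigned long pte; } pte_t;

/*
 * Hypothetical helper: apply ACCESS_ONCE() to the scalar member rather
 * than to the structure as a whole, so that the volatile cast is made on
 * an unsigned long and the affected compilers have nothing to drop.
 */
static inline unsigned long read_pte_value(pte_t *ptep)
{
    return ACCESS_ONCE(ptep->pte);
}
```

The cost is visible even in this small case: the caller ends up with an unsigned long rather than a pte_t, so every converted call site has to be adjusted by hand.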
Another approach to the problem explored by Christian was to remove a number of problematic ACCESS_ONCE() calls and just put in a compiler barrier with barrier() instead. In many cases, a barrier is sufficient, but in others it is not. Once again, a detailed audit is required, and there is nothing preventing new code from adding buggy ACCESS_ONCE() calls.
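For readers unfamiliar with barrier(), here is a minimal user-space sketch of the idea, assuming GCC or Clang. The macro body matches the kernel's GCC definition (an empty asm statement with a "memory" clobber); the stop_requested flag and spin_until_stopped() are hypothetical names invented for the illustration.

```c
/* Compiler barrier: an empty asm statement that clobbers memory. */
#define barrier() __asm__ __volatile__("" : : : "memory")

/* Hypothetical flag that some other context is assumed to set. */
static int stop_requested;

/*
 * Spin until stop_requested becomes nonzero or max_spins iterations have
 * passed. Without the barrier, the compiler would be entitled to load the
 * flag once and reuse that value for every iteration.
 * Returns the number of iterations performed.
 */
static unsigned long spin_until_stopped(unsigned long max_spins)
{
    unsigned long spins = 0;

    while (!stop_requested && spins < max_spins) {
        barrier();  /* cached values of memory must be discarded here */
        spins++;
    }
    return spins;
}
```

Note that barrier() only tells the compiler to forget what it knows about memory; it says nothing about how any individual access is performed, which is one reason it is not a drop-in replacement for ACCESS_ONCE() in every case.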
So Christian headed down the path of changing ACCESS_ONCE() to simply disallow the use of non-scalar types altogether. In the most recent version of the patch set, ACCESS_ONCE() looks like this:
#define __ACCESS_ONCE(x) ({ \
        __maybe_unused typeof(x) __var = 0; \
        (volatile typeof(x) *)&(x); })
#define ACCESS_ONCE(x) (*__ACCESS_ONCE(x))
This version will cause compilation failures if a non-scalar type is passed into the macro. But what about the situations where a non-scalar type needs to be used? For these cases, Christian has introduced two new macros, READ_ONCE() and ASSIGN_ONCE(). The definition of the former looks like this:
static __always_inline void __read_once_size(volatile void *p, void *res, int size)
{
    switch (size) {
    case 1: *(u8 *)res = *(volatile u8 *)p; break;
    case 2: *(u16 *)res = *(volatile u16 *)p; break;
    case 4: *(u32 *)res = *(volatile u32 *)p; break;
#ifdef CONFIG_64BIT
    case 8: *(u64 *)res = *(volatile u64 *)p; break;
#endif
    }
}

#define READ_ONCE(p) \
    ({ typeof(p) __val; __read_once_size(&p, &__val, sizeof(__val)); __val; })
Essentially, it works by forcing the use of scalar types, even if the variable passed in does not have such a type. Providing a single access macro that worked on both the left-hand and right-hand sides of an assignment turned out to not be trivial, so the separate ASSIGN_ONCE() was provided for the left-hand side case.
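To see how this plays out with a structure type, the following user-space sketch may help. It assumes GCC or Clang (statement expressions and __typeof__ are extensions), a 64-bit system, and a build with -fno-strict-aliasing as the kernel itself uses; the <stdint.h> types stand in for the kernel's u8 through u64, and read_once_size(), pte_t, and snapshot_pte() are illustrative names rather than kernel code.

```c
#include <stdint.h>

/* User-space rendition of the size-dispatching helper quoted above. */
static inline void read_once_size(const volatile void *p, void *res, int size)
{
    switch (size) {
    case 1: *(uint8_t *)res = *(const volatile uint8_t *)p; break;
    case 2: *(uint16_t *)res = *(const volatile uint16_t *)p; break;
    case 4: *(uint32_t *)res = *(const volatile uint32_t *)p; break;
    case 8: *(uint64_t *)res = *(const volatile uint64_t *)p; break;
    }
}

/* The volatile access is always made through a scalar pointer type. */
#define READ_ONCE(x) \
    ({ __typeof__(x) __val; read_once_size(&(x), &__val, sizeof(__val)); __val; })

/* Stand-in for the kernel's pte_t. */
typedef struct { unsigned long pte; } pte_t;

/* Hypothetical caller: take a one-time snapshot of a whole pte_t. */
static inline pte_t snapshot_pte(pte_t *ptep)
{
    return READ_ONCE(*ptep);
}
```

Unlike the scalar-member rewrite, the caller still gets a pte_t back, so code using the value does not need to change.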
Christian's patch set replaces ACCESS_ONCE() calls with READ_ONCE() or ASSIGN_ONCE() in cases where the latter are needed. Comments in the code suggest that those macros should be preferred to ACCESS_ONCE() in the future, but most existing ACCESS_ONCE() calls have not been changed. Developers using ACCESS_ONCE() to access non-scalar types in the future will get an unpleasant surprise from the compiler, though.
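The nature of that surprise can be sketched in user space, assuming GCC or Clang. The macros below are adapted from the patch quoted earlier, with the kernel's __maybe_unused written out as the attribute it stands for; load_counter() is a hypothetical caller.

```c
/* Scalar-only ACCESS_ONCE(), adapted from the patch quoted earlier. */
#define __ACCESS_ONCE(x) ({ \
    __attribute__((unused)) __typeof__(x) __var = 0; \
    (volatile __typeof__(x) *)&(x); })
#define ACCESS_ONCE(x) (*__ACCESS_ONCE(x))

/* Stand-in for the kernel's pte_t. */
typedef struct { unsigned long pte; } pte_t;

/* Scalar use: "int __var = 0" is a valid declaration, so this builds. */
static inline int load_counter(int *counter)
{
    return ACCESS_ONCE(*counter);
}

/*
 * Non-scalar use. Uncommenting the function below is expected to break
 * the build, since "pte_t __var = 0" is not a valid initialization of a
 * structure:
 *
 * static inline pte_t load_pte(pte_t *ptep)
 * {
 *     return ACCESS_ONCE(*ptep);
 * }
 */
```

The check costs nothing at run time; the unused __var declaration exists only to provoke the compiler when the type is wrong.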
This version of the patch has received few comments and seems likely to make it into the mainline in the near future; backports to the stable series are also probably on the agenda. There are times when it is best to simply avoid versions of the compiler with known bugs altogether. But, as can be seen here, compiler bugs can also be seen as a signal that things could be done better in the kernel, leading to more robust code overall.
Splicing out syscalls for tiny kernels
It is no secret that the Linux kernel has grown over time; the constant addition of features and hardware support means that almost every development cycle adds more code than it removes. The good news is that, for most of us, the increase in hardware speed and size has far outstripped the growth of the kernel, so few of us begrudge the extra resources that a larger kernel requires. Developers working on tiny systems, though, are still concerned about every byte consumed by the kernel. Accommodating their needs seems likely to be a source of ongoing stress in the community.
The latest example comes from Pieter Smith's patch set to remove support for the splice() family of system calls, including sendfile() and tee(). There will be many tiny systems with dedicated applications that have no need for those calls; removing them from the kernel makes 8KB of memory available for other purposes. The Linux "tinification" developers see that as a worthwhile gain, but some others disagree.
In particular, David Miller opposed the change, saying "I think starting to compile out system calls is a very slippery slope we should not begin the journey down." He worries that, even if a specific system works today without splice(), there may be a surprise tomorrow when some library starts using that system call. Developers working on Linux systems, David appears to be arguing, should be able to count on having the basic system call set available to them anywhere.
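From the user-space side, a system call that has been configured out fails with ENOSYS, and the sendfile() man page already suggests falling back to read() and write() when sendfile() fails with EINVAL or ENOSYS. A library that wanted to survive on such a kernel might do something like the following; this is a minimal sketch with hypothetical helper names (write_all() and copy_bytes()), not code taken from any existing library.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all of buf to fd, retrying on short writes. Returns 0 or -1. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Copy up to count bytes from in_fd to out_fd, starting at the current
 * file offsets. sendfile() is tried first; if the kernel lacks it (ENOSYS)
 * or cannot use it on these descriptors (EINVAL), the function switches
 * to a read()/write() loop. Returns the number of bytes copied, or -1.
 */
static ssize_t copy_bytes(int out_fd, int in_fd, size_t count)
{
    char buf[4096];
    size_t done = 0;
    int use_sendfile = 1;

    while (done < count) {
        size_t want = count - done;
        ssize_t n;

        if (use_sendfile) {
            n = sendfile(out_fd, in_fd, NULL, want);
            if (n < 0 && (errno == ENOSYS || errno == EINVAL)) {
                use_sendfile = 0;   /* fall back for the rest */
                continue;
            }
        } else {
            if (want > sizeof(buf))
                want = sizeof(buf);
            n = read(in_fd, buf, want);
            if (n > 0 && write_all(out_fd, buf, (size_t)n) < 0)
                return -1;
        }
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            break;  /* end of input */
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

Code written this way keeps working on a kernel without sendfile(), but only because its author anticipated the failure; David's concern is with all the code that does not.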
The tinification developers have a couple of answers to this concern. One is that developers working on tiny systems know what they are doing and which system calls they can do without. As Josh Triplett put it:
The other response is that the kernel has, in fact, provided support for compiling out major subsystems since the beginning. Quoting Josh again:
(This list goes on for some time; see the original mail for all the details). Eric Biederman added that the SYSV IPC system calls have been optional for a long time, and Alan Cox listed more optional items as well. David finally seemed to concede that making system calls optional was not a new thing for the Linux kernel, but he stopped short of actually supporting the splice() removal patch.
Without his opposition, though, this patch may go in. But a look at the kernel tinification project list makes it clear that this discussion is likely to return in the future. The tinification developers would like to be able to compile out support for SMP systems, random number generation, signal handling, capabilities, non-root users, sockets, the ability for processes to exit, and more. Eventually, they would like to have an automated tool that can examine a user-space image and build a configuration removing every system call that the given programs do not use.
Needless to say, any kernel that has been stripped down to that extent will not resemble a contemporary Linux system. But, on the other hand, neither do the ancient (but much smaller) kernels that these users often employ now. If Linux wants to have a place on tiny systems, the kernel will have to adapt to the resource constraints that come with such systems. That will bring challenges beyond convincing developers to allow important functionality to be configured out; the tinification developers will also have to figure out a way to allow this configuration without introducing large numbers of new configuration options and adding complexity to the build system.
It looks like a hard line to walk. But the Linux kernel embodies the solution to a lot of hard problems already; where there are willing developers, there is usually a way. If the tinification developers can find a way here, Linux has a much better chance of being present on the tiny systems that are likely to be embedded in all kinds of devices in the coming years. That seems like a goal worth trying for.
Version 2 of the kdbus patch set
When the long-awaited kdbus patch set hit linux-kernel at the end of October, it ran into a number of criticisms from reviewers. Some developers might have given up in discouragement, muttering about how unfriendly the kernel development community is. The kdbus developers know better than that, though. This can be seen in the version 2 posting; the code has changed significantly in response to the comments that were received the first time around. Kdbus may still not be ready for immediate inclusion into the mainline, but it does seem to be getting closer.
No more device files
One of the biggest complaints about the first version was its use of device files to manage interaction with the system. Devices need to be named; that forced a hierarchical global naming system on kdbus domains — which were otherwise not inherently hierarchical. The global namespace imposed a privilege requirement, making it harder for unprivileged users to create kdbus domains; it also added complications for users wanting to checkpoint and restore containers.
The second version does away with the device abstraction, replacing it with a virtual filesystem called "kdbusfs." This filesystem will normally be mounted under /sys/fs/kdbus. Creating a new kdbus domain (a container that holds a namespace for one or more buses) is simply a matter of mounting an instance of this filesystem; the domain will persist until the filesystem is unmounted. No special privileges are needed to create a new domain — but mounting a filesystem still requires privileges of its own.
A newly created domain will contain no buses at the outset. What it does have is a file called control; a bus can be created by opening that file and issuing a KDBUS_CMD_BUS_MAKE ioctl() command. That bus will remain in existence as long as the file descriptor for the control file is held open. Only one bus may be created on any given control file descriptor, but the control file can be opened multiple times to create multiple buses. The control file can also be used to create custom endpoints for well-known services.
Each bus is represented by its own directory underneath the domain directory; endpoints are represented as files within the bus directory. Connecting to a bus is a matter of opening the kdbusfs file corresponding to the desired endpoint; for most clients, that will be the file simply called bus. Messages can then be sent and received with ioctl() commands on the resulting file descriptor.
As can be seen, the device abstraction is gone, but the interface is still somewhat device-like in that it is heavily based on ioctl() calls. There has been a small amount of discussion on whether it might make more sense to just use operations like read() and write() to interact with kdbus, but there appears to be little interest in making (or asking for) that sort of change.
Metadata issues
A significant change that has been made is in the area of security. In version 1, the recipient of a message could specify a set of credential information that must accompany the message. This information can include anything from the process ID through to capabilities, command line information, audit information, security IDs, and more. Some reviewers (Andy Lutomirski in particular) complained that this approach could lead to information leaks and, maybe, worse security issues; instead, they said, the sender of a message should be in control of the metadata that goes along with the message.
The updated patch set contains a response to that request by changing the protocol. When a client connects to the bus, it runs the KDBUS_CMD_HELLO ioctl() command to set up a number of parameters for the connection; one of those parameters is now a bitmask describing which metadata can be sent with messages. It is possible for the creator of the bus to specify a minimum set of metadata to go with messages, though; in that case, a client refusing to send that metadata will not be allowed to connect to the bus.
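The connection-time check described here amounts to a subset test on two bitmasks. A minimal sketch follows; the flag names and the hello_allowed() function are invented for illustration and are not the constants or code found in the kdbus patches.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical metadata flags; NOT the real kdbus constants. */
enum {
    META_CREDS   = 1u << 0,
    META_CMDLINE = 1u << 1,
    META_CAPS    = 1u << 2,
    META_AUDIT   = 1u << 3,
};

/*
 * A client may connect only if the set of metadata it agrees to send
 * covers everything that the creator of the bus requires.
 */
static inline bool hello_allowed(uint64_t client_offers, uint64_t bus_requires)
{
    return (client_offers & bus_requires) == bus_requires;
}
```

A client offering META_CREDS | META_AUDIT could thus join a bus requiring only META_CREDS, while a client offering just META_CMDLINE could not.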
There is still some disagreement over which metadata should be sent, whether it's optional or not. Andy disagrees with providing command-line (and related) information, on the basis that it can be set by the process involved and thus carries no trustworthy information. This metadata is evidently used mostly for debugging purposes; Andy suggests that it should just be grabbed out of /proc instead. He is also opposed to the sending of capability information, noting that capabilities are generally problematic in Linux and their use should not be encouraged.
One other interesting bit of metadata that can be attached to messages is the time that the sending process started executing. It is there to prevent race conditions associated with the reuse of process IDs, which can happen quickly on a busy system. Andy dislikes that approach, noting that it will not work well with either namespaces or checkpointing. He prefers instead his own "highpid" solution. This patch adds a second, 64-bit, unique number associated with each process; interested programs can then detect process ID reuse by seeing if that number changes. Eric Biederman disagreed with that approach, saying "What we need are not race free pids, but a file descriptor based process management api." Andy was not opposed to that idea, but he would like to see something simple that can be of use to kdbus now.
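The start-time technique can be approximated from user space today, since the start time of a process is exported as field 22 of /proc/<pid>/stat. The sketch below pairs a process ID with that value and treats a change in either as a different process; struct proc_ref and the functions around it are hypothetical names, and none of this is taken from the kdbus patches.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* A process ID together with the start time seen when it was recorded. */
struct proc_ref {
    pid_t pid;
    unsigned long long start;   /* clock ticks since boot */
};

/* Read field 22 (starttime) of /proc/<pid>/stat. Returns 0 or -1. */
static int read_start_time(pid_t pid, unsigned long long *start)
{
    char path[64], line[1024];
    char *p;
    FILE *f;
    int field;

    snprintf(path, sizeof(path), "/proc/%ld/stat", (long)pid);
    f = fopen(path, "r");
    if (!f)
        return -1;
    p = fgets(line, sizeof(line), f);
    fclose(f);
    if (!p)
        return -1;

    /* The command name (field 2) may contain spaces; skip past its ')'. */
    p = strrchr(line, ')');
    if (!p)
        return -1;
    p++;    /* now at the space preceding field 3 */

    /* Advance to the space preceding field 22. */
    for (field = 3; field < 22; field++) {
        p = strchr(p + 1, ' ');
        if (!p)
            return -1;
    }
    return sscanf(p, "%llu", start) == 1 ? 0 : -1;
}

/* Two references name the same process only if both values match. */
static int proc_ref_same(const struct proc_ref *a, const struct proc_ref *b)
{
    return a->pid == b->pid && a->start == b->start;
}

/* Record pid and its current start time. Returns 0 or -1. */
static int proc_ref_init(struct proc_ref *ref, pid_t pid)
{
    ref->pid = pid;
    return read_start_time(pid, &ref->start);
}

/* Returns 1 if ref->pid still names the process that was recorded. */
static int proc_ref_still_valid(const struct proc_ref *ref)
{
    struct proc_ref now;

    if (proc_ref_init(&now, ref->pid) < 0)
        return 0;   /* the process is gone or /proc is unavailable */
    return proc_ref_same(ref, &now);
}
```

A check like this only narrows the window: the process can still exit and its ID be reused between the check and whatever the caller does next, which is the sort of race that a file-descriptor-based API of the kind Eric describes would be meant to close.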
Andy had a number of other comments, including pointing out a couple of places where, he contended, he could use kdbus to gain root access on any system where it was installed. Even so, he seems happy with the direction the code is going, saying "And thanks for addressing most of the issues. The code is starting to look much better to me."
Toward the mainline
In theory, resolving the remaining issues should be relatively straightforward, though it is not hard to see the "highpid" idea running into resistance at some point. But the number of reviewers for the second kdbus posting has been relatively small, perhaps as a result of the holidays in the US. The addition of a significant core API of this type requires more attention than kdbus has gotten so far. That suggests that there may still be significant issues that have not yet been raised by reviewers. Kdbus is getting closer to mainline inclusion, but it may well take a few more development cycles to get to a point where most developers are happy with it.
Some 3.18 development statistics
As of the 3.18-rc6 release, 11,186 non-merge changesets have been pulled into the mainline repository for the 3.18 development cycle. That makes this release about 1,000 changesets smaller than its immediate predecessors, but still not a slow development cycle by any means. Since this cycle is getting close to its end, it's a good time to look at where the code that came into the mainline during this cycle came from. (For those who are curious about what changes were merged, see 3.18 Merge window, part 1, part 2, and part 3).
1,428 developers have contributed code to the 3.18 release, about normal for the last year or so. The most active developers were:
Most active 3.18 developers
By changesets
H Hartley Sweeten          237   2.1%
Mauro Carvalho Chehab      179   1.6%
Ian Abbott                 162   1.4%
Geert Uytterhoeven         121   1.1%
Hans Verkuil               100   0.9%
Ville Syrjälä               98   0.9%
Navin Patidar               98   0.9%
Sujith Manoharan            83   0.7%
Johan Hedberg               82   0.7%
Eric Dumazet                77   0.7%
Lars-Peter Clausen          75   0.7%
Antti Palosaari             72   0.6%
Fabian Frederick            71   0.6%
Daniel Vetter               70   0.6%
Florian Fainelli            70   0.6%
Felipe Balbi                70   0.6%
Benjamin Romer              68   0.6%
Laurent Pinchart            64   0.6%
Andy Shevchenko             62   0.6%
Malcolm Priestley           61   0.5%

By changed lines
Larry Finger             74831  10.2%
Greg Kroah-Hartman       73298  10.0%
Hans Verkuil             22266   3.0%
Alexander Duyck          16617   2.3%
Greg Ungerer             11981   1.6%
Linus Walleij            10628   1.5%
John L. Hammond          10269   1.4%
Navin Patidar             8148   1.1%
Philipp Zabel             7149   1.0%
Martin Peres              6890   0.9%
Mark Einon                6771   0.9%
Mauro Carvalho Chehab     6520   0.9%
Ian Munsie                5773   0.8%
H Hartley Sweeten         5134   0.7%
Alexei Starovoitov        4505   0.6%
Yan, Zheng                4485   0.6%
Antti Palosaari           4181   0.6%
Roy Spliet                3785   0.5%
Christoph Hellwig         3765   0.5%
Juergen Gross             3745   0.5%
As is usually the case, H. Hartley Sweeten tops the by-changesets list with the epic task of getting the Comedi drivers into shape in the staging tree. Mauro Carvalho Chehab, the Video4Linux2 maintainer, did a lot of cleanup work in that tree as well during this cycle, while Ian Abbott's changes were, once again, applied to the Comedi drivers. Geert Uytterhoeven did a lot of work in the ARM and driver trees, while Hans Verkuil also made a lot of improvements to the core Video4Linux2 subsystem.
On the "lines changed" side, Larry Finger removed the r8192ee driver from the staging tree, while Greg Kroah-Hartman removed two other drivers from staging. Alexander Duyck added the "fm10k" driver for Intel FM10000 Ethernet switch host interfaces, and Greg Ungerer removed a bunch of old m68k code.
Some 200 companies (that we were able to identify) supported development on the code merged for 3.18. The most active of those were:
Most active 3.18 employers
By changesets
(None)                             1244  11.0%
Intel                              1238  10.9%
Red Hat                             863   7.6%
(Unknown)                           828   7.3%
Samsung                             523   4.6%
Linaro                              370   3.3%
IBM                                 340   3.0%
SUSE                                326   2.9%
                                    324   2.9%
(Consultant)                        321   2.8%
Freescale                           238   2.1%
FOSS Outreach Program for Women     238   2.1%
Vision Engraving Systems            237   2.1%
Texas Instruments                   199   1.8%
Renesas Electronics                 179   1.6%
MEV Limited                         162   1.4%
Free Electrons                      155   1.4%
Qualcomm                            141   1.2%
Oracle                              135   1.2%
ARM                                 114   1.0%

By lines changed
(None)                           185247  25.3%
Linux Foundation                  73354  10.0%
Intel                             73168  10.0%
(Unknown)                         28460   3.9%
Cisco                             27939   3.8%
Red Hat                           27335   3.7%
Linaro                            23586   3.2%
Samsung                           19228   2.6%
IBM                               18194   2.5%
SUSE                              16736   2.3%
                                  14110   1.9%
(Consultant)                      12455   1.7%
Accelerated Concepts              11986   1.6%
Texas Instruments                 11305   1.5%
C-DAC                              8400   1.1%
Pengutronix                        8232   1.1%
Freescale                          7265   1.0%
(Academia)                         7076   1.0%
Qualcomm                           5398   0.7%
Code Aurora Forum                  5377   0.7%
(Note that the above table has been updated; the curious can see the original version published on this page here).
As is often the case, there are few surprises here. The level of contributions from developers working on their own time remains steady at about 11%, a level it has maintained since the 3.13 kernel. So it might be safe to say that, for now, the decline in volunteer contributions appears to have leveled out.
How important are volunteer contributions to the Linux kernel? Many kernel developers started that way, so it is natural to think that a decline in volunteers will lead, eventually, to a shortage of kernel developers overall. As it happens, the period starting with the 3.13 release (roughly calendar year 2014) saw first-time contributions from 1,521 developers. Looking at who those developers worked for yields these results:
Employer     Developers
(Unknown)           651
(None)              137
Intel               115
                     37
Samsung              35
Huawei               33
IBM                  32
Red Hat              25
Freescale            21
Linaro               17
All told, 733 first-time developers (the 1,521 total, less the 651 unknowns and the 137 known volunteers) were identifiably working for some company or other when their first patch was accepted into the mainline. A large portion of the unknowns above are probably volunteers, so one can guess that a roughly equal number of first-time developers were working on their own time. So roughly half of our new developers in the last year were volunteers.
The picture changes a little, though, when one narrows things down to first-time developers who contributed to more than one release. When one looks at developers who contributed to three out of the last five releases, the picture is:
Employer                     Developers
(Unknown)                            48
Intel                                24
(None)                               21
Huawei                               10
IBM                                   7
Samsung                               6
Outreach Program for Women            6
ARM                                   4
Linaro                                4
Red Hat                               3
Broadcom                              3
Overall, 126 new developers contributing to at least three releases in the last year worked for companies at the time of their first contribution — rather more than the number of volunteers. So it seems fair to say that a lot of our new developers are getting their start within an employment situation, rather than contributing as volunteers then being hired.
Where are these new developers working in the kernel? If one looks at all new developers, the staging tree comes out on top; 301 developers started there, compared to 122 in drivers/net, the second-most popular starting place. But the most popular place for a three-version developer to make their first contribution is in drivers/net; 25 new developers contributed there, while 20 contributed within the staging tree. So, while staging is arguably helping to bring in new developers, a lot of the developers who start there appear to not stay in the kernel community.
Overall, the pattern looks reasonably healthy. There are multiple paths for developers looking to join our community, and it is possible for new developers to work almost anywhere in the kernel tree. That would help to explain how the kernel development community continues to grow over time. For now, there doesn't appear to be any reason to believe that we will not continue to crank out kernel releases at a high rate indefinitely.
Page editor: Jonathan Corbet