Kernel development
Brief items
Kernel release status
The current development kernel is 3.17-rc6, released on September 21. Linus said: "It's been quiet - enough so that coupled with my upcoming travel, this might just be the last -rc, and final 3.17 might be next weekend."
Stable updates: no stable updates have been released in the last week.
Quotes of the week
Kernel development news
Non-blocking buffered file read operations
It is natural to think of buffered I/O on Unix-like systems as being asynchronous. In Linux, the page cache exists to separate process-level I/O requests from the physical requests sent to the storage media. But, in truth, some operations are synchronous; in particular, a read operation cannot complete until the data is actually read into the page cache. So a call to read() on a normal file descriptor can always block; most of the time this blocking causes no difficulties, but it can be problematic for programs that need to always be responsive. Now, a partial solution is in the works for this kind of code, but it comes at the cost of adding as many as four new system calls.

The problem of blocking buffered reads is not new, of course, so applications have worked around it in a number of ways. One common approach is to create a set of threads dedicated to performing buffered I/O. Those threads can block while other threads in the program continue to do other work. This solution works and can be efficient, but it inevitably adds a certain amount of inter-thread communication overhead, especially in cases where the desired data is already in the page cache and a read() call could have completed immediately.
A recent patch set from Milosz Tanski attempts to solve the problem in a different way. Milosz's approach is to allow a program to request non-blocking behavior at the level of a single read() call. Unfortunately, the existing read() system call and its variants lack a "flags" argument, so there is no way to express that request using them. So Milosz adds four new system calls:
    int readv2(unsigned long fd, struct iovec *vec, unsigned long vlen,
               int flags);

    int writev2(unsigned long fd, struct iovec *vec, unsigned long vlen,
                int flags);

    int preadv2(unsigned long fd, struct iovec *vec, unsigned long vlen,
                unsigned long pos_l, unsigned long pos_h, int flags);

    int pwritev2(unsigned long fd, struct iovec *vec, unsigned long vlen,
                 unsigned long pos_l, unsigned long pos_h, int flags);
In each case, the system call is just like its predecessor with the exception of the addition of the flags argument. Note that the two offset parameters (pos_l and pos_h) to preadv2() and pwritev2() will be combined into a single off_t parameter at the C library level.
In Milosz's patch set, the only supported flag is RWF_NONBLOCK, which requests non-blocking operation. If a read request is accompanied by this flag, it will only complete if (at least some of) the requested data is already in the page cache; otherwise it returns EAGAIN. The current patch does not start any sort of readahead operation if it is unable to satisfy a non-blocking read request. The new write operations do not support non-blocking operation; the flags argument must be zero when calling them. Adding non-blocking behavior to write() is possible; such a write would only complete if memory were immediately available for a copy of the data in the page cache. But that implementation has been left as a future exercise.
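To make the interface concrete, here is a rough sketch of how an application might issue a non-blocking read with the proposed preadv2() call. None of this exists in released kernels or C libraries yet, so the syscall number, the RWF_NONBLOCK value, and the wrapper function below are placeholders standing in for whatever the final ABI provides; the eventual C-library wrapper will also take a single off_t offset rather than the split pos_l/pos_h pair.

    /*
     * Sketch only: preadv2() and RWF_NONBLOCK are not yet part of any
     * released kernel or C library; the constants below are placeholders.
     */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #define NR_preadv2_PLACEHOLDER  999         /* stand-in syscall number */
    #define RWF_NONBLOCK            0x00000001  /* stand-in flag value */

    static ssize_t preadv2_nonblock(int fd, struct iovec *vec, int vlen, off_t pos)
    {
        /* The real C-library wrapper is expected to split the offset into
         * pos_l/pos_h itself; passing (pos, 0) keeps this sketch simple for
         * offsets that fit in a long. */
        return syscall(NR_preadv2_PLACEHOLDER, fd, vec, vlen,
                       (unsigned long)pos, 0UL, RWF_NONBLOCK);
    }

    int main(int argc, char **argv)
    {
        char buf[4096];
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
        int fd = open(argc > 1 ? argv[1] : "/etc/passwd", O_RDONLY);

        if (fd < 0) {
            perror("open");
            return 1;
        }

        ssize_t n = preadv2_nonblock(fd, &iov, 1, 0);
        if (n >= 0)
            printf("read %zd bytes straight from the page cache\n", n);
        else if (errno == EAGAIN)
            printf("data not cached; hand this request off to an I/O thread\n");
        else
            perror("preadv2");   /* ENOSYS here until the call actually exists */

        close(fd);
        return 0;
    }

The interesting part for applications is the EAGAIN branch: only requests that would have blocked need to be handed off to a worker thread, while cached data is returned immediately in the calling thread.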
Considering the alternatives
The patch is relatively simple and straightforward, but one might well wonder: why is it necessary to add a new set of system calls for non-blocking operation when the kernel has long supported this mode via the O_NONBLOCK flag, set either at open() time or with fcntl()? There are, it seems, a couple of reasons for not wanting to implement ordinary non-blocking I/O behavior for regular files, the first being that it would break existing applications.
Given that non-blocking I/O is an optional behavior that must be explicitly requested, it is not obvious that supporting it for regular files would create trouble. It comes down to the fact that passing O_NONBLOCK to an open() call actually requests two different things: (1) that the open() call, itself, not block, and (2) that subsequent I/O be non-blocking. There are applications that use O_NONBLOCK for the first reason; Samba uses it, for example, to keep an open() call from blocking in the presence of locks on the file. Since buffered reads have always blocked regardless of O_NONBLOCK, applications do not concern themselves with calling fcntl() to reset the flag before calling read(). If those read() calls start returning EAGAIN, the application, which is not expecting that behavior, will fail.
One could argue that this behavior is incorrect, but it has worked for decades; breaking these applications with a kernel change is not acceptable. Samba is not the only application to run into trouble here; evidently squid and GQview fail as well. So the problem is clearly real.
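The behavior those applications depend on is easy to demonstrate. In the sketch below (the file name is arbitrary), O_NONBLOCK affects the open() call itself, but the subsequent read() on a regular file never fails with EAGAIN; it simply blocks, if necessary, until the data has been brought into the page cache.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        /* O_NONBLOCK is honored by open() itself... */
        int fd = open("/etc/passwd", O_RDONLY | O_NONBLOCK);

        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* ...but, on a regular file, this read() never returns EAGAIN; it
         * blocks (if needed) until the page cache holds the data.  Programs
         * like Samba count on exactly that. */
        ssize_t n = read(fd, buf, sizeof(buf));
        printf("read() returned %zd\n", n);

        close(fd);
        return 0;
    }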
Beyond that, as Volker Lendecke explained, full non-blocking behavior would not play well with how applications like Samba want to use this feature. The wish is to attempt to read the needed data in non-blocking mode; should the data not be available, the request will be turned over to the thread pool for synchronous execution. If the thread pool uses the same file descriptor (with O_NONBLOCK still set on it), its attempts to perform blocking reads will fail. If it uses a different file descriptor, it can run into weird problems relating to the surprising semantics of POSIX file locks (see this article for more information). So the ability to request non-blocking behavior on a per-read basis is needed.
Another possibility would be to add some version of the proposed fincore() system call, which would allow a process to ask the kernel whether a specific range of file data is present in the page cache. Patches adding fincore() have been around since at least 2010. But fincore() is seen as a bit of an indirect route toward the real goal, and there is always the possibility that the situation might change between a call to fincore() and the application's decision to do something based on the result. Requesting non-blocking behavior in the read() call itself avoids that particular race condition.
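Since fincore() has never been merged, the closest thing available today is mincore() on an mmap()ed file. The sketch below uses mincore() purely as a stand-in for the sort of query fincore() would provide; it shows the check-then-act pattern and, in the comments, the race window that a per-read RWF_NONBLOCK flag closes.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        long page = sysconf(_SC_PAGESIZE);
        int fd = open("/etc/passwd", O_RDONLY);

        if (fd < 0) {
            perror("open");
            return 1;
        }

        void *map = mmap(NULL, page, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* Ask whether the first page of the file is resident in the page
         * cache; fincore() would answer the same question without the mmap(). */
        unsigned char resident = 0;
        if (mincore(map, page, &resident) == 0 && (resident & 1))
            printf("first page is cached; a read would not block\n");
        else
            printf("first page is not cached; a read would block\n");

        /*
         * The race: between the check above and any read() issued on the
         * strength of it, memory pressure can evict the page (or readahead
         * can pull it in).  RWF_NONBLOCK avoids the problem by making the
         * check and the read a single operation.
         */
        munmap(map, page);
        close(fd);
        return 0;
    }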
Finally, one could also consider the kernel's asynchronous I/O subsystem, which allows an application to obtain non-blocking behavior on a per-request basis. But asynchronous I/O has never been supported for buffered I/O, and attempts to add that functionality have bogged down in the sheer complexity of the problem. Adding non-blocking behavior to read() — where, unlike with asynchronous I/O, a request can simply fail if it cannot be satisfied immediately — is far simpler.
So the end result would appear to be that we will get a new set of Linux-specific system calls allowing applications to request non-blocking read() behavior on regular files. The rate of change on this patch set is slowing — though it is worth noting that readv2() and writev2() have been removed from the latest version (as of this writing) of the patch set. It is getting late to have this code ready for the 3.18 development cycle, but it should be more than ready for 3.19.
The BPF system call API, version 14
Things happen quickly in the Berkeley Packet Filter (BPF) world. LWN last looked at this work in July, when version 2 of the patch set adding the bpf() system call had been posted. Two months later, this work is up to version 14; quite a bit has been changed and some functionality has been removed in an attempt to make the patches small enough for reviewers to cope with. At this point, though, the core system call may be getting close to ready for entry into the mainline. It seems like a good time for another look at this significant addition to the kernel's functionality, with fervent hopes that it doesn't change yet again.

BPF developer Alexei Starovoitov has certainly been energetic in his efforts to get this work in condition for merging — the posting of twelve versions in two months, many with significant changes, testifies to that. He has been responsive to requests for changes, but, as this complaint suggests, some developers have found him to be a little too pushy. That has not stopped some of his work from getting into the mainline, though, and, in the end, should not be a real impediment to the eventual merging of the rest.
As with previous versions, the BPF functionality is accessed by way of a single multiplexor system call, but that call has changed significantly:
    #include <linux/bpf.h>

    int bpf(int cmd, union bpf_attr *attr, unsigned int size);
The key change, made following a suggestion from Ingo Molnar, is to create a single large union type holding all of the possible parameter types for the various operations supported by bpf(). How that union is used depends on the specific command given to the system call.
Most of the operations in the current patch set are concerned with the management of maps — arrays of data that can be shared between a BPF program and user space. The process starts with the creation of a map, done with the BPF_MAP_CREATE command. With this command, the system call expects the relevant information to be in this member of the bpf_attr union:
    struct { /* anonymous struct used by BPF_MAP_CREATE command */
        __u32 map_type;
        __u32 key_size;    /* size of key in bytes */
        __u32 value_size;  /* size of value in bytes */
        __u32 max_entries; /* max number of entries in a map */
    };
The map_type field describes the type of the map. The plan is to have a wide range of types, including hashed arrays, ordinary arrays, bloom filters, and radix trees. The current patch set nominally supports only the hash type, but even that implementation is missing from the actual submission. The key_size and value_size parameters tell the code how large the keys and associated values will be, while max_entries puts an upper bound on the number of items that can be stored in a map.
When a call is made to bpf() to create a map, everything in the bpf_attr union beyond the above structure must be set to zero, and size should be the size of the union as a whole. These rules, which apply to all bpf() operations, are enforced in the code; the purpose is to allow the addition of more information to this union to support future enhancements to BPF functionality. If new fields are added, newer applications can provide the needed information, while older applications, which know nothing of those fields, will simply pass zeroes there, and the right thing will happen.
Upon successful creation of a map, the return value from bpf() will be an open file descriptor which can be used to refer to that map.
There is a set of commands to operate on individual entries in a map; they all use this structure within the bpf_attr union:
    struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
        __u32 map_fd;
        __aligned_u64 key;
        union {
            __aligned_u64 value;
            __aligned_u64 next_key;
        };
    };
For all operations, map_fd is the file descriptor referring to the map to be used, and key is a pointer to the key of interest. To store an item in the map, the BPF_MAP_UPDATE_ELEM command should be used; in this case, value should be a pointer to the data to be stored. To look up an item, use BPF_MAP_LOOKUP_ELEM; if the item is present in the map, its value will be stored in the location pointed to by value. Items can be deleted with BPF_MAP_DELETE_ELEM.
Iterating through a map is done with BPF_MAP_GET_NEXT_KEY; it will return the next key following the provided key. The meaning of "next" is dependent on the type of the map. Should the given key not be found in the map, next_key will be set to the first key in the map, so a typical iteration is likely to be started by calling BPF_MAP_GET_NEXT_KEY with a nonsense key.
Note that there is no command to delete a map. Instead, the program that created the map need only close the associated file descriptor; when all descriptors are closed and no loaded BPF programs reference the map, it will be deleted.
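Pulling the map pieces together, the sketch below creates a small hash map from user space, stores and retrieves one element, walks the map with BPF_MAP_GET_NEXT_KEY, and finally drops the map by closing its file descriptor. Since the bpf() system call exists only in the patch set, the syscall number, the command and type values, and the local bpf_attr_sketch union are placeholders mirroring the structures quoted above rather than real ABI definitions.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Placeholder values; the real ones will come from <linux/bpf.h>. */
    #define NR_bpf_PLACEHOLDER    999
    #define BPF_MAP_CREATE        0
    #define BPF_MAP_LOOKUP_ELEM   1
    #define BPF_MAP_UPDATE_ELEM   2
    #define BPF_MAP_DELETE_ELEM   3
    #define BPF_MAP_GET_NEXT_KEY  4
    #define BPF_MAP_TYPE_HASH     1

    /* A local stand-in for union bpf_attr, mirroring the structures quoted
     * above; the real union is larger and lives in the kernel's headers. */
    union bpf_attr_sketch {
        struct {                  /* BPF_MAP_CREATE */
            uint32_t map_type;
            uint32_t key_size;
            uint32_t value_size;
            uint32_t max_entries;
        };
        struct {                  /* BPF_MAP_*_ELEM commands */
            uint32_t map_fd;
            uint64_t key;         /* pointer, passed as __aligned_u64 */
            union {
                uint64_t value;
                uint64_t next_key;
            };
        };
    };

    static long bpf(int cmd, union bpf_attr_sketch *attr, unsigned int size)
    {
        return syscall(NR_bpf_PLACEHOLDER, cmd, attr, size);
    }

    int main(void)
    {
        union bpf_attr_sketch attr;
        uint32_t k, cur, next;
        uint64_t v;

        /* Create a hash map with 4-byte keys and 8-byte values; unused
         * parts of the union must be zero. */
        memset(&attr, 0, sizeof(attr));
        attr.map_type    = BPF_MAP_TYPE_HASH;
        attr.key_size    = sizeof(k);
        attr.value_size  = sizeof(v);
        attr.max_entries = 64;
        int map_fd = bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
        if (map_fd < 0) {
            perror("BPF_MAP_CREATE");   /* ENOSYS until the call exists */
            return 1;
        }

        /* Store key 1 -> value 42, then look it back up. */
        k = 1;
        v = 42;
        memset(&attr, 0, sizeof(attr));
        attr.map_fd = map_fd;
        attr.key    = (uintptr_t)&k;
        attr.value  = (uintptr_t)&v;
        if (bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr)))
            perror("BPF_MAP_UPDATE_ELEM");
        v = 0;
        if (bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr)) == 0)
            printf("key %u -> value %llu\n", k, (unsigned long long)v);

        /* Iterate: start from a key that is not in the map. */
        cur = 0xffffffff;
        memset(&attr, 0, sizeof(attr));
        attr.map_fd   = map_fd;
        attr.key      = (uintptr_t)&cur;
        attr.next_key = (uintptr_t)&next;
        while (bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr)) == 0) {
            printf("map contains key %u\n", next);
            cur = next;               /* continue the walk from here */
        }

        /* No delete-map command: closing the last file descriptor (with no
         * loaded programs using the map) makes the map go away. */
        close(map_fd);
        return 0;
    }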
Loading a BPF program into the kernel is accomplished with the BPF_PROG_LOAD command. The relevant structure in this case is:
    struct { /* anonymous struct used by BPF_PROG_LOAD command */
        __u32 prog_type;
        __u32 insn_cnt;
        __aligned_u64 insns;      /* 'const struct bpf_insn *' */
        __aligned_u64 license;    /* 'const char *' */
        __u32 log_level;          /* verbosity level of eBPF verifier */
        __u32 log_size;           /* size of user buffer */
        __aligned_u64 log_buf;    /* user supplied 'char *' buffer */
    };
Here, prog_type describes the context in which a program is expected to be used; it controls which data and helper functions will be available to the program when it runs. BPF_PROG_TYPE_SOCKET is used for programs that will be attached to sockets, while BPF_PROG_TYPE_TRACING is for tracing filters. The size of the program (in instructions) is provided in insn_cnt, while insns points to the program itself. The license field points to a description of the license for the program; it may be used in the future to restrict some functionality to GPL-compatible programs.
All programs must pass the BPF verifier as part of the loading process. This verifier is meant to ensure that the program cannot do harm to the system as a whole. It will prevent accesses to arbitrary data, disallow programs that have loops, and more. Should a developer want to know why the verifier is rejecting a program, they can set up a logging buffer of length log_size, pointed to by log_buf. Actually turning on logging is done by setting log_level to a non-zero value.
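A similar sketch shows how the loading interface and the verifier log fit together. Again, the constants and the local structure are placeholders rather than real ABI definitions, and the instruction array is deliberately left zero-filled: an invalid program that the verifier will reject, which is exactly what is needed to see log_level, log_size, and log_buf in action.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define NR_bpf_PLACEHOLDER    999    /* stand-in syscall number */
    #define BPF_PROG_LOAD         5      /* stand-in command value */
    #define BPF_PROG_TYPE_SOCKET  1      /* stand-in program-type value */

    /* Local stand-in mirroring the BPF_PROG_LOAD structure quoted above,
     * which in the real API is one arm of union bpf_attr. */
    struct bpf_prog_load_sketch {
        uint32_t prog_type;
        uint32_t insn_cnt;
        uint64_t insns;       /* pointer to the instruction array */
        uint64_t license;     /* pointer to a license string */
        uint32_t log_level;   /* non-zero turns the verifier log on */
        uint32_t log_size;    /* size of the user-supplied buffer */
        uint64_t log_buf;     /* pointer to that buffer */
    };

    int main(void)
    {
        /* eBPF instructions are 64 bits each; a real program would be
         * generated by a compiler or hand-assembled.  Zeroes form an
         * invalid program, which the verifier will duly reject. */
        uint64_t insns[2] = { 0, 0 };
        char log[4096] = "";
        struct bpf_prog_load_sketch attr;

        memset(&attr, 0, sizeof(attr));
        attr.prog_type = BPF_PROG_TYPE_SOCKET;
        attr.insn_cnt  = 2;
        attr.insns     = (uintptr_t)insns;
        attr.license   = (uintptr_t)"GPL";
        attr.log_level = 1;
        attr.log_size  = sizeof(log);
        attr.log_buf   = (uintptr_t)log;

        long fd = syscall(NR_bpf_PLACEHOLDER, BPF_PROG_LOAD, &attr, sizeof(attr));
        if (fd < 0)
            fprintf(stderr, "program rejected; verifier log:\n%s\n", log);
        else
            printf("program loaded, fd %ld\n", fd);
        return 0;
    }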
Note that the "fixup" array found in early versions of the patch set is no longer present. That array indicated the instructions referring to BPF map file descriptors; said instructions were fixed to use internal pointers by the verifier. Current versions of the patch set, instead, define new BPF instructions for map access. The verifier can recognize those instructions directly, so user space is no longer required to point them out.
In the v14 patch set, there is no way to actually attach BPF programs to interesting events once they are loaded. Such features are meant to be added once the basic BPF functionality has gotten through review and found its way into the mainline. That point seems to be getting closer; the developers who have taken an interest in the API seem to be increasingly happy with what they have. A 3.18 merge seems ambitious at this point, but 3.19 might be a real possibility.
Update: this series was accepted into the net-next tree on September 26, so it almost certainly will show up in 3.18.
Who wrote 3.15 through 3.17
When writing up the 3.14 development statistics, your editor publicly wondered if compiling those reports for every development cycle made sense. That question was followed by a bit of a break; there were no "who wrote..." articles for the 3.15 or 3.16 development cycles. Now that just over six months have passed and the 3.17 kernel is nearing release, it seems like it may be time to take another look at how the kernel development process is working.

Since 3.14, kernel release activity has looked like this:
    Version     Date      CSets    Devs    Lines added   Lines removed        Delta
                                           (thousands)     (thousands)  (thousands)
    3.15        Jun 8    13,722    1492        1066            707             360
    3.16        Aug 3    12,804    1478         578            329             249
    3.17        Sep 28*  12,153    1408         692            708             -16
    3.15–3.17            38,679    2546        2336           1744             593

    (* anticipated release date at press time)
A few interesting things jump out of these numbers. The 3.12 cycle had contributions from 1257 developers; by 3.13, that had increased to 1339, and 3.14 had patches from exactly 1400 developers. So the count of developers contributing to each kernel release, which had hovered in the 1200s for some time, has shown a significant increase. The active kernel development community continues to grow.
The kernel itself also continues to grow, but 3.17 looks like a rare exception. Thanks to the removal of a bunch of unloved code from the staging tree, 3.17 is actually smaller than its predecessor. That has only happened one other time in the history of the Linux kernel; 2.6.36 was smaller than 2.6.35 thanks to the removal of a pile of defconfig files. The overall trend remains unchanged, though; the kernel grew by almost 600,000 lines in the last three releases.
As of 3.17-rc6, Linus was thinking that he would be able to do the 3.17 final release on September 28. Should that schedule hold, the 3.17 kernel will have been produced in a mere 56 days — as was 3.16. Your editor has remarked on the trend of the shortening kernel release cycle for a while. Here is what that trend looks like now (again, assuming the 3.17 release is not delayed):
So the kernel development cycle, it seems, continues to get shorter. How much longer that trend can continue is unclear, though; there must be a minimum period required to get a high-quality release together. One other potentially interesting point: the final stabilization of the 3.15 release overlapped with the 3.16 merge window. That overlap had little to do with why the 3.15 cycle took longer than many others; the delay was the result of some difficult-to-find last-minute bugs. But one could argue that the 3.16 development cycle should really be counted as being one week longer than the release dates would indicate.
Contributors
As can be seen from the table above, 38,679 non-merge changesets were pulled into the mainline repository for the 3.15–3.17 development cycles. Of the 2546 developers who contributed changes during this time, the most active were:
Most active developers, 3.15–3.17
    By changesets
        Hartley Sweeten        919   2.4%
        Jes Sorensen           767   2.0%
        Malcolm Priestley      544   1.4%
        Fabian Frederick       382   1.0%
        Navin Patidar          378   1.0%
        Laurent Pinchart       330   0.9%
        Sachin Kamat           327   0.8%
        Russell King           316   0.8%
        Axel Lin               301   0.8%
        Johan Hedberg          300   0.8%
        Geert Uytterhoeven     296   0.8%
        Daniel Vetter          278   0.7%
        Takashi Iwai           275   0.7%
        Jingoo Han             265   0.7%
        Thomas Gleixner        260   0.7%
        Alexander Shiyan       240   0.6%
        Ville Syrjälä          235   0.6%
        Joe Perches            233   0.6%
        Tejun Heo              231   0.6%
        Lars-Peter Clausen     226   0.6%
    By changed lines
        Tomi Valkeinen          318894   10.9%
        Kristina Martšenko      165102    5.6%
        Larry Finger            164869    5.6%
        Andrzej Pietrasiewicz   108036    3.7%
        Mauro Carvalho Chehab    71253    2.4%
        Greg Kroah-Hartman       68260    2.3%
        Dave Chinner             48267    1.6%
        Devin Heitmueller        46125    1.6%
        Malcolm Priestley        35231    1.2%
        Jes Sorensen             29412    1.0%
        Navin Patidar            28871    1.0%
        Hans Verkuil             27813    0.9%
        Ben Skeggs               26293    0.9%
        Mark Hounschell          24285    0.8%
        Ken Cox                  23213    0.8%
        Hartley Sweeten          21246    0.7%
        Jason Cooper             20344    0.7%
        Linus Walleij            19898    0.7%
        Jake Edge                18218    0.6%
        Maxime Ripard            14669    0.5%
As is usually the case, Hartley Sweeten contributed more changesets than any other developer; all of those were against the COMEDI drivers in the staging tree. All told, nearly 6,000 patches have been applied against just that subsystem since its entry into staging. Jes Sorensen's work was nearly all against the rtl8723au driver, while Malcolm Priestley worked on the vt6656 driver; both of those drivers are also in the staging tree. Fabian Frederick contributed cleanups throughout the kernel tree, while Navin Patidar focused on the rtl8188eu driver which, unsurprisingly, is also in the staging tree.
In the "lines changed" column, Tomi Valkeinen reached the top with extensive work on the ARM OMAP architecture code and related device tree files. Kristina Martšenko removed 14 drivers from the staging tree, making her the developer who removed the most code during this time. Larry Finger continues his work to rationalize the Realtek wireless drivers in the staging tree, Andrzej Pietrasiewicz did a lot of work in the USB gadget driver, and Video4Linux subsystem maintainer Mauro Carvalho Chehab did extensive work throughout that tree.
The 3.15–3.17 development cycles saw contributions from at least 312 employers, the most active of whom were:
Most active employers, 3.15–3.17
    By changesets
        (None)                             4492   11.6%
        Intel                              4088   10.6%
        Red Hat                            3577    9.2%
        (Unknown)                          3409    8.8%
        Linaro                             1702    4.4%
        Samsung                            1646    4.3%
        SUSE                               1243    3.2%
        IBM                                1050    2.7%
        (Consultant)                       1016    2.6%
        Texas Instruments                   942    2.4%
        Vision Engraving Systems            919    2.4%
                                            763    2.0%
        Renesas Electronics                 753    1.9%
        Free Electrons                      753    1.9%
        Freescale                           620    1.6%
        C-DAC                               400    1.0%
        Oracle                              390    1.0%
        Imagination Technologies            361    0.9%
        NVidia                              355    0.9%
        FOSS Outreach Program for Women     336    0.9%
    By lines changed
        (None)                             408176   13.9%
        Texas Instruments                  357058   12.2%
        (Unknown)                          338760   11.6%
        Red Hat                            259264    8.8%
        Samsung                            249613    8.5%
        Intel                              180869    6.2%
        Linaro                              93125    3.2%
        Linux Foundation                    68988    2.4%
        SUSE                                52213    1.8%
        (Consultant)                        45952    1.6%
        IBM                                 44809    1.5%
        Free Electrons                      42917    1.5%
        Cisco                               33254    1.1%
        Freescale                           32636    1.1%
        C-DAC                               30405    1.0%
        Renesas Electronics                 29973    1.0%
                                            29957    1.0%
        Realtek                             27888    1.0%
        NVidia                              27232    0.9%
        COMPRO Intelligent Solutions        24722    0.8%
As usual, this picture has remained relatively stable from one release to the next. Mildly notable is the increase in contributions from developers working on their own time, though it would be hard to say that the long-term trend toward decreasing volunteer contributions has ended at this point.
Reviews and conclusion
Finally, it can be interesting to look at who is attaching Reviewed-by tags to patches. That tag is meant both as an indicator that the patch has been reviewed and a means for crediting developers who perform those reviews. The developers with the most Reviewed-by tags during this period were:
Developers with the most Reviewed-by tags

    Ian Abbott                 766   11.0%
    Josh Triplett              207    3.0%
    Tomasz Figa                155    2.2%
    Christoph Hellwig          142    2.0%
    Ville Syrjälä              132    1.9%
    Chris Wilson               123    1.8%
    Johannes Berg              122    1.8%
    Jesse Barnes               103    1.5%
    Guenter Roeck               98    1.4%
    Pieter-Paul Giesberts       92    1.3%
    David Herrmann              87    1.3%
    Dave Chinner                86    1.2%
    Hartley Sweeten             86    1.2%
    Imre Deak                   84    1.2%
    Alex Elder                  84    1.2%
    Rodrigo Vivi                80    1.2%
    Alex Deucher                74    1.1%
    Damien Lespiau              73    1.1%
    Daniel (Deognyoun) Kim      71    1.0%
    Franky (Zhenhui) Lin        66    1.0%
Ian Abbott, it seems, has reviewed 766 patches in the 182 days covered by these three development cycles — just over four patches every day, with no breaks for weekends or holidays. It turns out that almost all of those patches were the COMEDI changes submitted by Hartley Sweeten. Josh Triplett, instead, reviewed a wide range of changes from many developers; most of those were changes related to or involving read-copy-update. Tomasz Figa concerns himself with ARM-related changes, Christoph Hellwig is a longstanding reviewer of storage- and filesystem-related patches, and many of the other names on the list (Ville Syrjälä and Chris Wilson among them) review changes to DRM (graphics) drivers.
What is not reflected here, of course, is the vast amount of patch review work that never results in a Reviewed-by tag. In fact, your editor would assert that this mechanism is not working as intended at this point. It is failing to document the bulk of the review work that is being done and serves mostly to highlight which developers make the effort to offer an explicit Reviewed-by tag.
To summarize: what has changed in the six months since LWN last published a set of development statistics? The answer is "not much." The kernel development process continues to roll along, producing releases on a fairly predictable schedule. The pace continues to increase, the community continues to grow, and the development cycle continues to shorten. These are all trends that we have seen for a while, so, to a great extent, it all looks like business as usual.
Patches and updates
Kernel trees
Architecture-specific
Build system
Core kernel code
Development tools
Device drivers
Device driver infrastructure
Documentation
Filesystems and block I/O
Memory management
Networking
Miscellaneous
Page editor: Jonathan Corbet