
Kernel development

Brief items

Kernel release status

The current development kernel is 3.17-rc6, released on September 21. Linus said: "It's been quiet - enough so that coupled with my upcoming travel, this might just be the last -rc, and final 3.17 might be next weekend."

Stable updates: no stable updates have been released in the last week.


Quotes of the week

The problem is that the current code enqueues the same structure onto up to four different lists, and we don't have a quantum computer, so head.next can't point to four different places.
Paul McKenney

I seldom use printk these days. It's far too limited in its uses. For one, most things worth debugging happen thousands of times a second, and printk will just slow things down to a crawl if it is used.
Steven Rostedt


Kernel development news

Non-blocking buffered file read operations

By Jonathan Corbet
September 23, 2014
It is natural to think of buffered I/O on Unix-like systems as being asynchronous. In Linux, the page cache exists to separate process-level I/O requests from the physical requests sent to the storage media. But, in truth, some operations are synchronous; in particular, a read operation cannot complete until the data is actually read into the page cache. So a call to read() on a normal file descriptor can always block; most of the time this blocking causes no difficulties, but it can be problematic for programs that need to always be responsive. Now, a partial solution is in the works for this kind of code, but it comes at the cost of adding as many as four new system calls.

The problem of blocking buffered reads is not new, of course, so applications have worked around it in a number of ways. One common approach is to create a set of threads dedicated to performing buffered I/O. Those threads can block while other threads in the program continue to do other work. This solution works and can be efficient, but it inevitably adds a certain amount of inter-thread communication overhead, especially in cases where the desired data is already in the page cache and a read() call could have completed immediately.

A recent patch set from Milosz Tanski attempts to solve the problem a different way. Milosz's approach is to allow a program to request non-blocking behavior at the level of a single read() call. Unfortunately, the current read() and variants do not have a "flags" argument, so there is no way to express that request using them. So Milosz adds two new versions of each of read() and write():

    int readv2(unsigned long fd, struct iovec *vec, unsigned long vlen, int flags);
    int writev2(unsigned long fd, struct iovec *vec, unsigned long vlen, int flags);
    int preadv2(unsigned long fd, struct iovec *vec, unsigned long vlen,
		unsigned long pos_l, unsigned long pos_h, int flags);
    int pwritev2(unsigned long fd, struct iovec *vec, unsigned long vlen,
		 unsigned long pos_l, unsigned long pos_h, int flags);

In each case, the system call is just like its predecessor with the exception of the addition of the flags argument. Note that the two offset parameters (pos_l and pos_h) to preadv2() and pwritev2() will be combined into a single off_t parameter at the C library level.
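As a rough illustration of that combination step, the sketch below splits a 64-bit file offset into two halves and joins them again. It assumes 32-bit halves, as on an architecture where unsigned long is 32 bits wide; the helper names split_offset() and join_offset() are hypothetical and are not taken from the patch set or from any C library.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers: neither name comes from the patch set or glibc. */

/* Split a 64-bit offset into the low and high 32-bit halves that a
 * 32-bit system call interface would carry as pos_l and pos_h. */
static void split_offset(int64_t off, uint32_t *pos_l, uint32_t *pos_h)
{
    assert(off >= 0);                       /* file offsets are non-negative */
    *pos_l = (uint32_t)((uint64_t)off & 0xffffffffu);
    *pos_h = (uint32_t)((uint64_t)off >> 32);
}

/* Rebuild the 64-bit offset on the receiving side. */
static int64_t join_offset(uint32_t pos_l, uint32_t pos_h)
{
    return (int64_t)(((uint64_t)pos_h << 32) | pos_l);
}
```

An application would normally never see the two halves; it would pass a single offset and let the library wrapper do the splitting.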

In Milosz's patch set, the only supported flag is RWF_NONBLOCK, which requests non-blocking operation. If a read request is accompanied by this flag, it will only complete if (at least some of) the requested data is already in the page cache; otherwise it returns EAGAIN. The current patch does not start any sort of readahead operation if it is unable to satisfy a non-blocking read request. The new write operations do not support non-blocking operation; the flags argument must be zero when calling them. Adding non-blocking behavior to write() is possible; such a write would only complete if memory were immediately available for a copy of the data in the page cache. But that implementation has been left as a future exercise.
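The read-side semantics described above can be modeled in user space. What follows is only a toy: the "page cache" is an array of flags, the page size is an assumed 4096 bytes, and every name beginning with toy_ or TOY_ is hypothetical. It shows the three outcomes the text implies: a full read, a short read that stops at the first uncached page, and EAGAIN when nothing at the start of the range is cached.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u   /* assumed page size for the model */
#define TOY_NPAGES    8u      /* size of the pretend file, in pages */

/* Toy stand-in for the page cache: resident[i] says whether page i of
 * a pretend file is cached. Nothing here is kernel code. */
struct toy_file {
    bool resident[TOY_NPAGES];
};

/* Model of a read with RWF_NONBLOCK as the article describes it:
 * return the number of bytes that can be served from cached pages
 * starting at 'off' (possibly fewer than 'len'), or -EAGAIN if the
 * first page needed is not cached. No readahead is started. */
static long toy_read_nonblock(const struct toy_file *f, size_t off, size_t len)
{
    size_t file_size = (size_t)TOY_NPAGES * TOY_PAGE_SIZE;
    size_t end = off + len;
    size_t pos = off;

    assert(f != NULL);
    if (end > file_size)
        end = file_size;        /* clamp the request to end of file */

    while (pos < end && f->resident[pos / TOY_PAGE_SIZE]) {
        /* advance to the end of this page or of the request */
        size_t page_end = (pos / TOY_PAGE_SIZE + 1) * TOY_PAGE_SIZE;
        pos = page_end < end ? page_end : end;
    }

    if (pos == off && off < end)
        return -EAGAIN;         /* nothing cached at the start of the range */
    return (long)(pos - off);   /* full read, short read, or 0 at EOF */
}
```

A caller that gets a short read or -EAGAIN from a function like this would then decide whether to retry later or to hand the remainder to a blocking path.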

Considering the alternatives

The patch is relatively simple and straightforward, but one might well wonder: why is it necessary to add a new set of system calls for non-blocking operation when the kernel has long supported this mode via either the O_NONBLOCK flag to open() or fcntl()? There are, it seems, a couple of reasons for not wanting to implement ordinary non-blocking I/O behavior for regular files; the first is that doing so would break applications.

Given that non-blocking I/O is an optional behavior that must be explicitly requested, it is not obvious that supporting it for regular files would create trouble. It comes down to the fact that passing O_NONBLOCK to an open() call actually requests two different things: (1) that the open() call, itself, not block, and (2) that subsequent I/O be non-blocking. There are applications that use O_NONBLOCK for the first reason; Samba uses it, for example, to keep an open() call from blocking in the presence of locks on the file. Since buffered reads have always blocked regardless of O_NONBLOCK, applications do not concern themselves with calling fcntl() to reset the flag before calling read(). If those read() calls start returning EAGAIN, the application, which is not expecting that behavior, will fail.

One could argue that this behavior is incorrect, but it has worked for decades; breaking these applications with a kernel change is not acceptable. Samba is not the only application to run into trouble here; evidently squid and GQview fail as well. So the problem is clearly real.

Beyond that, as Volker Lendecke explained, full non-blocking behavior would not play well with how applications like Samba want to use this feature. The wish is to attempt to read the needed data in the non-blocking mode; should the data not be available, the request will be turned over to the thread pool for synchronous execution. If the thread pool is using the same file descriptor, its attempts to perform blocking reads will fail. If it uses a different file descriptor, it can run into weird problems relating to the surprising semantics of POSIX file locks (see this article for more information). So the ability to request non-blocking behavior on a per-read basis is needed.

Another possibility would be to add a version of the fincore() system call, which allows a process to ask the kernel whether a specific range of file data is present in the page cache. Patches adding fincore() have been around since at least 2010. But fincore() is seen as a bit of an indirect route toward the real goal, and there is always the possibility that the situation might change between a call to fincore() and the application's decision to do something based on the result. Requesting non-blocking behavior with the read() avoids that particular race condition.

Finally, one could also consider the kernel's asynchronous I/O subsystem, which allows an application to obtain non-blocking behavior on a per-request basis. But asynchronous I/O has never been supported for buffered I/O, and attempts to add that functionality have bogged down in the sheer complexity of the problem. Adding non-blocking behavior to read() — where, unlike with asynchronous I/O, a request can simply fail if it cannot be satisfied immediately — is far simpler.

So the end result would appear to be that we will get a new set of Linux-specific system calls allowing applications to request non-blocking read() behavior on regular files. The rate of change on this patch set is slowing — though it is worth noting that readv2() and writev2() have been removed from the latest version (as of this writing) of the patch set. It is getting late to have this code ready for the 3.18 development cycle, but it should be more than ready for 3.19.


The BPF system call API, version 14

By Jonathan Corbet
September 24, 2014
Things happen quickly in the Berkeley Packet Filter (BPF) world. LWN last looked at this work in July, when version 2 of the patch set adding the bpf() system call had been posted. Two months later, this work is up to version 14; quite a bit has been changed and some functionality has been removed in an attempt to make the patches small enough for reviewers to cope with. At this point, though, the core system call may be getting close to ready for entry into the mainline. It seems like a good time for another look at this significant addition to the kernel's functionality, with fervent hopes that it doesn't change yet again.

BPF developer Alexei Starovoitov has certainly been energetic in his efforts to get this work in condition for merging — the posting of twelve versions in two months, many with significant changes, testifies to that. He has been responsive to requests for changes, but, as this complaint suggests, some developers have found him to be a little too pushy. That has not stopped some of his work from getting into the mainline, though, and, in the end, should not be a real impediment to the eventual merging of the rest.

As with previous versions, the BPF functionality is accessed by way of a single multiplexor system call, but that call has changed significantly:

   #include <linux/bpf.h>

   int bpf(int cmd, union bpf_attr *attr, unsigned int size);

The key change, made following a suggestion from Ingo Molnar, is to create a single large union type holding all of the possible parameter types for the various operations supported by bpf(). How that union is used depends on the specific command given to the system call.

Most of the operations in the current patch set are concerned with the management of maps — arrays of data that can be shared between a BPF program and user space. The process starts with the creation of a map, done with the BPF_MAP_CREATE command. With this command, the system call expects the relevant information to be in this member of the bpf_attr union:

    struct { /* anonymous struct used by BPF_MAP_CREATE command */
	__u32             map_type;
	__u32             key_size;    /* size of key in bytes */
	__u32             value_size;  /* size of value in bytes */
	__u32             max_entries; /* max number of entries in a map */
    };

The map_type field describes the type of the map. The plan is to have a wide range of types, including hashed arrays, ordinary arrays, bloom filters, and radix trees. The current implementation claims to only support the hash type, but even that implementation is missing from the actual submission. The key_size and value_size parameters tell the code how large the keys and associated values will be, while max_entries puts an upper bound on the number of items that can be stored in a map.

When a call is made to bpf() to create a map, everything in the bpf_attr union beyond the above structure must be set to zero, and size should be the size of the union as a whole. These rules, which apply to all bpf() operations, are enforced in the code; the purpose is to allow the addition of more information to this union to support future enhancements to BPF functionality. If new fields are added, newer applications can provide the needed information. Older applications, instead, will have to pass zeroes in those fields, so the right thing will happen.
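A minimal sketch of that zero-fill rule follows. It uses a cut-down, hypothetical copy of the union (toy_bpf_attr, with named rather than anonymous members) instead of the real <linux/bpf.h>, and the check in toy_tail_is_zero() is only a guess at the kind of test the kernel side might apply; the article does not show the actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy copy of part of the union described in the article; this is not
 * the kernel's <linux/bpf.h>. */
union toy_bpf_attr {
    struct {                    /* fields used by BPF_MAP_CREATE */
        uint32_t map_type;
        uint32_t key_size;
        uint32_t value_size;
        uint32_t max_entries;
    } create;
    struct {                    /* stand-in for larger members */
        uint64_t words[6];
    } other;
};

/* Prepare a map-creation request: zero the whole union first so that
 * every byte past the fields we set is zero, then fill in the fields. */
static void toy_fill_map_create(union toy_bpf_attr *attr, uint32_t map_type,
                                uint32_t key_size, uint32_t value_size,
                                uint32_t max_entries)
{
    assert(attr != NULL);
    memset(attr, 0, sizeof(*attr));
    attr->create.map_type = map_type;
    attr->create.key_size = key_size;
    attr->create.value_size = value_size;
    attr->create.max_entries = max_entries;
}

/* The kind of check a receiver could apply: every byte from 'used'
 * up to 'size' must be zero. */
static bool toy_tail_is_zero(const void *attr, size_t used, size_t size)
{
    const unsigned char *p = attr;

    for (size_t i = used; i < size; i++)
        if (p[i] != 0)
            return false;
    return true;
}
```

Zeroing the whole union before filling it in is the easy way for an application to stay correct if fields are appended later.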

Upon successful creation of a map, the return value from bpf() will be an open file descriptor which can be used to refer to that map.

There is a set of commands to operate on individual entries in a map; they all use this structure within the bpf_attr union:

    struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
	__u32             map_fd;
	__aligned_u64     key;
	union {
	    __aligned_u64 value;
	    __aligned_u64 next_key;
	};
    };

For all operations, map_fd is the file descriptor referring to the map to be used, and key is a pointer to the key of interest. To store an item in the map, the BPF_MAP_UPDATE_ELEM command should be used; in this case, value should be a pointer to the data to be stored. To look up an item, use BPF_MAP_LOOKUP_ELEM; if the item is present in the map, its value will be stored in the location pointed to by value. Items can be deleted with BPF_MAP_DELETE_ELEM.

Iterating through a map is done with BPF_MAP_GET_NEXT_KEY; it will return the next key following the provided key. The meaning of "next" is dependent on the type of the map. Should the given key not be found in the map, next_key will be set to the first key in the map, so a typical iteration is likely to be started by calling BPF_MAP_GET_NEXT_KEY with a nonsense key.
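To make the element commands and the iteration idiom concrete, here is a small user-space model. It is not the kernel implementation: the toy_ names, the fixed four-slot layout, the choice of slot order as the meaning of "next", and the error codes returned are all assumptions made for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ENTRIES 4

/* Toy user-space model of a map; slot order defines what "next" means.
 * None of this is kernel code. */
struct toy_map {
    bool     used[TOY_MAX_ENTRIES];
    uint32_t key[TOY_MAX_ENTRIES];
    uint64_t value[TOY_MAX_ENTRIES];
};

/* Return the slot holding 'key', or -1 if the key is absent. */
static int toy_find(const struct toy_map *m, uint32_t key)
{
    for (int i = 0; i < TOY_MAX_ENTRIES; i++)
        if (m->used[i] && m->key[i] == key)
            return i;
    return -1;
}

/* Modeled on BPF_MAP_UPDATE_ELEM: store a new value or overwrite an
 * existing one. The -E2BIG error for a full map is this model's choice. */
static int toy_update(struct toy_map *m, uint32_t key, uint64_t value)
{
    int slot = toy_find(m, key);

    for (int i = 0; slot < 0 && i < TOY_MAX_ENTRIES; i++)
        if (!m->used[i])
            slot = i;               /* first free slot */
    if (slot < 0)
        return -E2BIG;
    m->used[slot] = true;
    m->key[slot] = key;
    m->value[slot] = value;
    return 0;
}

/* Modeled on BPF_MAP_LOOKUP_ELEM: copy the value out if the key exists. */
static int toy_lookup(const struct toy_map *m, uint32_t key, uint64_t *value)
{
    int slot = toy_find(m, key);

    if (slot < 0)
        return -ENOENT;
    *value = m->value[slot];
    return 0;
}

/* Modeled on BPF_MAP_DELETE_ELEM. */
static int toy_delete(struct toy_map *m, uint32_t key)
{
    int slot = toy_find(m, key);

    if (slot < 0)
        return -ENOENT;
    m->used[slot] = false;
    return 0;
}

/* Modeled on BPF_MAP_GET_NEXT_KEY: an unknown key yields the first key.
 * Returning -ENOENT at the end is this model's choice. */
static int toy_get_next_key(const struct toy_map *m, uint32_t key,
                            uint32_t *next_key)
{
    for (int i = toy_find(m, key) + 1; i < TOY_MAX_ENTRIES; i++) {
        if (m->used[i]) {
            *next_key = m->key[i];
            return 0;
        }
    }
    return -ENOENT;                 /* no keys after this one */
}

/* Count entries by starting from a key known to be absent. */
static int toy_count(const struct toy_map *m, uint32_t absent_key)
{
    uint32_t cur = absent_key, next;
    int n = 0;

    assert(toy_find(m, absent_key) < 0);
    while (toy_get_next_key(m, cur, &next) == 0) {
        n++;
        cur = next;
    }
    return n;
}
```

The toy_count() loop is the "nonsense key" idiom from the text: the first call is made with a key that cannot be found, and each later call passes the key returned by the previous one.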

Note that there is no command to delete a map. Instead, the program that created the map need only close the associated file descriptor; when all descriptors are closed and no loaded BPF programs reference the map, it will be deleted.
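The lifetime rule can be pictured as two reference counts, one for open file descriptors and one for loaded programs. The sketch below is a toy model under that assumption; the structure and function names are hypothetical, and the article does not say how the kernel actually tracks these references.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of map lifetime; not kernel code. */
struct toy_map_obj {
    int  fd_refs;    /* open file descriptors referring to the map */
    int  prog_refs;  /* loaded programs that use the map */
    bool freed;      /* set once the last reference is gone */
};

/* Free the map only when neither kind of reference remains. */
static void toy_maybe_free(struct toy_map_obj *m)
{
    if (m->fd_refs == 0 && m->prog_refs == 0)
        m->freed = true;   /* a real implementation would release memory */
}

/* Drop one file descriptor reference, as close() would. */
static void toy_close_fd(struct toy_map_obj *m)
{
    assert(m->fd_refs > 0);
    m->fd_refs--;
    toy_maybe_free(m);
}

/* Drop one program reference, as unloading a program would. */
static void toy_prog_unload(struct toy_map_obj *m)
{
    assert(m->prog_refs > 0);
    m->prog_refs--;
    toy_maybe_free(m);
}
```

In this model, closing the creator's descriptor while a program still references the map leaves the map alive until that program goes away too.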

Loading a BPF program into the kernel is accomplished with the BPF_PROG_LOAD command. The relevant structure in this case is:

    struct { /* anonymous struct used by BPF_PROG_LOAD command */
	__u32         prog_type;
	__u32         insn_cnt;
	__aligned_u64 insns;     /* 'const struct bpf_insn *' */
	__aligned_u64 license;   /* 'const char *' */
	__u32         log_level; /* verbosity level of eBPF verifier */
	__u32         log_size;  /* size of user buffer */
	__aligned_u64 log_buf;   /* user supplied 'char *' buffer */
    };

Here, prog_type describes the context in which a program is expected to be used; it controls which data and helper functions will be available to the program when it runs. BPF_PROG_TYPE_SOCKET is used for programs that will be attached to sockets, while BPF_PROG_TYPE_TRACING is for tracing filters. The size of the program (in instructions) is provided in insn_cnt, while insns points to the program itself. The license field points to a description of the license for the program; it may be used in the future to restrict some functionality to GPL-compatible programs.

All programs must pass the BPF verifier as part of the loading process. This verifier is meant to ensure that the program cannot do harm to the system as a whole. It will prevent accesses to arbitrary data, disallow programs that have loops, and more. Should a developer want to know why the verifier is rejecting a program, they can set up a logging buffer of length log_size, pointed to by log_buf. Actually turning on logging is done by setting log_level to a non-zero value.
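One simple way to guarantee the absence of loops is to refuse any jump that does not move strictly forward. The sketch below does only that, on a made-up instruction type; it is a deliberately conservative stand-in, not the algorithm used by the BPF verifier, and toy_insn is not the real struct bpf_insn. Jump offsets are assumed to be relative to the following instruction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy instruction: only the control-flow part matters here. This is
 * not the real struct bpf_insn. */
struct toy_insn {
    bool is_jump;   /* true if the instruction may branch */
    int  off;       /* branch target, relative to the next instruction */
};

/* Conservative check: reject any backward jump (a backward edge is
 * what makes a loop possible) and any jump that leaves the program. */
static bool toy_has_no_loops(const struct toy_insn *prog, size_t cnt)
{
    assert(prog != NULL || cnt == 0);

    for (size_t i = 0; i < cnt; i++) {
        if (!prog[i].is_jump)
            continue;
        if (prog[i].off < 0)
            return false;                       /* backward jump */
        if (i + 1 + (size_t)prog[i].off >= cnt)
            return false;                       /* target outside program */
    }
    return true;
}
```

With only forward jumps, execution always advances toward the end of the program, so it must terminate after at most as many steps as there are instructions.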

Note that the "fixup" array found in early versions of the patch set is no longer present. That array indicated the instructions referring to BPF map file descriptors; said instructions were fixed to use internal pointers by the verifier. Current versions of the patch set, instead, define new BPF instructions for map access. The verifier can recognize those instructions directly, so user space is no longer required to point them out.

In the v14 patch set, there is no way to actually attach BPF programs to interesting events once they are loaded. Such features are meant to be added once the basic BPF functionality has gotten through review and found its way into the mainline. That point seems to be getting closer; the developers who have taken an interest in the API seem to be increasingly happy with what they have. A 3.18 merge seems ambitious at this point, but 3.19 might be a real possibility.

Update: this series was accepted into the net-next tree on September 26, so it almost certainly will show up in 3.18.


Who wrote 3.15 through 3.17

By Jonathan Corbet
September 24, 2014
When writing up the 3.14 development statistics, your editor publicly wondered if compiling those reports for every development cycle made sense. That question was followed by a bit of a break; there were no "who wrote..." articles for the 3.15 or 3.16 development cycles. Now that just over six months have passed and the 3.17 kernel is nearing release, it seems like it may be time to take another look at how the kernel development process is working.

Since 3.14, kernel release activity has looked like this:

                                       Lines (thousands)
Version   Date       CSets    Devs    Added   Removed   Delta
3.15      Jun 8     13,722    1492     1066       707     360
3.16      Aug 3     12,804    1478      578       329     249
3.17      Sep 28*   12,153    1408      692       708     -16
3.15–17             38,679    2546     2336      1744     593

A few interesting things jump out of these numbers. The 3.12 cycle had contributions from 1257 developers. By 3.13, that had increased to 1339, and 3.14 had patches from exactly 1400 developers. So the count of developers contributing to each kernel release, which had hovered in the 1200's for some time, has shown a significant increase. The active kernel development community continues to grow.

The kernel itself also continues to grow, but 3.17 looks like a rare exception. Thanks to the removal of a bunch of unloved code from the staging tree, 3.17 is actually smaller than its predecessor. That has only happened one other time in the history of the Linux kernel; 2.6.36 was smaller than 2.6.35 thanks to the removal of a pile of defconfig files. The overall trend remains unchanged, though; the kernel grew by almost 600,000 lines in the last three releases.

As of 3.17-rc6, Linus was thinking that he would be able to do the 3.17 final release on September 28. Should that schedule hold, the 3.17 kernel will have been produced in a mere 56 days — as was 3.16. Your editor has remarked on the trend of the shortening kernel release cycle for a while. Here is what that trend looks like now (again, assuming the 3.17 release is not delayed):

[Development cycle length chart]

So the kernel development cycle, it seems, continues to get shorter. How much longer that trend can continue is unclear, though; there must be a minimum period required to get a high-quality release together. One other potentially interesting point: it should be remembered that the final stabilization of the 3.15 release overlapped with the 3.16 merge window. That probably had little to do with why the 3.15 cycle took longer than many others; it was the result of some difficult-to-find last-minute bugs. But one could argue that the 3.16 development cycle should really be counted as being one week longer than the release dates would indicate.
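The 56-day figure quoted for 3.16 and 3.17 can be checked from the release dates in the table above. The sketch below does the date arithmetic for dates within 2014, which was not a leap year; the function names are hypothetical, and the September 28 date for 3.17 is the projected one.

```c
#include <assert.h>

/* Day of the year (1-based) for a date in a non-leap year such as 2014. */
static int day_of_year(int month, int day)
{
    static const int days_before[12] =
        { 0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334 };

    assert(month >= 1 && month <= 12);
    assert(day >= 1 && day <= 31);
    return days_before[month - 1] + day;
}

/* Days between two release dates that fall in the same non-leap year. */
static int cycle_days(int m1, int d1, int m2, int d2)
{
    return day_of_year(m2, d2) - day_of_year(m1, d1);
}
```

Calling cycle_days(6, 8, 8, 3) gives the length of the 3.16 cycle (June 8 to August 3), and cycle_days(8, 3, 9, 28) the projected length of the 3.17 cycle.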

Contributors

As can be seen from the table above, 38,679 non-merge changesets were pulled into the mainline repository for the 3.15–3.17 development cycles. Of the 2546 developers who contributed changes during this time, the most active were:

Most active developers, 3.15–3.17

By changesets
Hartley Sweeten            919   2.4%
Jes Sorensen               767   2.0%
Malcolm Priestley          544   1.4%
Fabian Frederick           382   1.0%
Navin Patidar              378   1.0%
Laurent Pinchart           330   0.9%
Sachin Kamat               327   0.8%
Russell King               316   0.8%
Axel Lin                   301   0.8%
Johan Hedberg              300   0.8%
Geert Uytterhoeven         296   0.8%
Daniel Vetter              278   0.7%
Takashi Iwai               275   0.7%
Jingoo Han                 265   0.7%
Thomas Gleixner            260   0.7%
Alexander Shiyan           240   0.6%
Ville Syrjälä              235   0.6%
Joe Perches                233   0.6%
Tejun Heo                  231   0.6%
Lars-Peter Clausen         226   0.6%

By changed lines
Tomi Valkeinen          318894  10.9%
Kristina Martšenko      165102   5.6%
Larry Finger            164869   5.6%
Andrzej Pietrasiewicz   108036   3.7%
Mauro Carvalho Chehab    71253   2.4%
Greg Kroah-Hartman       68260   2.3%
Dave Chinner             48267   1.6%
Devin Heitmueller        46125   1.6%
Malcolm Priestley        35231   1.2%
Jes Sorensen             29412   1.0%
Navin Patidar            28871   1.0%
Hans Verkuil             27813   0.9%
Ben Skeggs               26293   0.9%
Mark Hounschell          24285   0.8%
Ken Cox                  23213   0.8%
Hartley Sweeten          21246   0.7%
Jason Cooper             20344   0.7%
Linus Walleij            19898   0.7%
Jake Edge                18218   0.6%
Maxime Ripard            14669   0.5%

As is usually the case, Hartley Sweeten contributed more changesets than any other developer; all of those were against the COMEDI drivers in the staging tree. All told, nearly 6,000 patches have been applied against just that subsystem since its entry into staging. Jes Sorensen's work was nearly all against the rtl8723au driver, while Malcolm Priestley worked on the vt6656 driver; both of those drivers are also in the staging tree. Fabian Frederick contributed cleanups throughout the kernel tree, while Navin Patidar focused on the rtl8188eu driver which, unsurprisingly, is also in the staging tree.

In the "lines changed" column, Tomi Valkeinen reached the top with extensive work on the ARM OMAP architecture code and related device tree files. Kristina Martšenko removed 14 drivers from the staging tree, making her the developer who removed the most code during this time. Larry Finger continues his work to rationalize the Realtek wireless drivers in the staging tree, Andrzej Pietrasiewicz did a lot of work in the USB gadget driver, and Video4Linux subsystem maintainer Mauro Carvalho Chehab did extensive work throughout that tree.

The 3.15–3.17 development cycles saw contributions from at least 312 employers, the most active of whom were:

Most active employers, 3.15–3.17

By changesets
(None)                             4492  11.6%
Intel                              4088  10.6%
Red Hat                            3577   9.2%
(Unknown)                          3409   8.8%
Linaro                             1702   4.4%
Samsung                            1646   4.3%
SUSE                               1243   3.2%
IBM                                1050   2.7%
(Consultant)                       1016   2.6%
Texas Instruments                   942   2.4%
Vision Engraving Systems            919   2.4%
Google                              763   2.0%
Renesas Electronics                 753   1.9%
Free Electrons                      753   1.9%
Freescale                           620   1.6%
C-DAC                               400   1.0%
Oracle                              390   1.0%
Imagination Technologies            361   0.9%
NVidia                              355   0.9%
FOSS Outreach Program for Women     336   0.9%

By lines changed
(None)                           408176  13.9%
Texas Instruments                357058  12.2%
(Unknown)                        338760  11.6%
Red Hat                          259264   8.8%
Samsung                          249613   8.5%
Intel                            180869   6.2%
Linaro                            93125   3.2%
Linux Foundation                  68988   2.4%
SUSE                              52213   1.8%
(Consultant)                      45952   1.6%
IBM                               44809   1.5%
Free Electrons                    42917   1.5%
Cisco                             33254   1.1%
Freescale                         32636   1.1%
C-DAC                             30405   1.0%
Renesas Electronics               29973   1.0%
Google                            29957   1.0%
Realtek                           27888   1.0%
NVidia                            27232   0.9%
COMPRO Intelligent Solutions      24722   0.8%

As usual, this picture has remained relatively stable from one release to the next. Mildly notable is the increase in contributions from developers working on their own time, though it would be hard to say that the long-term trend toward decreasing volunteer contributions has ended at this point.

Reviews and conclusion

Finally, it can be interesting to look at who is attaching Reviewed-by tags to patches. That tag is meant both as an indicator that the patch has been reviewed and a means for crediting developers who perform those reviews. The developers with the most Reviewed-by tags during this period were:

Developers with the most Reviewed-by tags

Ian Abbott                 766  11.0%
Josh Triplett              207   3.0%
Tomasz Figa                155   2.2%
Christoph Hellwig          142   2.0%
Ville Syrjälä              132   1.9%
Chris Wilson               123   1.8%
Johannes Berg              122   1.8%
Jesse Barnes               103   1.5%
Guenter Roeck               98   1.4%
Pieter-Paul Giesberts       92   1.3%
David Herrmann              87   1.3%
Dave Chinner                86   1.2%
Hartley Sweeten             86   1.2%
Imre Deak                   84   1.2%
Alex Elder                  84   1.2%
Rodrigo Vivi                80   1.2%
Alex Deucher                74   1.1%
Damien Lespiau              73   1.1%
Daniel (Deognyoun) Kim      71   1.0%
Franky (Zhenhui) Lin        66   1.0%

Ian Abbott, it seems, has reviewed 766 patches in the 182 days covered by these three development cycles: just over four patches every day, with no breaks for weekends or holidays. It turns out that almost all of those patches were the COMEDI changes submitted by Hartley Sweeten. Josh Triplett, instead, reviewed a wide range of changes from many developers; most of those were changes related to or involving read-copy-update. Tomasz Figa concerns himself with ARM-related changes, Christoph Hellwig is a longstanding reviewer of storage- and filesystem-related patches, and Ville Syrjälä reviews changes to DRM (graphics) drivers.

What is not reflected here, of course, is the vast amount of patch review work that never results in a Reviewed-by tag. In fact, your editor would assert that this mechanism is not working as intended at this point. It is failing to document the bulk of the review work that is being done and serves mostly to highlight which developers make the effort to offer an explicit Reviewed-by tag.

To summarize: what has changed in the six months since LWN last published a set of development statistics? The answer is "not much." The kernel development process continues to roll along, producing releases in a fairly predictable schedule. The pace continues to increase, the community continues to grow, and the development cycle continues to shorten. These are all trends that we have seen for a while, so, to a great extent, it all looks like business as usual.


Patches and updates

Kernel trees

Linus Torvalds: Linux 3.17-rc6
Kamal Mostafa: Linux 3.13.11.7

Documentation

Michael Kerrisk (man-pages): man-pages-3.73 is released


Page editor: Jonathan Corbet


Copyright © 2014, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds