
Kernel development

Brief items

Kernel release status

The current development kernel is 3.11-rc6, released on August 18. Linus said: "It's been a fairly quiet week, and the rc's are definitely shrinking. Which makes me happy." The end of the 3.11 development cycle is getting closer.

Stable updates: Greg Kroah-Hartman has had a busy week, shipping 3.10.7, 3.4.58, and 3.0.91 on August 14, followed by 3.10.8, 3.4.59, and 3.0.92 on August 20. A few hours later, he released 3.10.9 and 3.0.93 as single-patch updates to fix a problem in 3.10.8 and 3.0.92. In an attempt to avoid a repeat of that kind of problem, he is currently considering some minor tweaks to the patch selection process for stable updates. In short, all but the most urgent of patches would have to wait for roughly one week before being shipped in a stable update.

Other stable updates released this week include 3.6.11.7 (August 19), 3.5.7.19 (August 20), and 3.5.7.20 (August 21).


Kernel summit 2013: Call for Hobbyists

The program committee for the 2013 Kernel Summit (Edinburgh, October 23-25) has put out a special call for proposals from hobbyist developers — those who work on the kernel outside of a paid employment situation. "Since most top kernel developers are not hobbyists these days, this is your opportunity to make up for what we're missing. As we recognize most hobbyists don't have the resources to attend conferences, we're offering (as part of the normal kernel summit travel fund processes) travel reimbursement as part of being selected to attend." The timeline is tight: proposals should be submitted by August 24.


All About the Linux Kernel: Cgroup's Redesign (Linux.com)

Linux.com has a high-level look at control groups (cgroups), focusing on the problems with the current implementation and the plans to fix them going forward. It also looks at what the systemd project is doing to support a single, unified controller hierarchy, rather than the multiple hierarchies that exist today. "'This is partly because cgroup tends to add complexity and overhead to the existing subsystems and building and bolting something on the side is often the path of the least resistance,' said Tejun Heo, Linux kernel cgroup subsystem maintainer. 'Combined with the fact that cgroup has been exploring new areas without firm established examples to follow, this led to some questionable design choices and relatively high level of inconsistency.'"


Samsung releases exFAT filesystem source

The Software Freedom Conservancy has announced that it has helped Samsung to release a version of its exFAT filesystem implementation under the GPL. This filesystem had previously been unofficially released after a copy leaked out of Samsung. "Conservancy's primary goal, as always, was to assist and advise toward the best possible resolution to the matter that complied fully with the GPL. Conservancy is delighted that the correct outcome has been reached: a legitimate, full release from Samsung of all relevant source code under the terms of Linux's license, the GPL, version 2."


Kernel development news

Deferring mtime and ctime updates

By Jonathan Corbet
August 21, 2013
Back in 2007, the kernel developers realized that the maintenance of the last-accessed time for files ("atime") was a significant performance problem. Atime updates turned every read operation into a write, slowing the I/O subsystem significantly. The response was to add the "relatime" mount option that reduced atime updates to the minimum frequency that did not risk breaking applications. Since then, little thought has gone into the performance issues associated with file timestamps.

Until now, that is. Unix-like systems actually manage three timestamps for each file: along with atime, the system maintains the time of the last modification of the file's contents ("mtime") and the last metadata change ("ctime"). At a first glance, maintaining these times would appear to be less of a performance problem; updating mtime or ctime requires writing the file's inode back to disk, but the operation that causes the time to be updated will be causing a write to happen anyway. So, one would think, any extra cost would be lost in the noise.

It turns out, though, that there is a situation where that is not the case: when a file is written through a mapping created with mmap(). Writable memory-mapped files are a bit of a challenge for the operating system: the application can change any part of the file with a simple memory reference without notifying the kernel. But the kernel must learn about the write somehow so that it can eventually push the modified data back to persistent storage. So, when a file is mapped for write access and a page is brought into memory, the kernel will mark that page (in the hardware) as being read-only. An attempt to write that page will generate a fault, notifying the kernel that the page has been changed. At that point, the page can be made writable so that further writes will not generate any more faults; it can stay writable until the kernel cleans the page by writing it back to disk. Once the page is clean, it must be marked read-only once again.
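The write path at issue can be exercised from user space without any write() system call for the payload; this minimal Python sketch (assuming a Linux system) stores into a shared mapping and then calls msync() via mmap.flush() — the first store into the page is what triggers the write fault described above:

```python
import mmap
import os
import tempfile

# Sketch: modify a file purely through a shared writable mapping.
fd, path = tempfile.mkstemp()
os.write(fd, b"\0" * 4096)       # give the file one page of backing store

m = mmap.mmap(fd, 4096)          # MAP_SHARED, readable and writable
m[0:5] = b"hello"                # plain memory store -> write fault in the kernel
m.flush()                        # msync(); the dirty page heads to the filesystem

data = os.pread(fd, 5, 0)        # the file now reflects the mapped store
mtime = os.stat(path).st_mtime   # when this timestamp gets updated is the question

m.close()
os.close(fd)
os.unlink(path)
```

Whether mtime reflects the time of the store, of the msync() call, or of eventual writeback depends on the kernel and filesystem, which is exactly the behavior under discussion below.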

The problem, as explained by Dave Chinner, is this: as soon as the kernel receives the page fault and makes the page writable, it must update the file's timestamps, and, for some filesystem types, an associated revision counter as well. That update is done synchronously in a filesystem transaction as part of the process of handling the page fault and allowing write access. So a quick operation to make a page writable turns into a heavyweight filesystem operation, and it happens every time the application attempts to write to a clean page. If the application writes large numbers of pages that have been mapped into memory, the result will be a painful slowdown. And most of that effort is wasted; the timestamp updates overwrite each other, so only the last one will persist for any useful period of time.

As it happens, Andy Lutomirski has an application that is affected badly by this problem. One of his previous attempts to address the associated performance problems — MADV_WILLWRITE — was covered here recently. Needless to say, he is not a big fan of the current behavior associated with mtime and ctime updates. He also asserted that the current behavior violates the Single Unix Specification, which states that those times must be updated between any write to a page and either the next msync() call or the writeback of the data in question. The kernel, he said, does not currently implement the required behavior.

In particular, he pointed out that the timestamp updates happen after the first write to a given page. After that first reference, the page is left writable and the kernel will be unaware of any subsequent modifications until the page is written back. If the page remains in memory for a long time (multiple seconds) before being written back — as is often the case — the timestamp update will incorrectly reflect the time of the first write, not the last one.

In an attempt to fix both the performance and correctness issues, Andy has put together a patch set that changes the way timestamp updates are handled. In the new scheme, timestamps are not updated when a page is made writable; instead, a new flag (AS_CMTIME) is set in the associated address_space structure. So there is no longer a filesystem transaction that must be done when the page is made writable. At some future time, the kernel will call the new flush_cmtime() address space operation to tell the filesystem that an inode's times should be updated; that call will happen in response to a writeback operation or an msync() call. So, if thousands of pages are dirtied before writeback happens, the timestamp updates will be collapsed into a single transaction, speeding things considerably. Additionally, the timestamp will reflect the time of the last update instead of the first.
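The batching effect of the new scheme can be illustrated with a toy model — plain Python, not kernel code; the AS_CMTIME and flush_cmtime() names are borrowed from the patch set, while everything else here is invented for illustration:

```python
class AddressSpace:
    """Toy model of the deferral idea: write faults only set a flag,
    and the expensive timestamp transaction runs once, at writeback
    or msync() time, covering all faults since the last flush."""

    def __init__(self):
        self.cmtime_pending = False   # stands in for the AS_CMTIME flag
        self.transactions = 0         # filesystem transactions performed
        self.mtime = None

    def write_fault(self, now):
        # Old scheme: a synchronous filesystem transaction here, on
        # every clean-to-dirty fault. New scheme: just record the debt.
        self.cmtime_pending = True
        self.last_write = now

    def writeback(self):
        # Stands in for flush_cmtime(): one transaction settles all
        # pending timestamp updates, using the time of the last write.
        if self.cmtime_pending:
            self.mtime = self.last_write
            self.transactions += 1
            self.cmtime_pending = False

space = AddressSpace()
for t in range(1000):      # a thousand write faults...
    space.write_fault(now=t)
space.writeback()          # ...collapse into a single transaction
```

Under the old scheme the loop above would have cost a thousand transactions, each recording a timestamp that the next one overwrites; here one transaction records the final write time.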

There have been some quibbles with this approach. One concern is that there are tight requirements around the handling of timestamps and revision numbers in filesystems that are exported via NFS. NFS clients use those timestamps to learn when cached copies of file data have gone stale; if the timestamp updates are deferred, there is a risk that a client could work with stale data for some period of time. Andy claimed that, with the current scheme, the timestamp could be wrong for a far longer period, so, he said, his patch represents an improvement, even if it's not perfect. David Lang suggested that perfection could be reached by updating the timestamps in memory on the first fault but not flushing that change to disk; Andy saw merit in the idea, but has not implemented it thus far.

As of this writing, the responses to the patch set itself have mostly been related to implementation details. Andy will have a number of things to change in the patch; it also needs filesystem implementations beyond just ext4 and a test for the xfstests package to show that things work correctly. But the core idea no longer seems to be controversial. Barring a change of opinion within the community, faster write fault handling for file-backed pages should be headed toward a mainline kernel sometime soon.


Some numbers from the 3.11 development cycle

By Jonathan Corbet
August 21, 2013
As of this writing, the 3.11-rc6 prepatch is out and the 3.11 development cycle appears to be slowly drawing toward a close. That can only mean one thing: it must be about time to look at some statistics from this cycle and see where the contributions came from. 3.11 looks like a fairly typical 3.x cycle, but, as always, there's a small surprise or two for those who look.

Developers and companies

Just over 10,700 non-merge changesets have been pulled into the repository (so far) for 3.11; they added over 775,000 lines of code and removed over 328,000 lines for a net growth of 447,000 lines. So this remains a rather slower cycle than 3.10, which was well past 13,000 changesets by the -rc6 release. As might be expected, the number of developers contributing to this release has dropped along with the changeset count, but this kernel still reflects contributions from 1,239 developers. The most active of those developers were:

Most active 3.11 developers

By changesets
  H Hartley Sweeten       333  3.1%
  Sachin Kamat            302  2.8%
  Alex Deucher            254  2.4%
  Jingoo Han              190  1.8%
  Laurent Pinchart        147  1.4%
  Daniel Vetter           137  1.3%
  Al Viro                 131  1.2%
  Hans Verkuil            123  1.1%
  Lee Jones               112  1.0%
  Xenia Ragiadakou        100  0.9%
  Wei Yongjun              99  0.9%
  Jiang Liu                98  0.9%
  Lars-Peter Clausen       91  0.8%
  Linus Walleij            90  0.8%
  Johannes Berg            86  0.8%
  Tejun Heo                85  0.8%
  Oleg Nesterov            71  0.7%
  Fabio Estevam            70  0.7%
  Tomi Valkeinen           69  0.6%
  Dan Carpenter            66  0.6%

By changed lines
  Peng Tao             260439  26.9%
  Greg Kroah-Hartman    91973   9.5%
  Alex Deucher          55904   5.8%
  Kalle Valo            22103   2.3%
  Ben Skeggs            20282   2.1%
  Eli Cohen             15886   1.6%
  Solomon Peachy        15510   1.6%
  Aaro Koskinen         13443   1.4%
  H Hartley Sweeten     11043   1.1%
  Laurent Pinchart       8923   0.9%
  Benoit Cousson         8734   0.9%
  Tomi Valkeinen         8246   0.9%
  Yuan-Hsin Chen         8222   0.9%
  Tomasz Figa            7668   0.8%
  Xenia Ragiadakou       5136   0.5%
  Johannes Berg          5029   0.5%
  Maarten Lankhorst      4924   0.5%
  Marc Zyngier           4817   0.5%
  Hans Verkuil           4707   0.5%
  Linus Walleij          4379   0.5%

Someday, somehow, somebody will manage to displace H. Hartley Sweeten from the top of the by-changesets list, but that was not fated to be in the 3.11 cycle. As always, he is working on cleaning up the Comedi drivers in the staging tree — a task that has led to the merging of almost 4,000 changesets into the kernel so far. Sachin Kamat contributed a large set of cleanups throughout the driver tree, Alex Deucher is the primary developer for the Radeon graphics driver, Jingoo Han, like Sachin, did a bunch of driver cleanup work, and Laurent Pinchart did a lot of Video4Linux and ARM architecture work.

On the "lines changed" side, Peng Tao added the Lustre filesystem to the staging tree, while Greg Kroah-Hartman removed the unloved csr driver from that tree. Alex's Radeon work has already been mentioned; Kalle Valo added the ath10k wireless network driver, while Ben Skeggs continued to improve the Nouveau graphics driver.

Almost exactly 200 employers supported work on the 3.11 kernel; the most active of those were:

Most active 3.11 employers

By changesets
  (None)                        976  9.1%
  Intel                         970  9.1%
  Red Hat                       911  8.5%
  Linaro                        890  8.3%
  Samsung                       485  4.5%
  (Unknown)                     483  4.5%
  IBM                           418  3.9%
  Vision Engraving Systems      333  3.1%
  Texas Instruments             319  3.0%
  SUSE                          310  2.9%
  AMD                           281  2.6%
  Renesas Electronics           265  2.5%
  Outreach Program for Women    230  2.1%
  Google                        224  2.1%
  Freescale                     151  1.4%
  Oracle                        137  1.3%
  ARM                           135  1.3%
  Cisco                         132  1.2%

By lines changed
  (None)                     307996  31.9%
  Linux Foundation            93929   9.7%
  AMD                         57745   6.0%
  Red Hat                     52679   5.5%
  Intel                       40868   4.2%
  Texas Instruments           28819   3.0%
  Qualcomm                    26215   2.7%
  Renesas Electronics         24084   2.5%
  Samsung                     23413   2.4%
  Linaro                      20649   2.1%
  (Unknown)                   17362   1.8%
  IBM                         17337   1.8%
  AbsoluteValue Systems       16872   1.7%
  Nokia                       16847   1.7%
  Mellanox                    16841   1.7%
  Vision Engraving Systems    12268   1.3%
  Outreach Program for Women  11499   1.2%
  SUSE                        10279   1.1%

Once again, the percentage of changes coming from volunteers (listed as "(None)" above) appears to be slowly falling; it is down from over 11% in 3.10. Red Hat has, for the second time, ceded the top non-volunteer position to Intel, but the fact that Linaro is closing on Red Hat from below is arguably far more interesting. The numbers also reflect the large set of contributions that came in from applicants to the Outreach Program for Women, which has clearly succeeded in motivating contributions to the kernel.

Signoffs

Occasionally it is interesting to look at the Signed-off-by tags in patches in the kernel repository. In particular, looking at signoffs by developers other than the author of the patch gives a sense of which subsystem maintainers are responsible for getting patches into the mainline. In the 3.11 cycle, the top gatekeepers were:

Most non-author signoffs in 3.11

By developer
  Greg Kroah-Hartman       1212  12.3%
  David S. Miller           801   8.1%
  Andrew Morton             611   6.2%
  Mauro Carvalho Chehab     371   3.8%
  John W. Linville          285   2.9%
  Mark Brown                276   2.8%
  Daniel Vetter             264   2.7%
  Simon Horman              252   2.6%
  Linus Walleij             236   2.4%
  Benjamin Herrenschmidt    172   1.7%
  Kyungmin Park             157   1.6%
  James Bottomley           143   1.4%
  Ingo Molnar               132   1.3%
  Rafael J. Wysocki         131   1.3%
  Kukjin Kim                121   1.2%
  Dave Airlie               121   1.2%
  Shawn Guo                 121   1.2%
  Felipe Balbi              119   1.2%
  Johannes Berg             117   1.2%
  Ralf Baechle              110   1.1%

By employer
  Red Hat                   2156  21.9%
  Linux Foundation          1249  12.7%
  Intel                      904   9.2%
  Google                     788   8.0%
  Linaro                     759   7.7%
  Samsung                    429   4.4%
  (None)                     408   4.1%
  IBM                        332   3.4%
  Renesas Electronics        259   2.6%
  SUSE                       249   2.5%
  Texas Instruments          237   2.4%
  Parallels                  143   1.5%
  Wind River                 126   1.3%
  (Unknown)                  124   1.3%
  Wolfson Microelectronics   114   1.2%
  Broadcom                    97   1.0%
  Fusion-IO                   89   0.9%
  OLPC                       87   0.9%
  (Consultant)                86   0.9%
  Cisco                       80   0.8%

We first looked at signoffs for 2.6.22 in 2007. Looking now, there are many of the same names on the list — but also quite a few changes. As is the case with other aspects of kernel development, the changes in signoffs reflect the growing importance of the mobile and embedded sector. The good news, as reflected in these numbers, is that mobile and embedded developers are finding roles as subsystem maintainers, giving them a stronger say in the direction of kernel development going forward.

Persistence of code

Finally, it has been some time since we looked at persistence of code over time; in particular, we examined how much code from each development cycle remained in the 2.6.33 kernel. This information is obtained through the laborious process of running "git blame" on each file, looking at the commit associated with each line, and mapping that to the release in which that commit was merged. Doing the same thing now yields a plot that looks like this:
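The counting step of that process can be sketched in a few lines of Python. This is a simplified illustration, not the actual tooling: it parses `git blame --line-porcelain` output (where each line of the file is preceded by a header starting with the 40-character commit hash) and assumes a separately built commit-to-release map, such as one derived from `git describe --contains`:

```python
from collections import Counter

HEX = set("0123456789abcdef")

def lines_per_release(blame_text, release_of):
    """Count surviving lines per release, given the text output of
    `git blame --line-porcelain <file>` and a dict mapping commit
    hashes to the release in which they were merged."""
    counts = Counter()
    for line in blame_text.splitlines():
        parts = line.split()
        # Each blamed line's metadata block begins with
        # "<40-hex-sha> <orig-line> <final-line> [<group-size>]".
        if parts and len(parts[0]) == 40 and set(parts[0]) <= HEX:
            counts[release_of.get(parts[0], "unknown")] += 1
    return counts
```

Running this over every file in the tree and summing the counters yields the per-release totals behind the plot below; the "laborious" part is doing that for tens of thousands of files.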

[bar chart]

From this we see that the code added for 3.11 makes up a little over 4% of the kernel as a whole; as might be expected, the percentage drops as one looks at older releases. Still, quite a bit of code from the early 2.6.30's remains untouched to this day. Incidentally, about 19% of the code in the kernel has not been changed since the beginning of the git era; there are still 545 files that have not been changed at all since the 2.6.12 development cycle.

Another way to look at things would be to see how many lines from each cycle were in the kernel in 2.6.33 (the last time this exercise was done) compared to what's there now. That yields:

[another bar chart]

Thus, for example, the 2.6.33 kernel had about 400,000 lines from 2.6.26; of those, about 290,000 remain in 3.11. One other thing that stands out is that the early 2.6.30 development cycles saw fewer changesets merged into the mainline than, say, 3.10 did, but they added more code. Much of that code has since been changed or removed, though. Given that much of that code went into the staging tree, this result is not entirely surprising; the whole point of putting code into staging is to set it up for rapid change.

Actually, "rapid change" describes just about all of the data presented here. The kernel process continues to absorb changes at a surprising and, seemingly, increasing rate without showing any serious signs of strain. There is almost certainly a limit to the scalability of the current process, but we do not appear to have found it yet.


The return of nftables

By Jonathan Corbet
August 20, 2013
Some ideas take longer than others to find their way into the mainline kernel. The network firewalling mechanism known as "nftables" would be a case in point. Much of this work was done in 2009; despite showing a lot of promise at the time, the work languished for years afterward. But, now, there would appear to be a critical mass of developers working on nftables, and we may well see it merged in the relatively near future.

A firewall works by testing a packet against a chain of one or more rules. Any of those rules may decide that the packet is to be accepted or rejected, or it may defer judgment for subsequent rules. Rules may include tests that take forms like "which TCP port is this packet destined for?", "is the source IP address on a trusted network?", or "is this packet associated with a known, open connection?", for example. Since the tests applied to packets are expressed in networking terms (ports, IP addresses, etc.), the code that implements the firewall subsystem ("netfilter") has traditionally contained a great deal of protocol awareness. In fact, this awareness is built so deeply into the code that it has had to be replicated four times — for IPv4, IPv6, ARP, and Ethernet bridging — because the firewall engines are too protocol-specific to be used in a generic manner.

That duplication of code is one of a number of shortcomings in netfilter that have long driven a desire for a replacement. In 2009, it appeared that such a replacement was in the works when Patrick McHardy announced his nftables project. Nftables replaces the multiple netfilter implementations with a single packet filtering engine built on an in-kernel virtual machine, unifying firewalling at the expense of putting (another) bytecode interpreter into the kernel. At the time, the reaction to the idea was mostly positive, but work stalled on nftables just the same. Patrick committed some changes in July 2010; after that, he made no more commits for more than two years.

Frustrations with the current firewalling code did not just go away, though. Over time, it also became clear that a general-purpose in-kernel packet classification engine could find uses beyond firewalls; packet scheduling is another fairly obvious possibility. So, in October 2012, current netfilter maintainer Pablo Neira Ayuso announced that he was resurrecting Patrick's nftables patches with an eye toward relatively quick merging into the mainline. Since then, development of the code has accelerated, with nftables discussion now generating much of the traffic on the netfilter mailing list.

Nftables as it exists today is still built on the core principles designed by Patrick. It adds a simple virtual machine to the kernel that is able to execute bytecode to inspect a network packet and make decisions on how that packet should be handled. The operations implemented by this machine are intentionally basic: it can get data from the packet itself, look at the associated metadata (which interface the packet arrived at, for example), and manage connection tracking data. Arithmetic, bitwise, and comparison operators can be used to make decisions based on that data. The virtual machine is capable of manipulating sets of data (typically IP addresses), allowing multiple comparison operations to be replaced with a single set lookup. There is also a "map" type that can be used to store packet decisions directly under a key of interest — again, usually an IP address. So, for example, a whitelist map could hold a set of known IP addresses, associating an "accept" verdict with each.
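The flavor of that virtual machine can be conveyed with a toy interpreter. To be clear, everything below is invented for illustration — the real nftables bytecode, its registers, and its operation names all differ — but it shows how basic load/compare/lookup operations compose into rules, and how a set or verdict map replaces a chain of comparisons:

```python
# Toy packet-classification VM, loosely in the spirit of nftables.
def run(program, pkt):
    """Execute a list of (op, *args) instructions against a packet
    (modeled as a dict of fields); return a verdict string."""
    reg = None
    for op, *args in program:
        if op == "load":          # fetch a packet field into the register
            reg = pkt[args[0]]
        elif op == "cmp":         # mismatch means this rule doesn't apply
            if reg != args[0]:
                return "continue"
        elif op == "lookup":      # set membership: one op, many addresses
            if reg not in args[0]:
                return "continue"
        elif op == "vmap":        # map a key directly to a verdict
            return args[0].get(reg, "continue")
        elif op == "verdict":
            return args[0]
    return "continue"

# A rule accepting packets from a set of trusted source addresses:
trusted = {"10.0.0.1", "10.0.0.2"}
rule = [("load", "saddr"), ("lookup", trusted), ("verdict", "accept")]

# A verdict map: per-address decisions stored under the key itself:
policy = {"203.0.113.5": "accept", "203.0.113.9": "drop"}
vrule = [("load", "saddr"), ("vmap", policy)]
```

A "continue" result here models deferring to the next rule in the chain; the set and map rules show why a single lookup can stand in for an arbitrarily long run of individual comparisons.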

Replacing the current, well-tuned firewalling code with a dumb virtual machine may seem like a step backward. As it happens, there are signs that the virtual machine may be faster than the code it replaces, but there are a number of other advantages independent of performance. At the top of the list is removing all of the protocol awareness from the decision engine, allowing a single implementation to serve everywhere a packet inspection engine is required. The protocol awareness and associated intelligence can, instead, be pushed out to user space.

Nftables also offers an improved user-space API that allows the atomic replacement of one or more rules with a single netlink transaction. That will speed up firewall changes for sites with large rulesets; it can also help to avoid race conditions while the rule change is being executed.

The code worked reasonably well in 2009, though there were a lot of loose ends to tie down. At the top of Pablo's list of needed improvements to nftables when he picked up the project was a bulletproof compatibility layer for existing netfilter-based firewalls. A new rule compiler will take existing firewall rules and compile them for the nftables virtual machine, allowing current firewall setups to migrate with no changes needed. This compatibility code should allow nftables to replace the current netfilter tables relatively quickly. Even so, chances are that both mechanisms will have to coexist in the kernel for years. One of the other design goals behind nftables — use of the existing netfilter hook points, connection-tracking infrastructure, and more — will make that coexistence relatively easy.

Since the work on nftables restarted, the repository has seen over 70 commits from a half-dozen developers; there has also been a lot of work going into the user-space nft tool and libnftables library. The kernel changes have added missing features (the ability to restore saved counter values, for example), compatibility hooks allowing existing netfilter extensions to be used until their nftables replacements are ready, many improvements to the rule update mechanism, IPv6 NAT support, packet tracing support, ARP filtering support, and more. The project appears to have picked up some momentum; it seems unlikely to fall into another multi-year period without activity before being merged.

As to when that merge will happen...it is still too early to say. The developers are closing in on their set of desired features, but the code has not yet been exposed to wide review beyond the netfilter list. All that can be said with certainty is that it appears to be getting closer and to have the development resources needed to finish the job.

See the nftables web page for more information. A terse but useful HOWTO document has been posted by Eric Leblond; it is probably required reading for anybody wanting to play with this code, but a quick, casual read will also answer a number of questions about what firewalling will look like in the nftables era.



Page editor: Jonathan Corbet

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds