Brief items
The current development kernel is 3.11-rc6,
released on August 18. Linus said: "It's been a fairly quiet week,
and the rc's are definitely shrinking. Which makes me happy." The end
of the 3.11 development cycle is getting closer.
Stable updates: Greg Kroah-Hartman has had a busy week, shipping 3.10.7, 3.4.58, and 3.0.91 on August 14, followed by 3.10.8, 3.4.59, and 3.0.92 on August 20. A few hours later,
he released 3.10.9 and 3.0.93 as single-patch updates to fix a
problem in 3.10.8 and 3.0.92. In an attempt to avoid a repeat of that kind
of problem, he is currently considering some
minor tweaks to the patch selection process for stable updates. In
short, all but the most urgent of patches would have to wait for roughly
one week before being shipped in a stable update.
Other stable updates released this week include 3.6.11.7 (August 19), 3.5.7.19 (August 20), and 3.5.7.20 (August 21).
The program committee for the 2013 Kernel Summit (Edinburgh, October 23-25)
has put out a special call for proposals from hobbyist developers — those
who work on the kernel outside of a paid employment situation.
"Since most top kernel developers are not hobbyists these days, this
is your opportunity to make up for what we're missing. As we recognize
most hobbyists don't have the resources to attend conferences, we're
offering (as part of the normal kernel summit travel fund processes) travel
reimbursement as part of being selected to attend." The timeline is
tight: proposals should be submitted by August 24.
Linux.com has a
high-level look at control groups (cgroups), focusing on the problems with the current implementation and the plans to fix them going forward. It also looks at what the systemd project is doing to support a single, unified controller hierarchy, rather than the multiple hierarchies that exist today.
"'This is partly because cgroup tends to add complexity and overhead to the existing subsystems and building and bolting something on the side is often the path of the least resistance,' said Tejun Heo, Linux kernel cgroup subsystem maintainer. 'Combined with the fact that cgroup has been exploring new areas without firm established examples to follow, this led to some questionable design choices and relatively high level of inconsistency.'"
The Software Freedom Conservancy has
announced
that it has helped Samsung to release a version of its exFAT filesystem
implementation under the GPL. This filesystem had previously been
unofficially released after a copy leaked out
of Samsung. "Conservancy's primary goal, as always, was to assist
and advise toward the best possible resolution to the matter that complied
fully with the GPL. Conservancy is delighted that the correct outcome has
been reached: a legitimate, full release from Samsung of all relevant
source code under the terms of Linux's license, the GPL, version 2."
Kernel development news
By Jonathan Corbet
August 21, 2013
Back in 2007, the kernel developers
realized that the maintenance of the last-accessed
time for files ("atime") was a significant performance problem. Atime
updates turned every read operation into a write, slowing the I/O subsystem
significantly. The response was to add the "relatime" mount option that
reduced atime updates to the minimum frequency that did not risk breaking
applications. Since then, little thought has gone into the performance
issues associated with file timestamps.
Until now, that is. Unix-like systems actually manage three timestamps
for each file: along with atime, the system maintains the time of the last
modification of the file's contents ("mtime") and the last metadata change
("ctime"). At first glance, maintaining these times would appear to be
less of a performance problem; updating mtime or ctime requires writing the
file's inode back to disk, but the operation that causes the time to be
updated will be causing a write to happen anyway. So, one would think, any
extra cost would be lost in the noise.
It turns out, though, that there is a situation where that is not the
case: when a file is written through a mapping created with
mmap(). Writable memory-mapped files are a bit of a challenge for
the operating system: the application can change any part of the file with
a simple memory reference without notifying the kernel. But the kernel
must learn about the write somehow so that it can eventually push the
modified data back to persistent storage. So, when a file is mapped for
write access and a page is brought into memory, the kernel will mark that
page (in the hardware) as being read-only. An attempt to write that page
will generate a fault, notifying the kernel that the page has been
changed. At that point, the page can be made writable so that further
writes will not generate any more faults; it can stay writable until the
kernel cleans the page by writing it back to disk. Once the page is clean,
it must be marked read-only once again.
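To make the scenario concrete, here is a minimal user-space sketch of writing a file through a shared mapping. It is an illustration only: the helper name write_through_mapping() is made up for this example, and the fault handling described above happens inside the kernel, out of sight of this code. The sketch returns the file's mtime before and after the writes, but makes no claim about exactly when the kernel updates it.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def write_through_mapping(num_pages=4):
    """Dirty one byte per page through a shared, writable mapping.

    Returns (file contents read back, mtime before, mtime after).
    The function name and return shape are illustrative assumptions.
    """
    fd, path = tempfile.mkstemp()
    try:
        # The file needs a nonzero size before it can be mapped.
        os.ftruncate(fd, num_pages * PAGE)
        before = os.fstat(fd).st_mtime_ns
        with mmap.mmap(fd, num_pages * PAGE, access=mmap.ACCESS_WRITE) as m:
            for page in range(num_pages):
                # A plain memory store; no write() system call is made.
                # Per the description above, the first store to a clean
                # page is what the kernel sees as a write fault.
                m[page * PAGE] = 0x41
            # flush() asks for the dirty pages to be written to the file
            # (the msync() path on Unix-like systems).
            m.flush()
        after = os.fstat(fd).st_mtime_ns
        with open(path, "rb") as f:
            data = f.read()
        return data, before, after
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    data, before, after = write_through_mapping()
    print("bytes written per page:", data[0], data[PAGE])
    print("mtime changed:", after != before)
```

Each of the stores in the loop touches a different page, so an application doing this across thousands of pages would take thousands of write faults.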
The problem, as explained by Dave Chinner,
is this: as soon as the kernel receives the page fault and makes the page
writable, it must update the file's timestamps, and, for some filesystem
types, an associated
revision counter as well. That update is done synchronously in a filesystem
transaction as part of the process of handling the page fault and allowing
write access. So a quick operation to make a page writable turns into a
heavyweight filesystem operation, and it happens every time the application
attempts to write to a clean page. If the application writes large numbers
of pages that have been mapped into memory, the result will be a painful
slowdown. And most of that effort is wasted; the timestamp updates
overwrite each other, so only the last one will persist for any useful
period of time.
As it happens, Andy Lutomirski has an application that is affected badly by
this problem. One of his
previous attempts to address the associated performance problems —
MADV_WILLWRITE — was covered here
recently. Needless to say, he is not a big fan of the current behavior
associated with mtime and ctime updates. He also asserted that the current
behavior violates the Single Unix Specification, which states that those
times must be updated between any write to a page and either the next
msync() call or the writeback of the data in question. The
kernel, he said, does not currently implement the required behavior.
In particular,
he pointed out that the timestamp updates happen after the first
write to a given page. After that first reference, the page is
left writable and the kernel will be unaware of any subsequent modifications until
the page is written back. If the page remains in memory for a long time
(multiple seconds) before being written back — as is often the case — the
timestamp update will incorrectly reflect the time of the first write, not
the last one.
In an attempt to fix both the performance and correctness issues, Andy has
put together a patch set that changes the
way timestamp updates are handled. In the new scheme, timestamps are not
updated when a page is made writable; instead, a new flag
(AS_CMTIME) is set in the associated address_space
structure. So there is no longer a filesystem transaction that must be done when
the page is made writable. At some future time, the kernel will call the
new flush_cmtime() address space operation to tell the filesystem
that an inode's times should be updated; that call will happen in response
to a writeback operation or an msync() call. So, if thousands of
pages are dirtied before writeback happens, the timestamp updates will be
collapsed into a single transaction, speeding things considerably.
Additionally, the timestamp will reflect the time of the last update
instead of the first.
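The difference between the two schemes can be shown with a toy model. This is not kernel code and not taken from Andy's patches; the classes below are hypothetical stand-ins that merely count how many simulated "filesystem transactions" each approach performs, with a boolean playing the role of the AS_CMTIME flag.

```python
class ToyInode:
    """Toy stand-in for an inode; counts simulated timestamp transactions."""

    def __init__(self):
        self.mtime = 0
        self.transactions = 0
        self.cmtime_pending = False  # plays the role of AS_CMTIME

    def _update_times(self, now):
        # Stands in for the heavyweight filesystem transaction.
        self.mtime = now
        self.transactions += 1


class EagerModel(ToyInode):
    """Current behavior as described: update on every write fault."""

    def write_fault(self, now):
        self._update_times(now)

    def writeback(self, now):
        pass  # nothing left to do; times were updated at fault time


class DeferredModel(ToyInode):
    """Proposed behavior as described: set a flag, update at writeback."""

    def write_fault(self, now):
        self.cmtime_pending = True  # cheap: no transaction here

    def writeback(self, now):
        # Stands in for the flush_cmtime() call made at writeback or msync().
        if self.cmtime_pending:
            self._update_times(now)
            self.cmtime_pending = False


def simulate(model, fault_times, writeback_time):
    """Run a series of write faults followed by one writeback."""
    for t in fault_times:
        model.write_fault(t)
    model.writeback(writeback_time)
    return model.transactions, model.mtime


if __name__ == "__main__":
    faults = range(1, 1001)  # 1,000 pages dirtied at times 1..1000
    print("eager:   ", simulate(EagerModel(), faults, 2000))
    print("deferred:", simulate(DeferredModel(), faults, 2000))
```

In this model a thousand faults cost a thousand transactions under the eager scheme and one under the deferred scheme, and only the deferred timestamp matches the time of the writeback.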
There have been some quibbles with this approach. One concern is that
there are tight requirements around the handling of timestamps and revision
numbers in filesystems that are exported via NFS. NFS clients use those
timestamps to learn when cached copies of file data have gone stale; if the
timestamp updates are deferred, there is a risk that a client could work
with stale data for some period of time. Andy claimed that, with the current scheme, the
timestamp could be wrong for a far longer period, so, he said, his patch
represents
an improvement, even if it's not perfect. David Lang suggested that perfection could be reached by
updating the timestamps in memory on the first fault but not flushing that
change to disk; Andy saw merit in the idea, but has not implemented it thus
far.
As of this writing, the responses to the patch set itself have mostly been
related
to implementation details. Andy will have a number of things to change in
the patch; it also needs filesystem implementations beyond just ext4 and a
test for the xfstests package to show that things work correctly. But the
core idea no longer seems to be controversial. Barring a change of opinion
within the community, faster write fault handling for file-backed pages
should be headed toward a mainline kernel sometime soon.
By Jonathan Corbet
August 21, 2013
As of this writing, the
3.11-rc6 prepatch
is out and the 3.11 development cycle appears to be slowly drawing toward a
close. That can only mean one thing: it must be about time to look at some
statistics from this cycle and see where the contributions came from. 3.11
looks like a fairly typical 3.x cycle, but, as always, there's a small
surprise or two for those who look.
Developers and companies
Just over 10,700 non-merge changesets have been pulled into the repository
(so far) for 3.11; they added over 775,000 lines of code and removed over
328,000 lines for a net growth of 447,000 lines. So this remains a rather slower cycle than 3.10, which
was well past 13,000 changesets by the -rc6 release. As might be expected,
the number of developers contributing to this release has dropped along
with the changeset count, but this kernel still reflects contributions from
1,239 developers. The most active of those developers were:
| Most active 3.11 developers |
| By changesets |
| H Hartley Sweeten | 333 | 3.1% |
| Sachin Kamat | 302 | 2.8% |
| Alex Deucher | 254 | 2.4% |
| Jingoo Han | 190 | 1.8% |
| Laurent Pinchart | 147 | 1.4% |
| Daniel Vetter | 137 | 1.3% |
| Al Viro | 131 | 1.2% |
| Hans Verkuil | 123 | 1.1% |
| Lee Jones | 112 | 1.0% |
| Xenia Ragiadakou | 100 | 0.9% |
| Wei Yongjun | 99 | 0.9% |
| Jiang Liu | 98 | 0.9% |
| Lars-Peter Clausen | 91 | 0.8% |
| Linus Walleij | 90 | 0.8% |
| Johannes Berg | 86 | 0.8% |
| Tejun Heo | 85 | 0.8% |
| Oleg Nesterov | 71 | 0.7% |
| Fabio Estevam | 70 | 0.7% |
| Tomi Valkeinen | 69 | 0.6% |
| Dan Carpenter | 66 | 0.6% |
|
| By changed lines |
| Peng Tao | 260439 | 26.9% |
| Greg Kroah-Hartman | 91973 | 9.5% |
| Alex Deucher | 55904 | 5.8% |
| Kalle Valo | 22103 | 2.3% |
| Ben Skeggs | 20282 | 2.1% |
| Eli Cohen | 15886 | 1.6% |
| Solomon Peachy | 15510 | 1.6% |
| Aaro Koskinen | 13443 | 1.4% |
| H Hartley Sweeten | 11043 | 1.1% |
| Laurent Pinchart | 8923 | 0.9% |
| Benoit Cousson | 8734 | 0.9% |
| Tomi Valkeinen | 8246 | 0.9% |
| Yuan-Hsin Chen | 8222 | 0.9% |
| Tomasz Figa | 7668 | 0.8% |
| Xenia Ragiadakou | 5136 | 0.5% |
| Johannes Berg | 5029 | 0.5% |
| Maarten Lankhorst | 4924 | 0.5% |
| Marc Zyngier | 4817 | 0.5% |
| Hans Verkuil | 4707 | 0.5% |
| Linus Walleij | 4379 | 0.5% |
|
Someday, somehow, somebody will manage to displace H. Hartley Sweeten from
the top of the by-changesets list, but that was not fated to be in the 3.11
cycle. As always, he is working on cleaning up the Comedi drivers in the
staging tree — a task that has led to the merging of almost 4,000
changesets into the kernel so far. Sachin Kamat contributed a large set
of cleanups throughout the driver tree, Alex Deucher is the primary
developer for the Radeon graphics driver, Jingoo Han, like
Sachin, did a bunch of driver cleanup work, and Laurent Pinchart did a lot of
Video4Linux and ARM architecture work.
On the "lines changed" side, Peng Tao added the Lustre filesystem to the
staging tree, while Greg Kroah-Hartman removed the unloved csr driver
from that tree. Alex's Radeon work has already been mentioned; Kalle Valo
added the ath10k wireless network driver, while Ben Skeggs continued to
improve the Nouveau graphics driver.
Almost exactly 200 employers supported work on the 3.11 kernel; the most
active of those were:
| Most active 3.11 employers |
| By changesets |
| (None) | 976 | 9.1% |
| Intel | 970 | 9.1% |
| Red Hat | 911 | 8.5% |
| Linaro | 890 | 8.3% |
| Samsung | 485 | 4.5% |
| (Unknown) | 483 | 4.5% |
| IBM | 418 | 3.9% |
| Vision Engraving Systems | 333 | 3.1% |
| Texas Instruments | 319 | 3.0% |
| SUSE | 310 | 2.9% |
| AMD | 281 | 2.6% |
| Renesas Electronics | 265 | 2.5% |
| Outreach Program for Women | 230 | 2.1% |
| Google | 224 | 2.1% |
| Freescale | 151 | 1.4% |
| Oracle | 137 | 1.3% |
| ARM | 135 | 1.3% |
| Cisco | 132 | 1.2% |
|
| By lines changed |
| (None) | 307996 | 31.9% |
| Linux Foundation | 93929 | 9.7% |
| AMD | 57745 | 6.0% |
| Red Hat | 52679 | 5.5% |
| Intel | 40868 | 4.2% |
| Texas Instruments | 28819 | 3.0% |
| Qualcomm | 26215 | 2.7% |
| Renesas Electronics | 24084 | 2.5% |
| Samsung | 23413 | 2.4% |
| Linaro | 20649 | 2.1% |
| (Unknown) | 17362 | 1.8% |
| IBM | 17337 | 1.8% |
| AbsoluteValue Systems | 16872 | 1.7% |
| Nokia | 16847 | 1.7% |
| Mellanox | 16841 | 1.7% |
| Vision Engraving Systems | 12268 | 1.3% |
| Outreach Program for Women | 11499 | 1.2% |
| SUSE | 10279 | 1.1% |
|
Once again, the percentage of changes coming from volunteers (listed as
"(None)" above) appears to be
slowly falling; it is down from over 11% in 3.10. Red Hat has, for the
second time,
ceded the top non-volunteer position to Intel, but the fact that Linaro is
closing on Red Hat from below is arguably far more interesting. The
numbers also reflect the large set of contributions that came in from
applicants to the Outreach Program for
Women, which has clearly
succeeded in motivating contributions to the kernel.
Signoffs
Occasionally it is interesting to look at the Signed-off-by tags in patches
in the kernel repository. In particular, if one looks at signoffs by
developers other than the author of the patch, one gets a sense for who the
subsystem maintainers responsible for getting patches into the mainline
are. In the 3.11 cycle, the top gatekeepers were:
| Most non-author signoffs in 3.11 |
| By developer |
| Greg Kroah-Hartman | 1212 | 12.3% |
| David S. Miller | 801 | 8.1% |
| Andrew Morton | 611 | 6.2% |
| Mauro Carvalho Chehab | 371 | 3.8% |
| John W. Linville | 285 | 2.9% |
| Mark Brown | 276 | 2.8% |
| Daniel Vetter | 264 | 2.7% |
| Simon Horman | 252 | 2.6% |
| Linus Walleij | 236 | 2.4% |
| Benjamin Herrenschmidt | 172 | 1.7% |
| Kyungmin Park | 157 | 1.6% |
| James Bottomley | 143 | 1.4% |
| Ingo Molnar | 132 | 1.3% |
| Rafael J. Wysocki | 131 | 1.3% |
| Kukjin Kim | 121 | 1.2% |
| Dave Airlie | 121 | 1.2% |
| Shawn Guo | 121 | 1.2% |
| Felipe Balbi | 119 | 1.2% |
| Johannes Berg | 117 | 1.2% |
| Ralf Baechle | 110 | 1.1% |
|
| By employer |
| Red Hat | 2156 | 21.9% |
| Linux Foundation | 1249 | 12.7% |
| Intel | 904 | 9.2% |
| Google | 788 | 8.0% |
| Linaro | 759 | 7.7% |
| Samsung | 429 | 4.4% |
| (None) | 408 | 4.1% |
| IBM | 332 | 3.4% |
| Renesas Electronics | 259 | 2.6% |
| SUSE | 249 | 2.5% |
| Texas Instruments | 237 | 2.4% |
| Parallels | 143 | 1.5% |
| Wind River | 126 | 1.3% |
| (Unknown) | 124 | 1.3% |
| Wolfson Microelectronics | 114 | 1.2% |
| Broadcom | 97 | 1.0% |
| Fusion-IO | 89 | 0.9% |
| OLPC | 87 | 0.9% |
| (Consultant) | 86 | 0.9% |
| Cisco | 80 | 0.8% |
|
We first looked at signoffs for 2.6.22 in
2007. Looking now, there are many of the same names on the list — but also
quite a few changes. As is the case with other aspects of kernel
development, the changes in signoffs reflect the growing importance of the
mobile and embedded sector. The good news, as reflected in these numbers,
is that mobile and embedded developers are finding roles as subsystem
maintainers, giving them a stronger say in the direction of kernel
development going forward.
Persistence of code
Finally, it has been some time since we looked
at persistence of code over time; in particular, we examined how much
code from each development cycle remained in the 2.6.33 kernel. This
information is obtained through the laborious process of running
"git blame" on each file, looking at the commit associated
with each line, and mapping that to the release in which that commit was
merged. Doing the same thing now yields a plot that looks like this:
From this we see that the code added for 3.11 makes up a little over 4% of
the kernel as a whole; as might be expected, the percentage drops as one
looks at older releases. Still, quite a bit of code from the early
2.6.30's remains untouched to this day. Incidentally, about 19% of the
code in the kernel has not been changed since the beginning of the git era;
there are still 545 files that have not
been changed at all since the 2.6.12 development cycle.
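For the curious, the "git blame" procedure described above can be approximated with a short script. What follows is a rough sketch and not the tooling used to produce these numbers: the function names are invented for illustration, it assumes a git checkout is available, and it uses "git describe --contains" as one plausible way to map a commit to the first tag containing it.

```python
import re
import subprocess
from collections import Counter

# In "git blame --line-porcelain" output, each line's record starts with
# "<40-hex commit id> <original line> <final line> [<group size>]".
HEADER = re.compile(r"^([0-9a-f]{40}) \d+ \d+")


def lines_per_commit(porcelain_text):
    """Count surviving lines per commit from --line-porcelain output."""
    counts = Counter()
    current = None
    for line in porcelain_text.split("\n"):
        if line.startswith("\t"):
            # A tab-prefixed line is the file content for the current record.
            if current is not None:
                counts[current] += 1
            continue
        match = HEADER.match(line)
        if match:
            current = match.group(1)
    return counts


def blame_file(path, repo="."):
    """Run git blame on one file and return a Counter of lines per commit."""
    result = subprocess.run(
        ["git", "-C", repo, "blame", "--line-porcelain", path],
        check=True, capture_output=True, text=True, errors="replace",
    )
    return lines_per_commit(result.stdout)


def first_release(commit, repo="."):
    """Return the first tag containing a commit, e.g. 'v3.11-rc1'."""
    result = subprocess.run(
        ["git", "-C", repo, "describe", "--contains", commit],
        check=True, capture_output=True, text=True,
    )
    # Output looks like "v3.11-rc1~12^2~3"; keep only the tag name.
    return re.split(r"[~^]", result.stdout.strip())[0]
```

Summing the per-commit counts by release over every file in the tree gives the per-release totals; on a tree the size of the kernel, that takes a while.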
Another way to look at things would be to see how many lines from each
cycle were in the kernel in 2.6.33 (the last time this exercise was done)
compared to what's there now. That yields:
Thus, for example, the 2.6.33 kernel had about 400,000 lines from 2.6.26;
of those, about 290,000 remain in 3.11. One other thing that stands out is
that the early 2.6.30 development cycles saw fewer changesets merged into the
mainline than, say, 3.10 did, but they added more code. Much of that code
has since been changed or removed, though. Given that much of that code
went into the staging tree, this result is not entirely surprising; the
whole point of putting code into staging is to set it up for rapid change.
Actually, "rapid change" describes just about all of the data presented
here. The kernel process continues to absorb changes at a surprising and,
seemingly, increasing rate without showing any serious signs of strain.
There is almost certainly a limit to the scalability of the current
process, but we do not appear to have found it yet.
By Jonathan Corbet
August 20, 2013
Some ideas take longer than others to find their way into the mainline
kernel. The network firewalling mechanism known as "nftables" would
be a case in point. Much of this work was done in 2009; despite showing
a lot of promise at the time, the work languished for years afterward.
But, now, there would appear to be a critical mass of developers working on
nftables, and we may well see it merged in the relatively near future.
A firewall works by testing a packet against a chain of one or more rules.
Any of those rules may decide that the packet is to be accepted or
rejected, or it may defer judgment for subsequent rules. Rules may include
tests such as "which TCP port is this packet destined for?", "is the source IP
address on a trusted network?", or "is this packet associated with a known,
open connection?". Since the tests applied to packets are
expressed in networking terms (ports, IP addresses, etc.), the code that
implements the firewall subsystem ("netfilter") has traditionally contained
a great deal of protocol awareness. In fact, this awareness is built so
deeply into the code that it has had to be replicated four times — for
IPv4, IPv6, ARP, and Ethernet bridging — because the firewall engines are
too protocol-specific to be used in a generic manner.
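The rule-chain model described above is easy to sketch in user space. The following is a conceptual illustration only; the packet representation (a plain dictionary), the rule helpers, and the default verdict are all assumptions made for the example and bear no relation to how netfilter is implemented.

```python
import ipaddress

ACCEPT, REJECT = "accept", "reject"


def dest_port_is(port, verdict):
    """Rule: give a verdict if the packet is destined for the given port."""
    def rule(packet):
        return verdict if packet["dport"] == port else None
    return rule


def source_in(network, verdict):
    """Rule: give a verdict if the source address is on the given network."""
    net = ipaddress.ip_network(network)

    def rule(packet):
        return verdict if ipaddress.ip_address(packet["src"]) in net else None
    return rule


def established(verdict):
    """Rule: give a verdict if the packet belongs to a known connection."""
    def rule(packet):
        return verdict if packet.get("established") else None
    return rule


def run_chain(rules, packet, default=REJECT):
    """Test the packet against each rule; the first verdict wins.

    A rule returning None defers judgment to the rules that follow.
    """
    for rule in rules:
        verdict = rule(packet)
        if verdict is not None:
            return verdict
    return default


CHAIN = [
    established(ACCEPT),
    source_in("10.0.0.0/8", ACCEPT),
    dest_port_is(23, REJECT),
    dest_port_is(22, ACCEPT),
]

if __name__ == "__main__":
    print(run_chain(CHAIN, {"src": "203.0.113.5", "dport": 22}))
    print(run_chain(CHAIN, {"src": "203.0.113.5", "dport": 23}))
```

Note that every rule helper here knows something about ports or addresses; that is the protocol awareness in miniature.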
That duplication of code is one of a number of shortcomings in netfilter
that have long driven a desire for a replacement. In 2009, it appeared that
such a replacement was in the works when Patrick McHardy announced his nftables project. Nftables replaces the
multiple netfilter implementations with a single packet filtering engine
built on an in-kernel virtual machine, unifying firewalling at the expense
of putting (another) bytecode interpreter into the kernel. At the time,
the reaction to the idea was mostly positive, but work stalled on nftables
just the same. Patrick committed some changes in July 2010; after that, he
made no more commits for more than two years.
Frustrations with the current firewalling code did not just go away,
though. Over time, it also became clear that a general-purpose in-kernel
packet classification engine could find uses beyond firewalls; packet
scheduling is another fairly obvious possibility. So, in October 2012,
current netfilter maintainer Pablo Neira Ayuso announced that he was resurrecting Patrick's
nftables patches with an eye toward relatively quick merging into the
mainline. Since then, development of the code has accelerated, with
nftables discussion now generating much of the traffic on the netfilter
mailing list.
Nftables as it exists today is still built on the core principles designed
by Patrick. It adds a simple virtual machine to the kernel that is able to
execute bytecode to inspect a network packet and make decisions on how that
packet should be handled. The operations implemented
by this machine are intentionally basic: it can get data from the packet
itself, look at the associated metadata (which interface the packet arrived
at, for example), and manage connection tracking data. Arithmetic,
bitwise, and comparison operators can be used to make decisions based on
that data.
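To give a feel for what such a machine might look like, here is a deliberately tiny interpreter. The instruction names, their encoding as tuples, and the register model are all invented for this sketch; they should not be read as the actual nftables instruction set. The "packet" in the example is just a bare TCP header, in which bytes 2 and 3 hold the destination port.

```python
def run(program, packet, meta):
    """Execute a toy rule program against a packet.

    program: list of tuples, each an (opcode, *operands) instruction
    packet:  raw packet bytes
    meta:    dict of packet metadata, e.g. {"iif": "eth0"}
    Returns a verdict string, or None if the rule does not match.
    """
    regs = {}
    for op, *args in program:
        if op == "payload":
            # Load `length` bytes at `offset` from the packet into a register.
            reg, offset, length = args
            regs[reg] = int.from_bytes(packet[offset:offset + length], "big")
        elif op == "meta":
            # Load a piece of packet metadata into a register.
            reg, key = args
            regs[reg] = meta[key]
        elif op == "and":
            # Bitwise AND of a register with a constant mask.
            reg, mask = args
            regs[reg] &= mask
        elif op == "cmp_eq":
            # A failed comparison ends evaluation with no verdict.
            reg, value = args
            if regs[reg] != value:
                return None
        elif op == "verdict":
            return args[0]
        else:
            raise ValueError(f"unknown opcode: {op}")
    return None


# "Accept packets arriving on eth0 with TCP destination port 22."
SSH_RULE = [
    ("meta", 0, "iif"),
    ("cmp_eq", 0, "eth0"),
    ("payload", 1, 2, 2),
    ("cmp_eq", 1, 22),
    ("verdict", "accept"),
]


def tcp_header(sport, dport):
    """Build a minimal 20-byte TCP header with only the ports filled in."""
    return sport.to_bytes(2, "big") + dport.to_bytes(2, "big") + bytes(16)


if __name__ == "__main__":
    print(run(SSH_RULE, tcp_header(40000, 22), {"iif": "eth0"}))
    print(run(SSH_RULE, tcp_header(40000, 80), {"iif": "eth0"}))
```

The interpreter itself knows nothing about TCP; the knowledge that the destination port lives at offset 2 is carried entirely by the program it is handed.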
The virtual machine is capable of manipulating sets of data (typically IP
addresses), allowing multiple comparison operations to be replaced with a
single set lookup. There is also a "map" type that can be used to store
packet decisions directly under a key of interest — again, usually an IP
address. So, for example, a whitelist map could hold a set of known IP
addresses, associating an "accept" verdict with each.
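The value of sets and maps can be seen in a small sketch. The addresses below come from the ranges reserved for documentation, and the helper names are made up for the example; this illustrates the idea only and says nothing about how nftables stores its sets.

```python
import ipaddress


def addr(text):
    """Parse a textual IP address into a hashable address object."""
    return ipaddress.ip_address(text)


BLOCKED_LIST = [addr("192.0.2.1"), addr("192.0.2.7"), addr("198.51.100.9")]


def blocked_by_comparisons(src):
    """One comparison per address: the cost grows with the list."""
    source = addr(src)
    for candidate in BLOCKED_LIST:
        if source == candidate:
            return True
    return False


BLOCKED_SET = frozenset(BLOCKED_LIST)


def blocked_by_set(src):
    """A single set lookup replaces the whole series of comparisons."""
    return addr(src) in BLOCKED_SET


# A map stores the verdict directly under the key of interest.
WHITELIST = {
    addr("203.0.113.10"): "accept",
    addr("203.0.113.11"): "accept",
}


def verdict_for(src, default=None):
    """Look up the verdict for a source address; None means 'no opinion'."""
    return WHITELIST.get(addr(src), default)


if __name__ == "__main__":
    print(blocked_by_set("192.0.2.7"), blocked_by_set("192.0.2.8"))
    print(verdict_for("203.0.113.10"), verdict_for("203.0.113.99"))
```

Both lookups give the same answers as the comparison loop, but their cost does not grow with the number of addresses in the way a chain of individual comparisons does.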
Replacing the current, well-tuned firewalling code with a dumb virtual
machine may seem like a step backward. As it happens, there are signs that
the virtual machine may be faster than the code it replaces, but there are
a number of other advantages independent of performance. At the top of the
list is removing all of the protocol awareness from the decision engine,
allowing a single implementation to serve everywhere a packet inspection
engine is required. The protocol awareness and associated intelligence
can, instead, be pushed out to user space.
Nftables also offers an improved user-space API that allows the atomic
replacement of one or more rules with a single netlink transaction. That
will speed up firewall changes for sites with large rulesets; it can also
help to avoid race conditions while the rule change is being executed.
The code worked reasonably well in 2009, though there were a lot of loose
ends to tie down. At the top of Pablo's list of needed improvements to
nftables when he picked up the project was a bulletproof compatibility
layer for existing netfilter-based
firewalls. A new rule compiler will take existing firewall rules and
compile them for the nftables virtual machine, allowing current firewall
setups to migrate with no changes needed. This compatibility code should
allow nftables to replace the current netfilter tables relatively quickly.
Even so, chances are that both mechanisms will have to coexist in the
kernel for years. One of the other design goals behind nftables — use of
the existing netfilter hook points, connection-tracking infrastructure, and
more — will make that coexistence relatively easy.
Since the work on nftables restarted, the repository has seen over 70
commits from a half-dozen developers; there has also been a lot of work
going into the user-space nft tool and libnftables
library. The kernel changes have added missing features (the ability to
restore saved counter values, for example), compatibility hooks allowing
existing netfilter extensions
to be used until their nftables replacements
are ready, many improvements to the rule update mechanism, IPv6 NAT
support, packet tracing support, ARP filtering support, and more. The
project appears to have picked up some momentum; it seems unlikely to fall
into another multi-year period without activity before being merged.
As to when that merge will happen...it is still too early to say. The
developers are closing in on their set of desired features, but the code has
not yet been exposed to wide review beyond the netfilter list. All that
can be said with certainty is that it appears to be getting closer and to
have the development resources needed to finish the job.
See the nftables web
page for more information. A terse but
useful HOWTO document has been posted by Eric Leblond; it is probably
required reading for anybody wanting to play with this code, but a quick,
casual
read will also answer a number of questions about what firewalling will look
like in the nftables era.
Page editor: Jonathan Corbet