Brief items
The current development kernel is 3.1-rc6,
released on September 14. Things
continue to move slowly in the absence of kernel.org, so there aren't that
many changes this time around. "Nothing really stands out. Have at
it, and let us know of any outstanding regressions." The repository
is still hosted at Github, naturally.
Stable updates: no stable updates have been released in the last
week.
In short, spatch files can be used on target directories to
generate patches. spdiff can read a patch file and generate an
spatch file for you. What this means for the backporting world is
if you backport one evolutionary change in the Linux kernel for one
driver you can then backport the same change for *all*
drivers. This is a quantum leap in terms of effort required to
backport.
--
Luis Rodriguez makes backporting easy
I'm not sure derivative works law is quite so clear cut, but then
'provide a clear concise definition of derivative works' appears to
be the legal version of The Goldbach Conjecture.
--
Alan Cox
By Jonathan Corbet
September 13, 2011
The security problems at kernel.org have raised concerns about the kernel
source and other software hosted there. There has been no evidence, so
far, that kernel.org was used to distribute any corrupted software. But
there is another aspect
to this break-in: kernel.org is "down for maintenance" and there is no word
as to when it might come back. As a result, even if no malware was
distributed, the kernel.org crack represents a denial of service attack of
significant proportions.
Linus has released two 3.1-rc versions from a temporary site at Github, but
there's not a lot of work to be found there. Among other
things, the loss of all the repositories hosted on kernel.org means that
there is relatively little for him to pull. Stephen Rothwell, meanwhile,
continues to pull the trees he can reach to create linux-next. He is able
to report integration and build problems, but cannot put the tree where others can reach it.
"Besides, I am having a nice restful time." There have been no
stable tree updates since kernel.org went down.
Alternative trees are beginning to pop up across the net as developers find
other places to host their work for now. If the kernel.org outage
continues for some time, we can expect to see many more of those show up -
though some developers are refusing to set
up alternative repositories.
Most of the substitute trees are described as temporary; it will be
interesting to see how many of them actually move back to kernel.org once
this episode has run its course. Some developers may decide that keeping
their trees elsewhere works better for them.
We may have a distributed source control system, but it has become clear
that the kernel community works with a rather centralized hosting and distribution
infrastructure.
The loss of kernel.org has slowed things enough to make it
clear that the process has a single point of failure built into it.
Whether that is worth fixing is not entirely clear; no code should have
been lost and, if kernel.org were ever to disappear permanently, the
process could be back to full speed on other systems in short order. For
now, though, we're seeing things disrupted in a way few other events have
been able to manage. It's interesting to ponder what would have
happened had the compromise come out during the merge window.
Kernel development news
By Jonathan Corbet
September 13, 2011
Almost every service offered by Google is delivered over the Internet, so
it makes sense that the company would have an interest in improving how the
net performs. The networking session at the 2011 Linux Plumbers Conference
featured presentations from three Google developers, each of whom had a
proposal for a significant implementation change. Between the three, it
seems, there is still a lot of room for improvement in how we do
networking.
Proportional rate reduction
The "congestion window" is a TCP sender's idea of how much data it can have
in flight to the other end before it starts to overload a link in the middle.
Dropped packets are often a sign that the congestion window is too large,
so TCP implementations normally reduce the window significantly when loss
happens. Cutting the congestion window will reduce performance, though; if
the packet loss was a one-time event, that slowdown will be entirely
unnecessary. RFC 3517
describes an algorithm for bringing the connection up to speed quickly
after a lost packet, but, Nandita Dukkipati says, we can do better.
According to Nandita, a large portion of the network sessions involving
Google's servers
experience losses at some point; the ones that do can take 7-10 times
longer to complete. RFC 3517 is part of the problem. This algorithm
responds to a packet loss by immediately cutting the congestion window in
half; that means that the sending system must, if the congestion window had
been full at the time of the loss, wait for ACKs for half of the in-transit
packets before transmitting again. That causes the sender to go silent for
an extended period of time. It works well enough in simple cases (a single
packet lost in a long-lasting flow), but it tends to clog up the works when
dealing with short flows or extended packet losses.
Linux does not use strict RFC 3517 now; it uses, instead, an enhancement
called "rate halving." With this algorithm, the congestion window is not
halved immediately. Once the connection goes into loss recovery, each
incoming ACK (which will typically acknowledge the receipt of two packets
at the other end) will cause the congestion window to be reduced by a
single packet. Over the course of one full set of in-flight packets, the
window will be cut in half, but the sending system will continue to
transmit (at a lower rate) while that reduction is happening. The result
is a smoother flow and reduced latency.
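The arithmetic of rate halving can be illustrated with a small Python sketch. It is not the kernel's implementation; the function name, the 20-packet window, and the assumption that every ACK covers exactly two packets are all made up for the example.

```python
def rate_halving(cwnd, acked_per_ack=2):
    """Return the congestion window after each ACK of one recovery round.

    Simplifying assumptions (not taken from the kernel): the window was
    full when the loss was detected, every in-flight packet is eventually
    acknowledged, and each ACK covers `acked_per_ack` packets.
    """
    in_flight = cwnd
    history = []
    while in_flight > 0:
        in_flight -= min(acked_per_ack, in_flight)  # one ACK arrives
        cwnd -= 1                                   # shrink by one packet
        history.append(cwnd)
    return history
```

Starting from a window of 20 packets, ten ACKs arrive during the round and the window steps down one packet at a time until it reaches 10, which is the gradual halving described above.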
But rate halving can be improved upon. The ACKs it depends on are
themselves subject to loss; an extended loss can cause significant
reduction of the congestion window and slow recovery. This algorithm also
does not even begin the process of raising the congestion window back to
the highest workable value until the recovery process is complete. So it
can take quite a while to get back up to full speed.
The proportional rate reduction algorithm takes a different approach. The
first step is to calculate an estimate for the amount of data still in
flight, followed by a calculation of what, according to the congestion
control algorithm in use, the congestion window should now be. If the
amount of data in the pipeline is less than the target congestion window,
the system just goes directly into the TCP slow start algorithm to bring
the congestion window back up. Thus, when the connection experiences a
burst of losses, it will start trying to rebuild the congestion window
right away instead of creeping along with a small window for an extended
period.
If, instead, the amount of data in flight is at least as large as the new
congestion window, an algorithm
similar to rate halving is used. The actual reduction is calculated
relative to the new congestion window, though, rather than being a strict
one-half cut. For both large and small losses, the emphasis on using
estimates of the
amount of in-flight data instead of counting ACKs is said to make recovery
go more smoothly and to avoid needless reductions in the congestion window.
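A rough sketch of the per-ACK decision may help make the two cases concrete. The Python below is only loosely modeled on the proportional rate reduction proposal; the variable names, the slow-start bound, and the use of whole packets as the unit are assumptions made for illustration and should not be read as the code that was merged.

```python
def prr_send_quota(pipe, ssthresh, prr_delivered, prr_out, recover_fs,
                   delivered, mss=1):
    """Packets that may be sent in response to one ACK during recovery.

    All quantities are in packets; the names are illustrative:
      pipe          -- estimate of the data still in flight
      ssthresh      -- target congestion window from congestion control
      prr_delivered -- packets delivered since recovery began
      prr_out       -- packets sent since recovery began
      recover_fs    -- packets in flight when recovery began
      delivered     -- packets newly reported delivered by this ACK
    """
    if pipe < ssthresh:
        # Less data in flight than the target window: grow back toward
        # the target in slow-start fashion, bounded by what was delivered.
        limit = max(prr_delivered - prr_out, delivered) + mss
        quota = min(ssthresh - pipe, limit)
    else:
        # Otherwise send in proportion to the new window: over the whole
        # recovery, sent/delivered approaches ssthresh/recover_fs.
        allowed = (prr_delivered * ssthresh + recover_fs - 1) // recover_fs
        quota = allowed - prr_out
    return max(quota, 0)
```

With 20 packets in flight at the start of recovery and a target window of 10, the first ACK reporting two delivered packets permits one new transmission. If a burst of losses has left only four packets in flight, the same ACK permits three, so the window starts rebuilding immediately.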
How much better is it? Nandita said that Google has been running
experiments on some of its systems; the result has been a 3-10% reduction
in average latency. Recovery timeouts have been reduced by 5%. This
code is being deployed more widely on Google's servers; it has also been
accepted for merging during the 3.2 development cycle. More information
can be found in the draft RFC describing the algorithm.
TCP fast open
Opening a TCP connection requires a three-packet handshake: a SYN packet
sent by the client, a SYN-ACK response from the server, and a final ACK
from the client. Until the handshake is complete, the link can carry no
data, so the handshake imposes an unavoidable startup latency on every
connection. But what would happen, asked Yuchung Cheng, if one were to
send data with the handshake packets? For simple transactions - an HTTP
GET request followed by the contents of a web page, for example - sending
the relevant data with the handshake packets would eliminate that latency.
The result of this thought is the "TCP fast open" proposal.
RFC 793 (describing TCP)
does allow data to be passed with the handshake packets, with the proviso
that the data not be passed to applications until the handshake completes.
One can consider fudging that last requirement to speed the process of
transmitting data through a TCP connection, but there are some hazards to
be dealt with. An obvious problem is the amplification of SYN flood
attacks, which are bad enough when they only involve the kernel; if each
received SYN packet were to take up application resources as well, the
denial of service possibilities would be significantly worse.
Yuchung described an approach to fast open which is intended to get
around most of the problems. The first step is the creation of a
per-server secret which is hashed with information from each client to
create a per-client cookie. That cookie is sent to the client as a special
option on an ordinary SYN-ACK packet; the client can keep it and use it for
fast opens in the future. The requirement to get a cookie first is a low
bar for the prevention of SYN flood attacks, but it does make things a
little harder. In addition, the server's secret is changed relatively
often, and,
if the server starts to see too many connections, fast open will simply be
disabled until things calm down.
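The cookie scheme can be sketched in a few lines of Python. The talk did not say which hash is used, so the HMAC-SHA256 construction, the eight-byte cookie length, and the function names below are assumptions chosen for illustration only.

```python
import hashlib
import hmac
import os

COOKIE_LEN = 8  # bytes; an arbitrary length picked for this sketch


def new_server_secret():
    """Generate a fresh per-server secret; call again to rotate it."""
    return os.urandom(16)


def make_cookie(secret, client_ip):
    """Derive a per-client cookie from the server secret and client address."""
    digest = hmac.new(secret, client_ip.encode("ascii"), hashlib.sha256).digest()
    return digest[:COOKIE_LEN]


def cookie_is_valid(secret, client_ip, cookie):
    """Check a cookie presented by a client on a later fast-open SYN."""
    return hmac.compare_digest(make_cookie(secret, client_ip), cookie)
```

The server keeps no per-client state in this scheme; it simply recomputes the cookie from the address on the incoming SYN. Rotating the secret invalidates every outstanding cookie at once, after which clients must fall back to an ordinary open to obtain a new one.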
One remaining problem is that about 5% of the systems on the net will drop
SYN packets containing unknown options or data. There is little to be done
in this situation; TCP fast open simply will not work. The client must
thus remember cases where the fast-open SYN packet did not get through and
just use ordinary opens in the future.
Fast open will not happen by default; applications on both ends of the
connection must specifically request it. On the client side, the
sendto() system call is used to request a fast-open connection;
with the new MSG_FAST_OPEN flag, it functions like the combination
of connect() and sendmsg(). On the server side, a
setsockopt() call with the TCP_FAST_OPEN option will
enable fast opens. Either way, applications need not worry about dealing
with the fast-open cookies and such.
In Google's testing, TCP fast open has been seen to improve page load times
by anything between 4% and 40%. This technique works best in situations
where the round trip time is high, naturally; the bigger the latency, the
more value there is in removing it. A patch implementing this feature will
be submitted for inclusion sometime soon.
Briefly: user-space network queues
While the previous two talks were concerned with improving the efficiency
of data transfer over the net, Willem de Bruijn is concerned with network
processing on the local host. In particular, he is working with high-end
hardware: high-speed links, numerous processors, and, importantly, smart
network adapters that can recognize specific flows and direct packets to
connection-specific queues. By the time the kernel gets around to thinking
about a given packet at all, it will already be sorted into the proper
place, waiting for the application to ask for the data.
Actual processing of the packets will happen in the context of the
receiving process as needed. So it all happens in the right context and on
the right CPU; intermediate processing at the software IRQ level will be
avoided. Willem even described a new interface whereby the application
would receive packets directly from the kernel via a shared memory
segment.
In other words, this talk described a variant of the network channels
concept, where packet processing is pushed as close to the application as
possible. There are numerous details to be dealt with, including the usual
hangups for the channels idea: firewall processing and such. The proposed
use of a file in sysfs to pass packets to user space also seems unlikely to
pass review. But this work may eventually reach a point where it is
generally useful; those who are interested can find the patches on the unetq page.
By Jonathan Corbet
September 14, 2011
As Linaro's CTO, David Rusling spends a lot of time observing the
interactions between the ARM architecture and the mainline kernel
development community. In his Linux Plumbers Conference 2011 keynote,
David made the point that ARM's diversity is behind many of the problems
that have made themselves felt in recent years. Much is being done to
align the ARM community with how the kernel works, but the kernel, too, is
going to have to change if it is to successfully address the challenges
posed by increasingly diverse hardware.
David started with a brief note to the effect that he dislikes the
"embedded" term. If a system is connected to the Internet, he said, it is
no longer embedded. Now that everything is so connected, it is time to
stop using that term, and time to stop having separate conferences for
embedded developers. It's all just Linux now.
ARM brings diversity
ARM is a relative newcomer to the industry, having been born in 1990 as
part of a joint venture between Acorn, VLSI, and Apple. The innovative
aspect to ARM was its licensing model; rather than being a processor
produced by a single manufacturer, ARM is a processor design that is
licensed to many manufacturers. The overall architecture for systems built
around
ARM is not constrained by that license, so each vendor creates its own
platform to meet its particular needs. The result has been a lot of
creativity and variety in the hardware marketplace, and a great deal of
commercial success. David estimated that each attendee in the room was
carrying about ten ARM processors; they show up in phones (several of them,
not just "the" processor), in disk controllers, in network interfaces,
etc.
Since each vendor can create a new platform (or more than one), there is no
single view of what makes an ARM processor. Developers working with ARM
usually work with a single vendor's platform and tend not to look beyond
that platform. They are also working under incredibly tight deadlines;
four months from product conception to availability on the shelves is not
uncommon. There is a lot of naivety about open source software, its
processes, and the licensing. In this setting, David said, fragmentation
was inevitable. Linaro was formed in response, in an attempt to help
the ARM community work better with the kernel development community; its
prime mission is to bring about some consolidation in the ARM code base.
Beyond that, he said, Linaro seeks to promote collaboration; without
that, the community will be able to achieve very little. Companies working
in the ARM space recognize the need to collaborate, but they are sometimes
less clear on just which problems they should be trying to solve.
Once upon a time, Microsoft was the dominant empire and Linux was the
upstart rebel child. Needless to say, Linux has been successful in many
areas; it is now settling, he said, into a comfortable middle age. But this
has all happened in the context of the PC architecture, which is not
particularly diverse, so Linux, too, is not hugely diverse. It's also
worth noting that, in this environment, hardware does not ship until
Windows runs on it; making Linux work is often something that comes
afterward.
The mobile world is different;
Android, he said, has become the de facto standard mobile Linux
distribution. It has become known for its "fork, rebase, repeat"
development cycle. Android runs on systems with highly integrated graphics
and media processors, and it is developed with an obsession about battery
lifetime. In this world, things have turned around: now the hardware will
not ship until Linux runs on it. Given the time pressures involved, it is
no wonder, he said, that forking happens.
In the near future we are going to see the arrival of ARM-based server
systems; that is going to stir things up again. They will be very
different from existing servers - and from each other; the diversity of the
ARM world will be seen again. There will be a significant long-term impact
on the kernel as a result. For example, scheduling will have to become
much more aware of power management and thermal management issues. Low
power use will always be a concern, even in the server environment.
Problems to solve
Making all of this work is going to require greater collaboration between
the ARM and kernel communities. ARM developers are developing the habits
needed to work with upstream; the situation is much better than it was a
few years ago. But we are going to need a lot more kernel developers with
an ARM background, and they are going to have to get together and talk to
each other more often. Some of that is beginning to happen; Linaro is
trying to help with this process.
A big problem to deal with, he said, was boot architecture: what happens on
the system before the kernel runs. Regardless of architecture, the boot
systems are all broken and all secret; developers hate them. In the end we
have to communicate system information to the kernel; now we are using
features like ACPI or techniques like flattened device trees. We are
seeing new standards (like UEFI) emerging, but, he asked, are we
influencing those standards enough?
Taking things further: will there be a single ARM platform such that one
kernel can run on any system? The answer was "maybe," but, if so, it is
going to take some time. We're currently in a world where we have many
such platforms - OMAP, iMX, etc. - and pulling them together will be hard.
We need to teach ARM developers that not all code they develop belongs in
their platform tree - or in arch/arm at all. The process of
looking for patterns and turning them into generic code must continue. The
ARM community is working toward the goal of creating a generic kernel;
there are lots of interesting challenges to face, but other architectures
have faced them before.
One step in the right direction is the recent creation of the arm-soc tree,
managed by Arnd Bergmann. The goal of this tree is to support Russell King
(the top-level ARM maintainer) and the platform maintainers and to increase
the efficiency of the whole process. The arm-soc tree has become the path
for much of the ARM consolidation work to get into the mainline kernel.
Returning briefly to power management, David noted that ARM-based systems
usually have no fans. The kernel needs a better thermal management
framework to keep the whole thing from melting. And that framework will
have to reach throughout the kernel; the scheduler may, for example, need
to move processes away from an overheating core to allow it to cool down.
Everywhere we look, he said, we need better instrumentation so we have a
better idea of what is happening with the hardware.
More efficient buffer management is a high priority for ARM devices;
copying data uses power and generates heat, so copying needs to be avoided
whenever possible. But existing kernel mechanisms are not always a good
match to the ARM world, where one can encounter a plethora of memory
management units, weakly-ordered memory, and more. There are a lot of
solutions in the works, including CMA, a reworked DMA mapping framework, and more, but
they are not all yet upstream.
In summary, we have some problems to solve. There is an inevitable tension
between product release plans and kernel engineering. Product release
cycles have no space for the "argument time" required to get features into
the mainline kernel. It is, he said, a social engineering problem that we
have to solve. It will certainly involve forking the kernel at times; the
important part is joining back with the mainline afterward. And, he asked,
do we really need to have everything in the kernel? Perhaps, in the case
of "throwaway devices" with short product lives, we don't really need to
have all that code upstream.
If we are going to scale the kernel across the diversity of contemporary
hardware, he said, we will have to maintain a strong focus on making our
code work on all systems. We'll have to continue to address the tensions
between mobile and server Linux, and we'll have to make efforts to cross
the kernel/user-space border and solve problems on both sides. This is a
discussion we will be having for some time, he said; events like the
Linux Plumbers Conference are the ideal place for that discussion.
By Jonathan Corbet
September 13, 2011
Approximately one year after describing bufferbloat to the world and
starting his campaign to remedy the problem, Jim Gettys traveled to the
2011 Linux Plumbers Conference to update the audience on the current state
of affairs. A lot of work is being done to address the bufferbloat
problem, but even more remains to be done.
"Bufferbloat" is the problem of excessive buffering used at all layers of
the network, from applications down to the hardware itself. Large buffers
can create obvious latency problems (try uploading a large file from a home
network while somebody else is playing a fast-paced network game and you'll
be able to measure the latency from the screams of frustration in the other
room), but the real issue is deeper than that. Excessive buffering wrecks
the control loop that enables implementations to maximize throughput
without causing excessive congestion on the net. The experience of the
late 1980s showed how bad a congestion-based collapse of the net can be;
the idea that bufferbloat might bring those days back is frightening to
many.
The initial source of the problem, Jim said, was the myth that dropping
packets is a bad thing to do combined with the fact that it is no longer
possible to buy memory in small amounts. The truth of the matter is that
the timely
dropping of packets is essential; that is how the network signals to
transmitters that they are sending too much data. The problem is
complicated by the use of the bandwidth-delay
product to size buffers. Nobody really knows what either the bandwidth
or the delay is for a typical network connection. Networks vary widely;
wireless networks can be made to vary considerably just by moving across
the room. In this environment, he said, no static buffer size can ever be
correct, but that is exactly what is being used at many levels.
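To see why no single size can be right, consider the bandwidth-delay product itself. The following Python sketch computes it for two links; the link speeds and round-trip times are hypothetical values chosen for the example, not figures from the talk.

```python
def bdp_bytes(bandwidth_bits_per_s, rtt_ms):
    """Bandwidth-delay product in bytes for a link speed and round-trip time."""
    # bits/s * ms = millibits; divide by 1000 for bits and by 8 for bytes
    return bandwidth_bits_per_s * rtt_ms // 8000


# Two hypothetical paths through the same piece of equipment
fast_nearby = bdp_bytes(100_000_000, 10)   # 100 Mbit/s, 10 ms round trip
slow_distant = bdp_bytes(1_000_000, 200)   # 1 Mbit/s, 200 ms round trip
```

The first path works out to 125,000 bytes and the second to 25,000 bytes. A buffer sized for one is wrong by a factor of five for the other, and a wireless link can move between such extremes as the user walks across the room.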
As a result, things are beginning to break. Protocols that cannot handle
much in the way of delay or loss - DNS, ARP, DHCP, VOIP, or games, for
example - are beginning to suffer. A large proportion of broadband links,
Jim said, are "just busted." The edge of the net is broken, but the
problem is more widespread than that; Jim fears that bloat can be found
everywhere.
If static buffer sizes cannot work, buffers must be sized dynamically. The
RED (random early detection) algorithm is meant to do
that sizing, but it suffers from one little problem: it doesn't actually
work. The problem, Jim said, is that the algorithm knows about the size of
a given buffer, but it knows nothing about how quickly that buffer is
draining. Even so, it can improve things in some situations. But
it requires quite a bit of tuning to work right, so a lot of service
providers simply do not bother. Efforts to create an improved version of
RED are underway, but the results are not yet available.
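For readers unfamiliar with RED, the core of the classic algorithm can be sketched as follows. This is a simplified Python rendition with hypothetical parameter names; real implementations add further refinements, and the thresholds are exactly the tuning knobs that providers find troublesome.

```python
def update_average(avg_queue, current_queue, weight):
    """Exponentially weighted moving average of the queue length."""
    return (1 - weight) * avg_queue + weight * current_queue


def red_drop_probability(avg_queue, min_th, max_th, max_p):
    """Probability of dropping an arriving packet under classic RED.

    Below min_th nothing is dropped; at or above max_th everything is;
    in between, the probability rises linearly up to max_p.
    """
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)
```

Note that the only input is the (averaged) queue length. Nothing here measures how fast the queue is draining, which is the shortcoming Jim described.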
A real solution to bufferbloat will have to be deployed across the entire
net. There are some things that can be done now; Jim has spent a lot of
time tweaking his home router to squeeze out excessive buffering. The
result, he said, involved throwing away a bit of bandwidth, but the
resulting network is a lot nicer to use. Some of the fixes are fairly
straightforward; Ethernet buffering, for example, should be proportional to
the link speed. Ring buffers used by network adapters should be reviewed
and reduced; he found himself wondering why a typical adapter uses the same
size for the transmit and receive buffers. There is also an extension to
the DOCSIS
standard in the works to allow ISPs to remotely tweak the amount of buffering
employed in cable modems.
A complete solution requires more than that, though. There are a lot of
hidden buffers out there in unexpected places; many of them will be hard to
find. Developers need to start thinking about buffers in terms of time,
not in terms of bytes or packets. And we'll need active queue management
in all devices and hosts; the only problem is that nobody really knows
which queue management algorithm will actually solve the problem. Steve
Hemminger noted that there are no good multi-threaded queue-management
algorithms out there.
CeroWRT
Jim yielded to Dave Täht, who talked about the CeroWRT router
distribution. Dave pointed out that, even when we figure out how to tackle
bufferbloat, we have a small problem: actually getting those fixes to
manufacturers and, eventually, users. A number of popular routers are
currently shipping with 2.6.16 kernels; it is, he said, the classic
embedded Linux problem.
One router distribution that is doing a better job of keeping up with the
mainline is OpenWRT. Appropriately,
CeroWRT is based on OpenWRT; its purpose is to complement
the debloat-testing kernel tree and provide
a platform for real-world testing of bufferbloat fixes. The goals behind
CeroWRT are to always be within a release or two of the mainline kernel,
to provide reproducible results for network testing, and to be reliable enough
for everyday use while being sufficiently experimental to accept new stuff.
There is a lot of new stuff in CeroWRT. It has fixes to the packet
aggregation code used in wireless drivers that can, in its own right, be a
source of latency. The length of the transmit queues used in network
interfaces has been reduced to eight packets - significantly smaller than
the default values, which can be as high as 1000. That change alone is
enough, Dave said, to get quality-of-service processing working properly
and, he thinks, to push the real buffering bottleneck to the receive side
of the equation.
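The effect of the queue length is easier to appreciate when it is expressed as time rather than as a packet count. The Python sketch below does that conversion; the 1500-byte packet size and the 1 Mbit/s uplink are assumptions chosen for illustration.

```python
def queue_delay_ms(packets, packet_bytes, link_bits_per_s):
    """Time, in milliseconds, needed to drain a full transmit queue."""
    return packets * packet_bytes * 8 * 1000 / link_bits_per_s


# A hypothetical 1 Mbit/s uplink carrying full-size Ethernet frames
default_queue = queue_delay_ms(1000, 1500, 1_000_000)
short_queue = queue_delay_ms(8, 1500, 1_000_000)
```

With those assumed numbers, a full 1000-packet queue holds 12 seconds of data, while an eight-packet queue holds about 96 milliseconds. Any quality-of-service decision made above a 12-second queue arrives far too late to matter.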
CeroWRT runs a tickless kernel, and enables protocol extensions like
explicit congestion notification (ECN), selective acknowledgments (SACK),
and duplicate SACK (DSACK) by default. A number of speedups have also been
applied to the core netfilter code.
CeroWRT also includes a lot of interesting software, including just about
every network testing tool the developers could get their hands on. Six
TCP congestion algorithms are available, with Westwood used by default.
Netem (a network emulator package)
has been put in to allow the simulation of packet loss and delay.
There is a bind9 DNS server with an extra-easy DNSSEC setup. Various mesh
networking protocols are supported. A lot of data collection and tracing
infrastructure has been added from the web10g project, but Dave has not yet
found a real use for the data.
All told, CeroWRT looks like a useful tool for validating work done in the
fight against bufferbloat. It has not yet reached its 1.0 release, though;
there are still some loose ends to tie and some problems to be fixed. For
now, it only works on the Netgear WNDR3700v2 router - chosen for its open
hardware and relatively large amount of flash storage. CeroWRT should be
ready for general use before too long; fixing the bufferbloat problem is
likely to take rather longer.
[Your editor would like to thank LWN's subscribers for supporting his
travel to LPC 2011.]
Patches and updates
Kernel trees
- Thomas Gleixner: 3.0.4-rt13 (September 12, 2011)
Page editor: Jonathan Corbet