
Kernel development

Brief items

Kernel release status

The current development kernel is 3.1-rc6, released on September 14. Things continue to move slowly in the absence of kernel.org, so there aren't that many changes this time around. "Nothing really stands out. Have at it, and let us know of any outstanding regressions." The repository is still hosted at Github, naturally.

Stable updates: no stable updates have been released in the last week.


Quotes of the week

In short, spatch files can be used on target directories to generate patches. spdiff can read a patch file and generate an spatch file for you. What this means for the backporting world is if you backport one evolutionary change in the Linux kernel for one driver you can then backport the same change for *all* drivers. This is a quantum leap in terms of effort required to backport.
-- Luis Rodriguez makes backporting easy

I'm not sure derivative works law is quite so clear cut, but then 'provide a clear concise definition of derivative works' appears to be the legal version of The Goldbach Conjecture.
-- Alan Cox


Kernel development without kernel.org

By Jonathan Corbet
September 13, 2011
The security problems at kernel.org have raised concerns about the kernel source and other software hosted there. There has been no evidence, so far, that kernel.org was used to distribute any corrupted software. But there is another aspect to this breakin: kernel.org is "down for maintenance" and there is no word as to when it might come back. As a result, even if no malware was distributed, the kernel.org crack represents a denial of service attack of significant proportions.

Linus has released two 3.1-rc versions from a temporary site at Github, but there's not a lot of work to be found there. Among other things, the loss of all the repositories hosted on kernel.org means that there is relatively little for him to pull. Stephen Rothwell, meanwhile, continues to pull the trees he can reach to create linux-next. He is able to report integration and build problems, but cannot put the tree where others can reach it. "Besides, I am having a nice restful time." There have been no stable tree updates since kernel.org went down.

Alternative trees are beginning to pop up across the net as developers find other places to host their work for now. If the kernel.org outage continues for some time, we can expect to see many more of those show up - though some developers are refusing to set up alternative repositories. Most of the substitute trees are described as temporary; it will be interesting to see how many of them actually move back to kernel.org once this episode has run its course. Some developers may decide that keeping their trees elsewhere works better for them. We may have a distributed source control system, but it has become clear that the kernel community works with a rather centralized hosting and distribution infrastructure.

The loss of kernel.org has slowed things enough to make it clear that the process has a single point of failure built into it. Whether that is worth fixing is not entirely clear; no code should have been lost and, if kernel.org were ever to disappear permanently, the process could be back to full speed on other systems in short order. For now, though, we're seeing things disrupted in a way few other events have been able to manage. It's interesting to ponder what would have happened had the compromise come out during the merge window.


Kernel development news

LPC: Making the net go faster

By Jonathan Corbet
September 13, 2011
Almost every service offered by Google is delivered over the Internet, so it makes sense that the company would have an interest in improving how the net performs. The networking session at the 2011 Linux Plumbers Conference featured presentations from three Google developers, each of whom had a proposal for a significant implementation change. Between the three, it seems, there is still a lot of room for improvement in how we do networking.

Proportional rate reduction

The "congestion window" is a TCP sender's idea of how much data it can have in flight to the other end before it starts to overload a link in the middle. Dropped packets are often a sign that the congestion window is too large, so TCP implementations normally reduce the window significantly when loss happens. Cutting the congestion window will reduce performance, though; if the packet loss was a one-time event, that slowdown will be entirely unnecessary. RFC 3517 describes an algorithm for bringing the connection up to speed quickly after a lost packet, but, Nandita Dukkipati says, we can do better.

According to Nandita, a large portion of the network sessions involving Google's servers experience losses at some point; the ones that do can take 7-10 times longer to complete. RFC 3517 is part of the problem. This algorithm responds to a packet loss by immediately cutting the congestion window in half; that means that the sending system must, if the congestion window had been full at the time of the loss, wait for ACKs for half of the in-transit packets before transmitting again. That causes the sender to go silent for an extended period of time. It works well enough in simple cases (a single packet lost in a long-lasting flow), but it tends to clog up the works when dealing with short flows or extended packet losses.

Linux does not use strict RFC 3517 now; it uses, instead, an enhancement called "rate halving." With this algorithm, the congestion window is not halved immediately. Once the connection goes into loss recovery, each incoming ACK (which will typically acknowledge the receipt of two packets at the other end) will cause the congestion window to be reduced by a single packet. Over the course of one full set of in-flight packets, the window will be cut in half, but the sending system will continue to transmit (at a lower rate) while that reduction is happening. The result is a smoother flow and reduced latency.
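The difference between the two behaviors can be seen in a toy model. The sketch below assumes a full window in flight when the loss is detected and, as noted above, two packets acknowledged per ACK; it ignores the lost packet and its retransmission entirely. The function names are hypothetical and this is not the kernel's code.

```python
def immediate_halving(window, acks, per_ack=2):
    """Packets sent after each ACK when cwnd is cut in half at once."""
    cwnd = window // 2
    in_flight = window
    sent = []
    for _ in range(acks):
        in_flight -= per_ack                  # this ACK covers two packets
        can_send = max(0, cwnd - in_flight)   # silent until flight < cwnd
        in_flight += can_send
        sent.append(can_send)
    return sent


def rate_halving(window, acks, per_ack=2):
    """Packets sent after each ACK when cwnd shrinks by one per ACK."""
    cwnd = window
    in_flight = window
    floor = window // 2                       # stop once the window is halved
    sent = []
    for _ in range(acks):
        in_flight -= per_ack
        cwnd = max(floor, cwnd - 1)
        can_send = max(0, cwnd - in_flight)
        in_flight += can_send
        sent.append(can_send)
    return sent


if __name__ == "__main__":
    print("immediate:", immediate_halving(16, 8))
    print("rate halving:", rate_halving(16, 8))
```

With a 16-packet window, the first function sends nothing for the first four ACKs and then two packets per ACK, while the second sends one packet per ACK throughout. Both transmit the same total over one window's worth of ACKs; only the pacing differs.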

But rate halving can be improved upon. The ACKs it depends on are themselves subject to loss; an extended loss can cause significant reduction of the congestion window and slow recovery. This algorithm also does not even begin the process of raising the congestion window back to the highest workable value until the recovery process is complete. So it can take quite a while to get back up to full speed.

The proportional rate reduction algorithm takes a different approach. The first step is to calculate an estimate for the amount of data still in flight, followed by a calculation of what, according to the congestion control algorithm in use, the congestion window should now be. If the amount of data in the pipeline is less than the target congestion window, the system just goes directly into the TCP slow start algorithm to bring the congestion window back up. Thus, when the connection experiences a burst of losses, it will start trying to rebuild the congestion window right away instead of creeping along with a small window for an extended period.

If, instead, the amount of data in flight is at least as large as the new congestion window, an algorithm similar to rate halving is used. The actual reduction is calculated relative to the new congestion window, though, rather than being a strict one-half cut. For both large and small losses, the emphasis on using estimates of the amount of in-flight data instead of counting ACKs is said to make recovery go more smoothly and to avoid needless reductions in the congestion window.
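The two branches described above can be sketched as a per-ACK calculation. This is a simplified model in whole packets that follows the description in this article; the class and attribute names (`PrrState`, `recover_fs`, `prr_delivered`, `prr_out`) and the one-packet allowance in the slow-start branch are assumptions made for illustration, not the code that was merged. It also assumes the sender transmits everything it is allowed to.

```python
class PrrState:
    """Bookkeeping for one loss-recovery episode (units: packets)."""

    def __init__(self, target_cwnd, recover_fs):
        self.target_cwnd = target_cwnd  # window chosen by congestion control
        self.recover_fs = recover_fs    # data in flight when recovery began
        self.prr_delivered = 0          # packets delivered during recovery
        self.prr_out = 0                # packets sent during recovery

    def on_ack(self, delivered, pipe):
        """Return how many packets may be sent in response to this ACK.

        delivered: packets newly reported as received by this ACK
        pipe:      current estimate of the data still in flight
        """
        self.prr_delivered += delivered
        if pipe >= self.target_cwnd:
            # Proportional branch: keep sent/delivered near target/recover_fs.
            # Integer ceiling division avoids floating-point rounding.
            allowed = -(-self.prr_delivered * self.target_cwnd
                        // self.recover_fs)
            sndcnt = allowed - self.prr_out
        else:
            # Slow-start-like branch: rebuild the window right away, but
            # never send more than was delivered plus one extra packet.
            limit = max(self.prr_delivered - self.prr_out, delivered) + 1
            sndcnt = min(self.target_cwnd - pipe, limit)
        sndcnt = max(0, sndcnt)
        self.prr_out += sndcnt
        return sndcnt


if __name__ == "__main__":
    state = PrrState(target_cwnd=10, recover_fs=20)
    print([state.on_ack(delivered=1, pipe=15) for _ in range(4)])
```

With a target of half the original flight size, the proportional branch sends one packet for every two delivered, much as rate halving would; a different target simply changes the ratio.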

How much better is it? Nandita said that Google has been running experiments on some of its systems; the result has been a 3-10% reduction in average latency. Recovery timeouts have been reduced by 5%. This code is being deployed more widely on Google's servers; it also has been accepted for merging during the 3.2 development cycle. More information can be found in the draft RFC describing the algorithm.

TCP fast open

Opening a TCP connection requires a three-packet handshake: a SYN packet sent by the client, a SYN-ACK response from the server, and a final ACK from the client. Until the handshake is complete, the link can carry no data, so the handshake imposes an unavoidable startup latency on every connection. But what would happen, asked Yuchung Cheng, if one were to send data with the handshake packets? For simple transactions - an HTTP GET request followed by the contents of a web page, for example - sending the relevant data with the handshake packets would eliminate that latency. The result of this thought is the "TCP fast open" proposal.

RFC 793 (describing TCP) does allow data to be passed with the handshake packets, with the proviso that the data not be passed to applications until the handshake completes. One can consider fudging that last requirement to speed the process of transmitting data through a TCP connection, but there are some hazards to be dealt with. An obvious problem is the amplification of SYN flood attacks, which are bad enough when they only involve the kernel; if each received SYN packet were to take up application resources as well, the denial of service possibilities would be significantly worse.

Yuchung described an approach to fast open which is intended to get around most of the problems. The first step is the creation of a per-server secret which is hashed with information from each client to create a per-client cookie. That cookie is sent to the client as a special option on an ordinary SYN-ACK packet; the client can keep it and use it for fast opens in the future. The requirement to get a cookie first is a low bar for the prevention of SYN flood attacks, but it does make things a little harder. In addition, the server's secret is changed relatively often, and, if the server starts to see too many connections, fast open will simply be disabled until things calm down.
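The cookie scheme can be illustrated with a few lines of code. The talk said only that a server secret is hashed with client information; the use of HMAC-SHA256 over the client's IP address, the eight-byte truncation, and the function names below are all assumptions chosen for the sake of a runnable sketch.

```python
import hashlib
import hmac
import os


def new_server_secret():
    """Generate a fresh per-server secret; the server rotates this often."""
    return os.urandom(16)


def make_cookie(secret, client_ip):
    """Derive a per-client cookie from the server secret and client address."""
    digest = hmac.new(secret, client_ip.encode("ascii"), hashlib.sha256)
    return digest.digest()[:8]


def cookie_is_valid(secret, client_ip, cookie):
    """Check a cookie presented with a later fast-open SYN."""
    return hmac.compare_digest(make_cookie(secret, client_ip), cookie)


if __name__ == "__main__":
    secret = new_server_secret()
    cookie = make_cookie(secret, "192.0.2.1")
    print(cookie_is_valid(secret, "192.0.2.1", cookie))
    print(cookie_is_valid(new_server_secret(), "192.0.2.1", cookie))
```

The server needs no per-client state: it can recompute the cookie from the address on any incoming SYN, and rotating the secret invalidates every outstanding cookie at once.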

One remaining problem is that about 5% of the systems on the net will drop SYN packets containing unknown options or data. There is little to be done in this situation; TCP fast open simply will not work. The client must thus remember cases where the fast-open SYN packet did not get through and just use ordinary opens in the future.

Fast open will not happen by default; applications on both ends of the connection must specifically request it. On the client side, the sendto() system call is used to request a fast-open connection; with the new MSG_FAST_OPEN flag, it functions like the combination of connect() and sendmsg(). On the server side, a setsockopt() call with the TCP_FAST_OPEN option will enable fast opens. Either way, applications need not worry about dealing with the fast-open cookies and such.
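A rough idea of how the two calls fit together is shown below. The flag was presented as MSG_FAST_OPEN; this sketch instead uses the names `MSG_FASTOPEN` and `TCP_FASTOPEN`, which Python's socket module exposes on Linux systems that support the feature, and it falls back to an ordinary connect() where they are missing or refused. The helper names, the loopback address, and the queue length of 16 are illustrative assumptions.

```python
import socket
import threading

# None on platforms where Python does not expose the constants.
TCP_FASTOPEN = getattr(socket, "TCP_FASTOPEN", None)
MSG_FASTOPEN = getattr(socket, "MSG_FASTOPEN", None)


def make_server(backlog=16):
    """Listen on an ephemeral loopback port, enabling fast open if possible."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(backlog)
    if TCP_FASTOPEN is not None:
        try:
            # The option value caps the queue of pending fast-open requests.
            srv.setsockopt(socket.IPPROTO_TCP, TCP_FASTOPEN, backlog)
        except OSError:
            pass  # no kernel support: ordinary opens still work
    return srv


def open_and_send(addr, payload):
    """Connect and send in one call when possible, else connect then send."""
    cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if MSG_FASTOPEN is not None:
        try:
            sent = cli.sendto(payload, MSG_FASTOPEN, addr)
            cli.sendall(payload[sent:])
            return cli
        except OSError:
            cli.close()  # fast open refused: retry the ordinary way
            cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    cli.connect(addr)
    cli.sendall(payload)
    return cli


def demo():
    """Send one request over loopback and return what the server received."""
    srv = make_server()
    srv.settimeout(5.0)
    received = []

    def serve():
        conn, _ = srv.accept()
        with conn:
            received.append(conn.recv(1024))

    worker = threading.Thread(target=serve, daemon=True)
    worker.start()
    cli = open_and_send(srv.getsockname(), b"GET / HTTP/1.0\r\n\r\n")
    worker.join(5.0)
    cli.close()
    srv.close()
    return received[0]


if __name__ == "__main__":
    print(demo())
```

Neither side touches the cookie; whether data actually rides on the SYN depends on the kernel, its configuration, and whether the client already holds a cookie for the server.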

In Google's testing, TCP fast open has been seen to improve page load times by anything between 4% and 40%. This technique works best in situations where the round trip time is high, naturally; the bigger the latency, the more value there is in removing it. A patch implementing this feature will be submitted for inclusion sometime soon.

Briefly: user-space network queues

While the previous two talks were concerned with improving the efficiency of data transfer over the net, Willem de Bruijn is concerned with network processing on the local host. In particular, he is working with high-end hardware: high-speed links, numerous processors, and, importantly, smart network adapters that can recognize specific flows and direct packets to connection-specific queues. By the time the kernel gets around to thinking about a given packet at all, it will already be sorted into the proper place, waiting for the application to ask for the data.

Actual processing of the packets will happen in the context of the receiving process as needed. So it all happens in the right context and on the right CPU; intermediate processing at the software IRQ level will be avoided. Willem even described a new interface whereby the application would receive packets directly from the kernel via a shared memory segment.

In other words, this talk described a variant of the network channels concept, where packet processing is pushed as close to the application as possible. There are numerous details to be dealt with, including the usual hangups for the channels idea: firewall processing and such. The proposed use of a file in sysfs to pass packets to user space also seems unlikely to pass review. But this work may eventually reach a point where it is generally useful; those who are interested can find the patches on the unetq page.


LPC: Coping with hardware diversity

By Jonathan Corbet
September 14, 2011
As Linaro's CTO, David Rusling spends a lot of time observing the interactions between the ARM architecture and the mainline kernel development community. In his Linux Plumbers Conference 2011 keynote, David made the point that ARM's diversity is behind many of the problems that have made themselves felt in recent years. Much is being done to align the ARM community with how the kernel works, but the kernel, too, is going to have to change if it is to address the challenges posed by increasingly diverse hardware successfully.

David started with a brief note to the effect that he dislikes the "embedded" term. If a system is connected to the Internet, he said, it is no longer embedded. Now that everything is so connected, it is time to stop using that term, and time to stop having separate conferences for embedded developers. It's all just Linux now.

ARM brings diversity

ARM is a relative newcomer to the industry, having been born in 1990 as part of a joint venture between Acorn, VLSI, and Apple. The innovative aspect to ARM was its licensing model; rather than being a processor produced by a single manufacturer, ARM is a processor design that is licensed to many manufacturers. The overall architecture for systems built around ARM is not constrained by that license, so each vendor creates its own platform to meet its particular needs. The result has been a lot of creativity and variety in the hardware marketplace, and a great deal of commercial success. David estimated that each attendee in the room was carrying about ten ARM processors; they show up in phones (several of them, not just "the" processor), in disk controllers, in network interfaces, etc.

Since each vendor can create a new platform (or more than one), there is no single view of what makes an ARM processor. Developers working with ARM usually work with a single vendor's platform and tend not to look beyond that platform. They are also working under incredibly tight deadlines; four months from product conception to availability on the shelves is not uncommon. There is a lot of naivety about open source software, its processes, and the licensing. In this setting, David said, fragmentation was inevitable. Linaro has been formed in response in an attempt to help the ARM community work better with the kernel development community; its prime mission is to bring about some consolidation in the ARM code base. Beyond that, he said, Linaro seeks to promote collaboration; without that, the community will be able to achieve very little. Companies working in the ARM space recognize the need to collaborate, but they are sometimes less clear on just which problems they should be trying to solve.

Once upon a time, Microsoft was the dominant empire and Linux was the upstart rebel child. Needless to say, Linux has been successful in many areas; it is now settling, he said, into a comfortable middle age. But this has all happened in the context of the PC architecture, which is not particularly diverse, so Linux, too, is not hugely diverse. It's also worth noting that, in this environment, hardware does not ship until Windows runs on it; making Linux work is often something that comes afterward.

The mobile world is different; Android, he said, has become the de facto standard mobile Linux distribution. It has become known for its "fork, rebase, repeat" development cycle. Android runs on systems with highly integrated graphics and media processors, and it is developed with an obsession about battery lifetime. In this world, things have turned around: now the hardware will not ship until Linux runs on it. Given the time pressures involved, it is no wonder, he said, that forking happens.

In the near future we are going to see the arrival of ARM-based server systems; that is going to stir things up again. They will be very different from existing servers - and from each other; the diversity of the ARM world will be seen again. There will be a significant long-term impact on the kernel as a result. For example, scheduling will have to become much more aware of power management and thermal management issues. Low power use will always be a concern, even in the server environment.

Problems to solve

Making all of this work is going to require greater collaboration between the ARM and kernel communities. ARM developers are developing the habits needed to work with upstream; the situation is much better than it was a few years ago. But we are going to need a lot more kernel developers with an ARM background, and they are going to have to get together and talk to each other more often. Some of that is beginning to happen; Linaro is trying to help with this process.

A big problem to deal with, he said, was boot architecture: what happens on the system before the kernel runs. Regardless of architecture, the boot systems are all broken and all secret; developers hate them. In the end we have to communicate system information to the kernel; now we are using features like ACPI or techniques like flattened device trees. We are seeing new standards (like UEFI) emerging, but, he asked, are we influencing those standards enough?

Taking things further: will there be a single ARM platform such that one kernel can run on any system? The answer was "maybe," but, if so, it is going to take some time. We're currently in a world where we have many such platforms - OMAP, iMX, etc. - and pulling them together will be hard. We need to teach ARM developers that not all code they develop belongs in their platform tree - or in arch/arm at all. The process of looking for patterns and turning them into generic code must continue. The ARM community is working toward the goal of creating a generic kernel; there are lots of interesting challenges to face, but other architectures have faced them before.

One step in the right direction is the recent creation of the arm-soc tree, managed by Arnd Bergmann. The goal of this tree is to support Russell King (the top-level ARM maintainer) and the platform maintainers and to increase the efficiency of the whole process. The arm-soc tree has become the path for much of the ARM consolidation work to get into the mainline kernel.

Returning briefly to power management, David noted that ARM-based systems usually have no fans. The kernel needs a better thermal management framework to keep the whole thing from melting. And that framework will have to reach throughout the kernel; the scheduler may, for example, need to move processes away from an overheating core to allow it to cool down. Everywhere we look, he said, we need better instrumentation so we have a better idea of what is happening with the hardware.

More efficient buffer management is a high priority for ARM devices; copying data uses power and generates heat, so copying needs to be avoided whenever possible. But existing kernel mechanisms are not always a good match to the ARM world, where one can encounter a plethora of memory management units, weakly-ordered memory, and more. There are a lot of solutions in the works, including CMA, a reworked DMA mapping framework, and more, but they are not all yet upstream.

In summary, we have some problems to solve. There is an inevitable tension between product release plans and kernel engineering. Product release cycles have no space for the "argument time" required to get features into the mainline kernel. It is, he said, a social engineering problem that we have to solve. It will certainly involve forking the kernel at times; the important part is joining back with the mainline afterward. And, he asked, do we really need to have everything in the kernel? Perhaps, in the case of "throwaway devices" with short product lives, we don't really need to have all that code upstream.

If we are going to scale the kernel across the diversity of contemporary hardware, he said, we will have to maintain a strong focus on making our code work on all systems. We'll have to continue to address the tensions between mobile and server Linux, and we'll have to make efforts to cross the kernel/user-space border and solve problems on both sides. This is a discussion we will be having for some time, he said; events like the Linux Plumbers Conference are the ideal place for that discussion.


LPC: An update on bufferbloat

By Jonathan Corbet
September 13, 2011
Approximately one year after describing bufferbloat to the world and starting his campaign to remedy the problem, Jim Gettys traveled to the 2011 Linux Plumbers Conference to update the audience on the current state of affairs. A lot of work is being done to address the bufferbloat problem, but even more remains to be done.

"Bufferbloat" is the problem of excessive buffering used at all layers of the network, from applications down to the hardware itself. Large buffers can create obvious latency problems (try uploading a large file from a home network while somebody else is playing a fast-paced network game and you'll be able to measure the latency from the screams of frustration in the other room), but the real issue is deeper than that. Excessive buffering wrecks [Jim Gettys] the control loop that enables implementations to maximize throughput without causing excessive congestion on the net. The experience of the late 1980's showed how bad a congestion-based collapse of the net can be; the idea that bufferbloat might bring those days back is frightening to many.

The initial source of the problem, Jim said, was the myth that dropping packets is a bad thing to do, combined with the fact that it is no longer possible to buy memory in small amounts. The truth of the matter is that the timely dropping of packets is essential; that is how the network signals to transmitters that they are sending too much data. The problem is complicated by the use of the bandwidth-delay product to size buffers. Nobody really knows what either the bandwidth or the delay is for a typical network connection. Networks vary widely; wireless networks can be made to vary considerably just by moving across the room. In this environment, he said, no static buffer size can ever be correct, but that is exactly what is being used at many levels.
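A little arithmetic shows how widely the bandwidth-delay product can swing. The link speeds and round-trip times below are made-up illustrations, not measurements from the talk, and the function name is hypothetical.

```python
def bdp_bytes(bandwidth_bps, rtt_ms):
    """Bandwidth-delay product: bytes needed to keep the link full."""
    return bandwidth_bps * rtt_ms / 8000  # bits/s * ms -> bytes


if __name__ == "__main__":
    links = [
        ("1 Mbit/s uplink, 20 ms", 1_000_000, 20),
        ("20 Mbit/s wireless, 50 ms", 20_000_000, 50),
        ("100 Mbit/s, 100 ms", 100_000_000, 100),
    ]
    for name, bps, rtt in links:
        print(f"{name}: {bdp_bytes(bps, rtt):,.0f} bytes")
```

The three cases span from 2,500 bytes to 1.25 million bytes, a factor of 500, which is why a buffer sized for one of them is badly wrong for the others.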

As a result, things are beginning to break. Protocols that cannot handle much in the way of delay or loss - DNS, ARP, DHCP, VOIP, or games, for example - are beginning to suffer. A large proportion of broadband links, Jim said, are "just busted." The edge of the net is broken, but the problem is more widespread than that; Jim fears that bloat can be found everywhere.

If static buffer sizes cannot work, buffers must be sized dynamically. The RED (random early detection) algorithm is meant to do that sizing, but it suffers from one little problem: it doesn't actually work. The problem, Jim said, is that the algorithm knows about the size of a given buffer, but it knows nothing about how quickly that buffer is draining. Even so, it can improve things in some situations. But it requires quite a bit of tuning to work right, so a lot of service providers simply do not bother. Efforts to create an improved version of RED are underway, but the results are not yet available.
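For readers unfamiliar with RED, the core of its textbook form is small enough to sketch. The version below omits refinements such as spacing out drops by counting packets since the last one; the parameter names (`min_th`, `max_th`, `max_p`, `weight`) follow common usage, and the values shown are arbitrary rather than recommended settings.

```python
def update_average(avg, queue_len, weight=0.002):
    """Exponentially weighted moving average of the queue length."""
    return (1 - weight) * avg + weight * queue_len


def drop_probability(avg, min_th, max_th, max_p):
    """Drop probability as a function of the average queue length only."""
    if avg < min_th:
        return 0.0
    if avg >= max_th:
        return 1.0
    return max_p * (avg - min_th) / (max_th - min_th)


if __name__ == "__main__":
    for avg in (5, 20, 30):
        print(avg, drop_probability(avg, min_th=10, max_th=30, max_p=0.1))
```

Every input here is a queue length in packets. Nothing says whether a 20-packet queue will drain in a millisecond or a second, which is the blind spot Jim described, and the thresholds are the sort of tuning knobs that service providers tend to leave alone.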

A real solution to bufferbloat will have to be deployed across the entire net. There are some things that can be done now; Jim has spent a lot of time tweaking his home router to squeeze out excessive buffering. The result, he said, involved throwing away a bit of bandwidth, but the resulting network is a lot nicer to use. Some of the fixes are fairly straightforward; Ethernet buffering, for example, should be proportional to the link speed. Ring buffers used by network adapters should be reviewed and reduced; he found himself wondering why a typical adapter uses the same size for the transmit and receive buffers. There is also an extension to the DOCSIS standard in the works to allow ISPs to remotely tweak the amount of buffering employed in cable modems.

A complete solution requires more than that, though. There are a lot of hidden buffers out there in unexpected places; many of them will be hard to find. Developers need to start thinking about buffers in terms of time, not in terms of bytes or packets. And we'll need active queue management in all devices and hosts; the only problem is that nobody really knows which queue management algorithm will actually solve the problem. Steve Hemminger noted that there are no good multi-threaded queue-management algorithms out there.

CeroWRT

Jim yielded to Dave Täht, who talked about the CeroWRT router distribution. Dave pointed out that, even when we figure out how to tackle bufferbloat, we have a small problem: actually getting those fixes to manufacturers and, eventually, users. A number of popular routers are currently shipping with 2.6.16 kernels; it is, he said, the classic embedded Linux problem.

One router distribution that is doing a better job of keeping up with the mainline is OpenWRT. Appropriately, CeroWRT is based on OpenWRT; its purpose is to complement the debloat-testing kernel tree and provide a platform for real-world testing of bufferbloat fixes. The goals behind CeroWRT are to always be within a release or two of the mainline kernel, to provide reproducible results for network testing, and to be reliable enough for everyday use while being sufficiently experimental to accept new stuff.

There is a lot of new stuff in CeroWRT. It has fixes to the packet aggregation code used in wireless drivers that can, in its own right, be a source of latency. The length of the transmit queues used in network interfaces has been reduced to eight packets - significantly smaller than the default values, which can be as high as 1000. That change alone is enough, Dave said, to get quality-of-service processing working properly and, he thinks, to push the real buffering bottleneck to the receive side of the equation. CeroWRT runs a tickless kernel, and enables protocol extensions like explicit congestion notification (ECN), selective acknowledgments (SACK), and duplicate SACK (DSACK) by default. A number of speedups have also been applied to the core netfilter code.
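To see why the queue length matters, it helps to convert packets into time. The sketch below assumes full-sized 1500-byte packets and a handful of illustrative link speeds; the function name is hypothetical.

```python
def queue_delay_ms(packets, link_bps, packet_bytes=1500):
    """Time for a full transmit queue to drain onto the link."""
    return packets * packet_bytes * 8 * 1000 / link_bps


if __name__ == "__main__":
    for name, bps in [("1 Gbit/s", 1_000_000_000),
                      ("10 Mbit/s", 10_000_000),
                      ("1 Mbit/s", 1_000_000)]:
        print(f"{name}: 1000 packets = {queue_delay_ms(1000, bps):,.1f} ms, "
              f"8 packets = {queue_delay_ms(8, bps):,.1f} ms")
```

A 1000-packet queue is 12 ms of buffering at gigabit speed but 1.2 seconds at 10 Mbit/s and 12 seconds at 1 Mbit/s; eight packets come to under 10 ms at 10 Mbit/s. The same number of packets is harmless on one link and crippling on another.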

CeroWRT also includes a lot of interesting software, including just about every network testing tool the developers could get their hands on. Six TCP congestion algorithms are available, with Westwood used by default. Netem (a network emulator package) has been put in to allow the simulation of packet loss and delay. There is a bind9 DNS server with an extra-easy DNSSEC setup. Various mesh networking protocols are supported. A lot of data collection and tracing infrastructure has been added from the web10g project, but Dave has not yet found a real use for the data.

All told, CeroWRT looks like a useful tool for validating work done in the fight against bufferbloat. It has not yet reached its 1.0 release, though; there are still some loose ends to tie up and some problems to be fixed. For now, it only works on the Netgear WNDR3700v2 router - chosen for its open hardware and relatively large amount of flash storage. CeroWRT should be ready for general use before too long; fixing the bufferbloat problem is likely to take rather longer.

[Your editor would like to thank LWN's subscribers for supporting his travel to LPC 2011.]


Page editor: Jonathan Corbet

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds