Kernel development

Brief items

Kernel release status

The 2.6.38 merge window is open, so there is no current development kernel. See the separate article, below, for a summary of what has been merged to date.

Stable updates: the 2.6.36.3 and 2.6.32.28 updates were released on January 7; both contain a large number of important fixes.

The 2.6.34.8 update was released on January 6; it, too, contains lots of important fixes.

Quotes of the week

A golden rule is that when a programmer reads some code, he should be able to understand why it's there. There is no way on this little earth that a programmer will be able to look at this code and say "ah-hah, that must be a workaround for gcc-4.5 -finline-functions!".
-- Andrew Morton

With almost every single kernel version bump we've had, we've found numerous places where strings in the logs have subtly changed, breaking our infrastructure :( Finding these is a painful process and often shows up late in the deployment process. This is disastrous to our deployment schedules as rebooting our number of machines, well, it takes a while...

I'm sorry if this is tangential to your comment above, but I feel the need to have folks recognize that printk() is _not_ a good foundation on which to build automation.

-- Mike Waychison

The end of dcache_lock

By Jonathan Corbet
January 12, 2011
One of the most significant changes for 2.6.38 will not be an obviously user-visible feature; it is, instead, the dcache scalability patch set by Nick Piggin. These patches rework the virtual filesystem layer in some tricky ways, eliminating a number of longstanding scalability problems. The changes were significant enough that Linus paused the merge window for a couple of days after merging them, saying:

It's scary because this is some very core code, and the new RCU lookup model is way more clever and subtle than the old dcache_lock spinlock.

But it's rather impressive and I really wanted to merge it, because some of the performance numbers are pretty stunning. For example, a hot-cache "find . -size" on my home directory (which basically just does name lookups to get the stat information for every file recursively) became 35% faster. And that's the _unthreaded_ case. Not some odd high-end scalability thing, and not some recompiled binary taking advantage of new facilities. Pathname lookup is just simply faster.

This code mostly works, but early testers have reported an issue or two, and some subtle problems are likely to remain. This might be a good time for people running development kernels to have especially good backups. Also, as Nick noted, out-of-tree filesystems will need some changes to work with the new VFS.

How not to get a protocol implementation merged

By Jonathan Corbet
January 12, 2011
The "Open Base Station Architecture Initiative" is a consortium of companies which are trying to create an open market for cellular base station hardware. One of the things this initiative has defined is the UDPCP protocol - a UDP-based protocol used for communications between base stations. UDPCP offers reliable transfer, multicast, and more. The Linux kernel does not currently support UDPCP, but Stefani Seibold has posted a patch which would add that support to the kernel's network stack.

There have been a number of comments about this code, but one observation by Eric Dumazet is noteworthy: the posted implementation only works with IPv4. The networking developers have made it clear that they are uninterested in accepting an IPv4-only implementation in 2011; IPv6 support is required for any new code.

Stefani responded that no base stations currently provided IPv6 functionality and no customers were interested, so there was no point in adding that support at this time. The answer didn't change, though; the networking developers have no interest in merging code which is guaranteed to need fixing in the near future. Stefani has described this requirement as "dogmatic," but she also seems to have realized that it's not going to go away. So UDPCP stays out of the mainline for now, but we will, hopefully, eventually see a reworked version with support for IPv6.

Kernel development news

2.6.38 merge window part 1

By Jonathan Corbet
January 12, 2011
As of this writing, almost 5400 non-merge changesets have been pulled into the mainline for the 2.6.38 development cycle. This cycle looks to be another busy one, with a fair emphasis on the reworking of low-level infrastructure. User-visible changes merged so far include:

  • The per-session group scheduling patch has been merged. This change should yield better interactive response under a number of workloads.

  • The dcache scalability patch set has been merged. This tricky code can yield significant performance improvements for some types of filesystem-heavy workloads.

  • Kernel modules are finally loaded with read-only code on the x86 architecture; data is now non-executable across the entire kernel. See the article on extending the use of RO and NX, below, for more information on this change.

  • Transmit packet steering is now supported by the networking layer. This feature improves transmit performance by placing outgoing data on the proper (CPU-local) queue.

  • Support for the batman-adv mesh networking protocol has graduated from the staging tree and is now part of the main kernel.

  • The trusted and encrypted keys patch set has been merged.

  • The ext3 filesystem has gained support for batched discard operations and the FITRIM ioctl().

  • Emulation for the Video4Linux1 API has been removed from the kernel; any remaining V4L1 applications will need to be supported via a user-space library or converted to V4L2. Some unsupported V4L1 drivers (cpia, stradis) have been removed. The deprecated ibmcam, konicawc, and ultracam drivers have also been removed; those devices are now supported with GSPCA drivers.

  • New drivers include:

    • Systems and processors: Marvell PXA955 processors, Buffalo Linkstation Live v3 (LS-CHL) NAS drives, AM3517/05 CRANE boards, and SH-Mobile AG5 (R8A73A00) processors.

    • Block: MegaRAID 9265/9285 controllers, Acard ATP8620 host controllers, Marvell Dove SDHCI controllers, JMicron 388 SD/MMC controllers, Tegra SD/MMC controllers, and Synopsys DesignWare memory card interfaces.

    • Input: VTI CMA3000 Tri-axis accelerometers, Renesas ST1232 touchscreen controllers, Roccat Kone[+] gaming mice, Synaptics TM1217 touchscreen controllers, Synaptics i2c rmi4 touchscreens, and vast numbers of Analog Devices inclinometers, capacitive sensors, gyroscopes, etc.

    • Networking: USB NCM (network control model) class devices, USB serial CAN adapters using the LAWICEL ASCII protocol, Ralink RT33xx wireless chipsets (though the driver does not actually work yet), and Realtek RTL8192CE/RTL8188SE wireless network adapters.

    • Miscellaneous: SHARP LQ035Q7DB03 TFT LCD framebuffer devices, Blackfin ADV7393 video encoders, PXA3xx 2D graphics accelerators, TC3589X keypad controller modules, VIA VT8500 on-chip serial ports, Infineon 6x60 modems, Intel EG20T UARTs, Sensirion SHT21 humidity and temperature sensors, Dallas Semiconductor DS620 sensors, and Discretix SEP security processors.

    • USB: Marvell PXA9xx Processor USB2.0 controllers, Intel EG20T (Topcliff) USB device controllers, ST-Ericsson AB8500 USB transceivers, Qualcomm on-chip USB OTG controllers, and TI twl6030-usb transceivers.

    • Video4Linux: Fujitsu mb86a20s ISDB-T/ISDB-Tsb demodulators, Timberdale "Video In" devices, TI WL1273 FM radios, and OmniVision OV2640 sensors.

Changes visible to kernel developers include:

  • Flags can now be specified for tracepoints with the TRACE_EVENT_FLAGS() macro. The initial flag of interest is TRACE_EVENT_FL_CAP_ANY, which allows the tracepoint to be used by unprivileged users; this flag has been applied to the system call tracepoints.

  • The perf trace command has been renamed to perf script.

  • Conditional tracepoints are now supported by the kernel.

  • A number of tracepoints relating to power management events have been changed; see Documentation/trace/events-power.txt for more information. New tracepoints have been added to the RCU subsystem and the Radeon video driver.

  • The cancel_rearming_delayed_work() and cancel_rearming_delayed_workqueue() functions are deprecated and will be removed in 2.6.39.

  • The kernel build system is now able to link device tree blobs directly into the kernel image.

  • There is a new capability bit (CAP_SYSLOG) which controls access to the system log.

  • A new function, kref_sub(), allows code to return multiple references in a single call.

  • The external data representation (XDR) API has changed to incorporate the "xdr_stream" concept; all in-kernel users have been updated. XDR streams provide an improved resistance to buffer overflows, increasing the security of protocol implementations using XDR.

  • The "timerlist" infrastructure has been added for kernel subsystems which must manage lists of timers. See <linux/timerlist.h> for an overview of the API.

  • The direct rendering core has gained the ability to handle precise vertical blanking timestamps; this feature has been used by some drivers to improve their OpenML extension conformance.

The merge window can be expected to remain open until roughly January 19. That leaves plenty of time for other interesting features to find their way into the mainline; next week we'll summarize changes for the second half of the 2.6.38 merge window.

The CHOKe packet scheduler

By Jonathan Corbet
January 11, 2011
A packet on the network typically passes through several machines on the way from its source to its destination. One of those machines (or, more correctly, one of the outbound links from one of those machines) will be the limiting factor on how many packets can traverse that path in a given period of time. If a system tries to send too many packets through the limiting link, the packet queue on the router attached to that link will grow. A growing queue affects other users of that router and will eventually hit its limits, causing packet loss.

The TCP protocol has, for many years, included congestion control algorithms which attempt to determine the carrying capacity of a path and to avoid exceeding that capacity. These algorithms have successfully prevented a repeat of the meltdowns which plagued the early Internet. But congestion control isn't working as well as it should, for a few reasons. Some TCP implementations are more dutiful than others when it comes to congestion control. An increasing amount of traffic on the net uses other protocols (UDP in particular) which do not have congestion control built into them. Excessive queue sizes in routers ("bufferbloat") can also disguise congestion problems until it is too late. All of these problems are motivating a search for better ways of controlling congestion.

The key signal for congestion control on the net is dropped packets; TCP will continue to ramp up its transmit rate until the occasional lost packet makes it clear that the limit has been hit. So the way for a router in the middle of the network to tell a specific sender that it's transmitting too much data is to drop some of that data on the floor. The idea is simple in concept, but it can be harder in practice for a simple reason: network routers can deal with many thousands of packets every second; they cannot afford to spend significant amounts of time on any one of those packets.
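
The ramp-up-until-loss behavior described above is often modeled as "additive increase, multiplicative decrease" (AIMD). What follows is a minimal sketch, not a description of any real TCP stack: it assumes a fixed path capacity and a loss whenever the window exceeds it, and the function name and parameters are invented for illustration. Real implementations add slow start, timeouts, and much more.

```python
def aimd_trace(capacity, rounds, start=1.0, increase=1.0, decrease=0.5):
    """Return the sender's window after each round trip.

    Toy model (assumption): a packet is lost whenever the window exceeds
    `capacity`; the sender answers a loss by scaling the window by
    `decrease`, and otherwise grows it by `increase` per round trip.
    """
    window = start
    trace = []
    for _ in range(rounds):
        if window > capacity:
            window = max(1.0, window * decrease)  # loss seen: back off
        else:
            window += increase                    # no loss: keep probing
        trace.append(window)
    return trace


# The window climbs past the capacity, is cut back, and climbs again.
trace = aimd_trace(capacity=10, rounds=30)
```

The resulting sawtooth is why a dropped packet works as a signal: the sender only learns where the limit is by running into it.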

An obvious way for a router to schedule packets would be to maintain a queue for every flow (source/destination pair with port numbers) through the system. Packets could then be dequeued and transmitted with absolute fairness, and, any time the queue for a flow gets too long, packets could be dropped from that queue. Implementing this algorithm would require some complex data structures and a fair amount of processor time, though, so it is not an option for a router which handles a significant amount of traffic.
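
To make the cost of that approach concrete, here is a rough sketch of per-flow queueing with round-robin service. The class and method names are hypothetical, and the flow key is assumed to be a (source address, source port, destination address, destination port) tuple. The point to notice is that the scheduler must keep one queue for every active flow.

```python
from collections import OrderedDict, deque


class PerFlowScheduler:
    """Toy per-flow queueing: one FIFO per flow, served round-robin."""

    def __init__(self, per_flow_limit):
        if per_flow_limit < 1:
            raise ValueError("per_flow_limit must be at least 1")
        self.per_flow_limit = per_flow_limit
        self.queues = OrderedDict()   # flow key -> deque of packets

    def enqueue(self, flow, packet):
        """Queue a packet; return False if it was dropped."""
        queue = self.queues.get(flow)
        if queue is None:
            queue = self.queues[flow] = deque()
        if len(queue) >= self.per_flow_limit:
            return False              # only the overlong flow loses packets
        queue.append(packet)
        return True

    def dequeue(self):
        """Send one packet from the flow at the head of the rotation."""
        if not self.queues:
            return None
        flow, queue = next(iter(self.queues.items()))
        packet = queue.popleft()
        del self.queues[flow]
        if queue:
            self.queues[flow] = queue  # move the flow to the back
        return flow, packet


sched = PerFlowScheduler(per_flow_limit=2)
hog = ("10.0.0.1", 5000, "10.0.0.9", 80)
mouse = ("10.0.0.2", 5001, "10.0.0.9", 80)
results = [sched.enqueue(hog, n) for n in range(4)]  # the hog sends 4 packets
sched.enqueue(mouse, "m0")
order = [sched.dequeue() for _ in range(3)]
```

In this run the hog's third and fourth packets are dropped, and the single packet from the other flow is sent between the hog's two queued packets. The fairness is bought with state that grows with the number of flows.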

An alternative is the CHOKe algorithm; CHOKe stands either for "CHOose and Kill" or "CHOose and Keep," depending on one's attitude toward the problem. Stephen Hemminger has recently posted a CHOKe implementation for Linux, so this seems like a good time to look at how this algorithm works.

CHOKe is intended for points where multiple flows come together - routers and bridges, primarily. The idea behind CHOKe is to keep the length of transmit queues under control and to penalize flows with excessive traffic while avoiding the need to maintain any sort of per-flow state. To that end, the packet queuing algorithm works essentially like the following. When a packet arrives for a given outbound link, the CHOKe code will:

  • Calculate a moving average of the length of the queue. The algorithm includes a parameter for the period over which the average is calculated; a longer period will allow longer load spikes before the algorithm starts CHOKing traffic.

  • If the average queue length is below a minimum watermark, there is no problem with congestion, so the packet will simply be queued and the job is done.

  • If the average queue length is above the minimum, the CHOKe algorithm picks a random packet from the queue. If that packet belongs to the same flow as the packet under consideration, both packets will be dropped. When a randomly-picked packet comes from the same flow, chances are good that packets from that flow occupy a substantial amount of queue space, so that flow is likely to be a source of the problem.

  • If the packets are from different flows, but the average queue length is above an administrator-set maximum, then the new packet (only) will be dropped.

  • In the final case, the algorithm calculates a probability that the packet will be dropped, even though the maximum queue length has not been reached. The probability grows as the queue length increases, but, by default, it remains low - about 2% at the maximum. Thus, for mid-length queues, the algorithm will occasionally send a signal to a transmitter that it should back off a bit, but most packets will be queued normally.
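
The steps above can be sketched in a few lines of Python. This is an illustration of the algorithm as described here, not the posted kernel code. The class name, the default values, the exponentially weighted average, and the hard `limit` on the real queue length are all assumptions made for the example, as is the choice to compare both thresholds against the moving average.

```python
import random
from collections import deque


class ChokeQueue:
    """Toy CHOKe queue; entries are (flow, payload) tuples."""

    def __init__(self, min_th, max_th, max_p=0.02, weight=0.1,
                 limit=1000, rng=None):
        if not min_th < max_th:
            raise ValueError("min_th must be below max_th")
        self.min_th, self.max_th = min_th, max_th
        self.max_p = max_p        # drop probability at the maximum
        self.weight = weight      # larger weight = shorter averaging period
        self.limit = limit        # hard cap on the real queue (assumption)
        self.avg = 0.0
        self.queue = deque()
        self.rng = rng or random.Random()

    def _admit(self, flow, payload):
        if len(self.queue) >= self.limit:
            return False
        self.queue.append((flow, payload))
        return True

    def enqueue(self, flow, payload):
        """Return True if the packet was queued, False if it was dropped."""
        # Update the moving average of the queue length.
        self.avg += self.weight * (len(self.queue) - self.avg)

        # Below the minimum watermark: no congestion, just queue it.
        if self.avg < self.min_th:
            return self._admit(flow, payload)

        # Compare against a random queued packet; on a match, drop both.
        if self.queue:
            victim = self.rng.randrange(len(self.queue))
            if self.queue[victim][0] == flow:
                del self.queue[victim]
                return False

        # Different flows, but above the maximum: drop the new packet only.
        if self.avg > self.max_th:
            return False

        # In between: drop with a probability that grows toward max_p.
        span = self.max_th - self.min_th
        drop_p = self.max_p * (self.avg - self.min_th) / span
        if self.rng.random() < drop_p:
            return False
        return self._admit(flow, payload)

    def dequeue(self):
        return self.queue.popleft() if self.queue else None
```

No per-flow table appears anywhere in the sketch; the only state is the queue itself and one running average, which is what makes the approach cheap enough for a busy router.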

The key feature of CHOKe - the one which distinguishes it from RED (from which it is derived) - is the check against a random packet in the queue. That is a heuristic mechanism for identifying problematic flows without actually having to track what each flow is doing. Experience, as reported in the CHOKe paper, suggests that it works pretty well.

An important factor in successful use of CHOKe in the real world will be careful selection of the controlling parameters: the minimum and maximum queue lengths, average period, and drop probability. In particular, there is mounting evidence (thanks to the efforts by Jim Gettys) that overly long queues lead to all kinds of pathological network behavior, and could even threaten a net collapse at some point. Use of algorithms like CHOKe, combined with reasonably-sized queues, could help keep the Internet working well into the future.

Extending the use of RO and NX

By Jake Edge
January 12, 2011

Pages of memory that are managed by the kernel are governed by access control flags that are somewhat analogous to the permissions which are applied to files. Those flags govern whether the page can be written to and whether its contents can be executed. Both attributes are useful to restrict what can happen to those pages in the presence of programming errors or security attacks. A pair of patches that were merged in the current merge window will further extend the usage of these flags for the x86 architecture.

The page access flags, unlike file permissions, are enforced by the memory management hardware. The flags of interest for these patches are "write" and "execute", both of which imply "read" access, so they are often specified as follows: RO+X (read-only and execute) or RW+NX (read-write and no-execute). By restricting the usage of these pages, the scope of security flaws can be reduced because, for example, a buffer overflow in an NX page will not be directly useful for code execution.

The memory that the kernel uses to hold its read-only data (i.e. the .rodata segment) could be marked read-only starting with 2.6.16 in early 2006, depending on the setting of CONFIG_DEBUG_RODATA. In 2.6.25, the kernel .rodata segment was additionally marked NX (i.e. no-execute), but only for the x86_64 architecture. A patch that was originally created for 2.6.30 (for both the 32- and 64-bit x86 architectures) expanded the use of NX to all kernel data pages, including read-write sections for initialized data and BSS.

That patch was created by Siarhei Liakh and Xuxian Jiang but had fallen by the wayside after causing some boot crashes on one of Ingo Molnar's test systems. When Kees Cook brought up the idea of doing better page access protection of the kernel's memory, Molnar remembered that Matthieu Castet had "dusted off those patches and submitted two of them", back in August. After a few iterations, Molnar pulled them into the -tip tree, and Linus Torvalds pulled that for the mainline in the current 2.6.38 merge window.

The revised patch itself is fairly straightforward. If CONFIG_DEBUG_RODATA is set, various sections of the kernel (.text and .rodata) are page aligned for both their start and end addresses. The NX bit is set for all pages from the end of the .text (i.e. code) section to the _end address that marks the end of the kernel's data section.
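
As a rough model of the resulting layout, the sketch below maps an address within the kernel image to its page permissions. It is only an illustration: the function name and the example addresses are invented, and it assumes that .rodata directly follows .text and that all boundaries are already page aligned.

```python
PAGE_SIZE = 4096


def kernel_page_flags(addr, text_start, text_end, rodata_end, image_end):
    """Return (writable, executable) for the page containing `addr`.

    Assumed layout: [.text][.rodata][.data and .bss], ending at
    `image_end` (the _end address), with page-aligned boundaries.
    """
    for boundary in (text_start, text_end, rodata_end):
        if boundary % PAGE_SIZE:
            raise ValueError("section boundaries must be page aligned")
    if not text_start <= addr < image_end:
        raise ValueError("address outside the kernel image")
    if addr < text_end:
        return (False, True)      # .text: RO+X
    if addr < rodata_end:
        return (False, False)     # .rodata: RO+NX
    return (True, False)          # everything up to _end: RW+NX


# Hypothetical addresses, chosen only for the example.
TEXT_START, TEXT_END = 0x1000000, 0x1400000
RODATA_END, IMAGE_END = 0x1600000, 0x1900000

text_flags = kernel_page_flags(0x1000010, TEXT_START, TEXT_END,
                               RODATA_END, IMAGE_END)
data_flags = kernel_page_flags(0x1700000, TEXT_START, TEXT_END,
                               RODATA_END, IMAGE_END)
```

No page in this model is both writable and executable, which is the property the patch is after for the kernel image itself.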

There were two other pieces of the puzzle addressed in the patch, the first of which was presumably the cause of the boot crashes that Molnar had with the earlier patch. Some older systems that use PCI BIOS require that some pages in the 640K-1M region be executable. There are also some ISA mappings that require read-write access to that region. Rather than try to work all of that out, and potentially run afoul of buggy hardware, the patch just sets pages in that region to be RW+X on systems where PCI BIOS is used. The second change simply modifies free_init_pages() to turn on NX for any pages that are freed that way, so that those pages have to be explicitly allowed to store executable code when they are reused.

A related patch adds read-only and no-execute flags to the pages used by kernel modules. It came from the same developers, and seems to have been dropped from -tip along with the NX patch. As with the other patch, Castet gave it the final push needed to get it into the mainline.

The patch splits the module_core and module_init regions into three parts: code, read-only data, and read-write data. Each of those parts is page aligned and the page access permissions are set just before load_module() returns. For the code pieces, RO+X are set, while the data parts get NX and either RO or RW depending on the type of data. These changes are all governed by the setting of CONFIG_DEBUG_SET_MODULE_RONX.
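
The split can be pictured with a small calculation. This sketch is an assumption-laden model rather than the kernel's code: the function name, the 4096-byte page size, and the example section sizes are all made up. It lays the three parts out on page boundaries, labels each with its permissions, and adds up the padding that alignment costs.

```python
PAGE_SIZE = 4096


def page_align(size):
    """Round `size` up to a whole number of pages."""
    return (size + PAGE_SIZE - 1) // PAGE_SIZE * PAGE_SIZE


def split_module_region(code_size, rodata_size, rwdata_size):
    """Return ([(part, offset, length, flags), ...], padding_bytes)."""
    parts = [("code", code_size, "RO+X"),
             ("rodata", rodata_size, "RO+NX"),
             ("rwdata", rwdata_size, "RW+NX")]
    layout, offset, padding = [], 0, 0
    for name, size, flags in parts:
        length = page_align(size)       # each part gets whole pages
        layout.append((name, offset, length, flags))
        padding += length - size        # bytes lost to alignment
        offset += length
    return layout, padding


# A hypothetical module: 5000 bytes of code, 300 of read-only data,
# and 9000 of read-write data.
layout, padding = split_module_region(5000, 300, 9000)
```

For these sizes the region occupies six pages, and a little over 10KB of it is padding; that padding is the price of being able to give each part its own permissions.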

Beyond setting the page access control flags at module load time, the kernel must also reset those flags to RW+NX when the module is unloaded. In addition, the module_init region is freed after initialization is completed and its pages need to be put back to RW+NX. There is one further wrinkle: Ftrace needs to be able to modify the code in modules to enable tracepoints, so the patch provides a means for all module text pages to be set RW while Ftrace is making those changes, and then to set them back to RO afterward.
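
The Ftrace arrangement amounts to a bracketed permission change: make the text writable, patch it, and put the protection back even if something goes wrong. The sketch below models that pattern only; `ModuleText` and `text_writable` are hypothetical names, and none of this corresponds to actual kernel interfaces.

```python
from contextlib import contextmanager


class ModuleText:
    """Toy stand-in for a module's text pages and their write permission."""

    def __init__(self, code):
        self.code = bytearray(code)
        self.writable = False            # text is normally RO+X

    def patch(self, offset, new_bytes):
        if not self.writable:
            raise PermissionError("module text is read-only")
        self.code[offset:offset + len(new_bytes)] = new_bytes


@contextmanager
def text_writable(modules):
    """Temporarily make every module's text RW, then restore RO."""
    for mod in modules:
        mod.writable = True
    try:
        yield
    finally:
        for mod in modules:
            mod.writable = False         # always put the protection back


mod = ModuleText(b"\x90" * 8)
with text_writable([mod]):
    mod.patch(0, b"\xe8\x00\x00\x00\x00")   # e.g. enabling a trace call site
```

Outside the `with` block, a call to `patch()` raises `PermissionError`, which mirrors the intent that module text stays read-only except during the brief window when the tracer is rewriting it.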

Marking the kernel module pages as RO and/or NX is important not only because it is consistent with how the rest of the kernel pages are handled, but also because it makes other kernel protection efforts actually work for modules. For example, there has been an effort to declare structures of function pointers as const, so that exploits cannot change the pointers for their own nefarious purposes, but that only works if the .rodata pages are actually marked RO.

The main cost of these patches is some bits of wasted memory from page aligning the various sections. Since that cost is probably not significant for any but the most resource-constrained embedded systems, it would make sense for CONFIG_DEBUG_RODATA and CONFIG_DEBUG_SET_MODULE_RONX to be turned on for most distributions—or to default to "on", though that is generally frowned upon by Torvalds and others.

It is unfortunate that these patches have been around for a while without ever quite making the jump into the mainline. There is no real person or group that is currently shepherding core kernel security patches along, though Cook and Dan Rosenberg have recently been making an effort to push these kinds of changes. Cook's query helped resurrect both of these patches; they might have languished far longer without that interest.

It is also worth noting that most or all of the protections embodied in these patches have long been available in the grsecurity/PaX kernels. While no wholesale import of the features from those kernels is ever going to happen, piecemeal patches that implement "sane" (at least in Torvalds's eyes) features can be adopted. That should lead to better kernel security, which is something that is certainly worth shooting for.

Page editor: Jonathan Corbet

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds