The 2.6.38 merge window is open, so there is no current development
kernel. See the separate article, below, for a summary of what has been
merged to date.
Stable updates: the 188.8.131.52 and 184.108.40.206 updates were released on
January 7; both contain a large number of important fixes.
The 220.127.116.11 update was released on
January 6; it, too, contains lots of important fixes.
A golden rule is that when a programmer reads some code, he should
be able to understand why it's there. There is no way on this
little earth that a programmer will be able to look at this code
and say "ah-hah, that must be a workaround for gcc-4.5
-- Andrew Morton
With almost every single kernel version bump we've had, we've found
numerous places where strings in the logs have subtly changed,
breaking our infrastructure :( Finding these is a painful process
and often shows up late in the deployment process. This is
disastrous to our deployment schedules as rebooting our number of
machines, well, it takes a while...
I'm sorry if this is tangential to your comment above, but I feel
the need to have folks recognize that printk() is _not_ a good
foundation on which to build automation.
-- Mike Waychison
One of the most significant changes for 2.6.38 will not be an obviously
user-visible feature; it is, instead, the dcache scalability patch set
by Nick Piggin.
These patches rework the virtual filesystem layer in some tricky ways,
eliminating a number of longstanding scalability problems. The changes
were significant enough that Linus paused the merge window
for a couple of days after merging them, saying:
It's scary because this is some very core code, and the new RCU
lookup model is way more clever and subtle than the old dentry_lock
But it's rather impressive and I really wanted to merge it, because
some of the performance numbers are pretty stunning. For example, a
hot-cache "find . -size" on my home directory (which basically just
does name lookups to get the stat information for every file
recursively) became 35% faster. And that's the _unthreaded_
case. Not some odd high-end scalability thing, and not some
recompiled binary taking advantage of new facilities. Pathname
lookup is just simply faster.
This code mostly works, but early testers have reported an issue or two,
and there are likely to be some subtle problems remaining still. This
might be a good time for people running development kernels to have
especially good backups. Also, as Nick noted, out-of-tree filesystems will need some
changes to work with the new VFS.
The "Open Base Station Architecture Initiative" is a consortium of
companies which are trying to create an open market for cellular base
station hardware. One of the things this initiative has defined is the
UDPCP protocol - a UDP-based protocol used for communications between base
stations. UDPCP offers reliable transfer, multicast, and more. The Linux
kernel does not currently support UDPCP, but Stefani Seibold has posted a patch
which would add that support to the
kernel's network stack.
There have been a number of comments about this code, but one observation by Eric Dumazet is noteworthy: the
posted implementation only works with IPv4. The networking developers have
made it clear that they are uninterested in accepting an IPv4-only
implementation in 2011; IPv6 support is required for any new code.
Stefani responded that no base stations
currently provided IPv6 functionality and no customers were interested, so
there was no point in adding that support at this time. The answer didn't
change, though; the networking developers have no interest in merging code
which is guaranteed to need fixing in the near future. Stefani has
described this requirement as
"dogmatic," but she also seems to have realized that it's not
going to go away. So UDPCP stays out of the mainline for now, but we
will, hopefully, eventually see a reworked version with support for IPv6.
Kernel development news
As of this writing, almost 5400 non-merge changesets have been pulled into
the mainline for the 2.6.38 development cycle. This cycle looks to be
another busy one, with a fair emphasis on the reworking of low-level
infrastructure. User-visible changes merged so far include:
- The per-session group scheduling patch
has been merged. This change should yield better interactive response
under a number of workloads.
- The dcache scalability patch set has
been merged. This tricky code can yield significant performance
improvements for some types of filesystem-heavy workloads.
- Kernel modules are finally loaded with read-only code on the x86
architecture; data is now non-executable across the entire kernel.
See this article for more information
on this change.
- Transmit packet steering is now
supported by the networking layer. This feature improves transmit
performance by placing outgoing data on the proper (CPU-local) queue.
- Support for the batman-adv mesh networking protocol has graduated from
the staging tree and is now part of the main kernel.
- The trusted and encrypted keys patch
set has been merged.
- The ext3 filesystem has gained support for batched discard operations and the
- Emulation for the Video4Linux1 API has been removed from the kernel;
any remaining V4L1 applications will need to be supported via a
user-space library or converted to V4L2. Some unsupported V4L1
drivers (cpia, stradis) have been removed. The deprecated ibmcam,
konicawc, and ultracam drivers have also been removed; those devices
are now supported with GSPCA drivers.
- New drivers include:
- Systems and processors: Marvell PXA955 processors,
Buffalo Linkstation Live v3 (LS-CHL) NAS drives,
AM3517/05 CRANE boards, and
SH-Mobile AG5 (R8A73A00) processors.
- Block: MegaRAID 9265/9285 controllers,
Acard ATP8620 host controllers,
Marvell Dove SDHCI controllers,
JMicron 388 SD/MMC controllers,
Tegra SD/MMC controllers, and
Synopsys DesignWare memory card interfaces.
- Input: VTI CMA3000 Tri-axis accelerometers,
Renesas ST1232 touchscreen controllers,
Roccat Kone[+] gaming mice,
Synaptics TM1217 touchscreen controllers,
Synaptics i2c rmi4 touchscreens, and
vast numbers of Analog Devices inclinometers, capacitive
sensors, gyroscopes, etc.
- Networking: USB NCM (network control model) class devices,
USB serial CAN adapters using the LAWICEL ASCII protocol,
Ralink RT33xx wireless chipsets (though the driver does not
actually work yet), and
Realtek RTL8192CE/RTL8188SE wireless network adapters.
- Miscellaneous: SHARP LQ035Q7DB03 TFT LCD framebuffer devices,
Blackfin ADV7393 video encoders,
PXA3xx 2D graphics accelerators,
TC3589X keypad controller modules,
VIA VT8500 on-chip serial ports,
Infineon 6x60 modems,
Intel EG20T UARTs,
Sensirion SHT21 humidity and temperature sensors,
Dallas Semiconductor DS620 sensors, and
Discretix SEP security processors.
- USB: Marvell PXA9xx Processor USB2.0 controllers,
Intel EG20T (Topcliff) USB device controllers,
Altec Lansing AB8500 USB transceivers,
Qualcomm on-chip USB OTG controllers, and
TI twl6030-usb transceivers.
- Video4Linux: Fujitsu mb86a20s ISDB-T/ISDB-Tsb demodulators,
Timberdale "Video In" devices,
TI WL1273 FM radios, and
OmniVision OV2640 sensors.
Changes visible to kernel developers include:
- Flags can now be specified for tracepoints with the TRACE_EVENT_FLAGS() macro. The
initial flag of interest is TRACE_EVENT_FL_CAP_ANY, which
allows the tracepoint to be used by unprivileged users; this flag has
been applied to the system call tracepoints.
- The perf trace command has been renamed to perf
- Conditional tracepoints are now
supported by the kernel.
- A number of tracepoints relating to power management events have been
changed; see Documentation/trace/events-power.txt for more
information. New tracepoints have been added to the RCU subsystem and
the Radeon video driver.
- The cancel_rearming_delayed_work() and
cancel_rearming_delayed_workqueue() functions are deprecated
and will be removed in 2.6.39.
- The kernel build system is now able to link device tree blobs directly
into the kernel image.
- There is a new capability bit (CAP_SYSLOG) which controls
access to the system log.
- A new function, kref_sub(), allows code to return multiple
references in a single call.
- The external data representation (XDR) API has changed to incorporate
the "xdr_stream" concept; all in-kernel users have been
updated. XDR streams provide an improved resistance to buffer
overflows, increasing the security of protocol implementations using them.
- The "timerlist" infrastructure has been added for kernel subsystems
which must manage lists of timers. See
<linux/timerlist.h> for an overview of the API.
- The direct rendering core has gained the ability to handle precise
vertical blanking timestamps; this feature has been used by some
drivers to improve their OpenML extension conformance.
The merge window can be expected to remain open until roughly
January 19. That leaves plenty of time for other interesting features
to find their way into the mainline; next week we'll summarize changes for
the second half of the 2.6.38 merge window.
A packet on the network typically passes through several machines on the way from
its source to its destination. One of those machines (or, more correctly,
one of the outbound links from one of those machines) will be the limiting
factor on how many packets can traverse that path in a given period of
time. If a system tries to send too many packets through the limiting
link, the packet queue on the router attached to that link will grow. A
growing queue affects other users of that router and will eventually hit
its limits, causing packet loss.
The TCP protocol has, for many years, included congestion control
algorithms which attempt to determine the carrying capacity of a path and
to avoid exceeding that capacity. These algorithms have successfully prevented
a repeat of the meltdowns which plagued the early Internet. But congestion
control isn't working as well as it should, for a few reasons. Some TCP
implementations are more dutiful than others when it comes to congestion
control. An increasing amount of traffic on the net uses other protocols
(UDP in particular) which do not have congestion control built into them.
Excessive queue sizes in routers ("bufferbloat") can
also disguise congestion problems until it is too late. All of these
problems are motivating a search for better ways of controlling congestion.
The key signal for congestion control on the net is dropped packets; TCP
will continue to ramp up its transmit rate until the occasional lost packet
makes it clear that the limit has been hit. So the way for a router in the
middle of the network to tell a specific sender that it's transmitting too
much data is to drop some of that data on the floor. The idea is simple in
concept, but it can be harder in practice for a simple reason: network
routers can deal with many thousands of packets every second; they cannot
afford to spend significant amounts of time on any one of those packets.
An obvious way for a router to schedule packets would be to maintain a
queue for every flow (source/destination pair with port numbers) through
the system. Packets could then be dequeued and transmitted with absolute
fairness, and, any time the queue for a flow gets too long, packets could
be dropped from that queue. Implementing this algorithm would require some
complex data structures and a fair amount of processor time, though, so it
is not an option for a router which handles a significant amount of traffic.
An alternative is the CHOKe algorithm
[PS]; CHOKe stands either for "CHOose and Kill" or "CHOose and Keep,"
depending on one's attitude toward the problem. Stephen Hemminger has
recently posted a CHOKe implementation for
Linux, so this seems like a good time to look at how this algorithm works.
CHOKe is intended for points where multiple flows come together - routers
and bridges, primarily.
The idea behind CHOKe is to keep the length of transmit queues under
control and to penalize flows with excessive traffic while avoiding the
need to maintain any sort of per-flow state. To that end, the packet
queuing algorithm works essentially like the following. When a packet
arrives for a given outbound link, the CHOKe code will:
- Calculate a moving average of the length of the queue. The algorithm
includes a parameter for the period over which the average is
calculated; a longer period will allow longer load spikes before the
algorithm starts CHOKing traffic.
- If the average queue length is below a minimum watermark, there is no
problem with congestion, so the packet will simply be queued and
the job is done.
- If the queue length is above the minimum, the CHOKe algorithm picks a
random packet from the queue. If that packet belongs to the same flow
as the packet under consideration, both packets will be
dropped. When a randomly-picked packet comes from the same flow,
chances are good that packets from that flow occupy a substantial
amount of queue space, so that flow is likely to be a source of the congestion.
- If the packets are from different flows, but the queue length is above
an administrator-set maximum, then the new packet (only) will be dropped.
- In the final case, the algorithm calculates a probability that the
packet will be dropped, even though the maximum queue length has not
been reached. The probability grows as the queue length increases,
but, by default, it remains low - about 2% at the maximum. Thus, for
mid-length queues, the algorithm will occasionally send a signal to a
transmitter that it should back off a bit, but most packets will be
The key feature of CHOKe - the one which distinguishes it from RED (from
which it is derived) - is the check against a random packet in the queue.
That is a heuristic mechanism for identifying problematic flows without
actually having to track what each flow is doing. Experience, as reported
in the CHOKe paper, suggests that it works pretty well.
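The decision sequence described above can be sketched as a small Python model. This is only an illustration, not Stephen Hemminger's implementation: the class name `ChokeQueue`, its parameters, the use of an exponentially weighted moving average, the comparison of the average (rather than the instantaneous length) against both watermarks, and the linear ramp of the drop probability are all assumptions made for the sketch.

```python
import random
from collections import deque


class ChokeQueue:
    """Toy model of a CHOKe-managed transmit queue (hypothetical, for illustration)."""

    def __init__(self, min_th, max_th, max_p=0.02, weight=0.1, rng=None):
        self.min_th = min_th    # minimum watermark, in packets
        self.max_th = max_th    # administrator-set maximum, in packets
        self.max_p = max_p      # drop probability as the average nears max_th
        self.weight = weight    # smaller weight means a longer averaging period
        self.avg = 0.0          # moving average of the queue length
        self.queue = deque()    # one flow identifier per queued packet
        self.rng = rng or random.Random()

    def enqueue(self, flow):
        """Offer a packet from `flow`; return 'queued', 'dropped' or 'dropped-both'."""
        # Update the moving average of the queue length.
        self.avg = (1 - self.weight) * self.avg + self.weight * len(self.queue)

        # Below the minimum watermark there is no congestion: just queue it.
        if self.avg < self.min_th:
            self.queue.append(flow)
            return "queued"

        # Compare against one randomly chosen queued packet; a match suggests
        # that this flow holds a large share of the queue, so drop both.
        if self.queue:
            idx = self.rng.randrange(len(self.queue))
            if self.queue[idx] == flow:
                del self.queue[idx]
                return "dropped-both"

        # Different flows, but the queue is over the maximum: drop the new packet.
        if self.avg >= self.max_th:
            return "dropped"

        # Otherwise drop with a probability that grows with the average length
        # (assumed linear here), reaching max_p at the maximum.
        p = self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)
        if self.rng.random() < p:
            return "dropped"

        self.queue.append(flow)
        return "queued"
```

Note that no per-flow state is kept anywhere: the only information about flows is the identifier carried by each queued packet, which is what makes the random comparison cheap.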
An important factor in successful use of CHOKe in the real world will
be careful selection of the controlling parameters: the minimum and maximum
queue lengths, average period, and drop probability. In particular, there
is mounting evidence (thanks to the efforts by Jim Gettys) that overly long
queues lead to all kinds of pathological network behavior, and could even
threaten a net collapse at some point. Use of algorithms like CHOKe,
combined with reasonably-sized queues, could help keep the Internet working
well into the future.
Pages of memory that are
managed by the
kernel are governed by access control flags that are somewhat analogous to
the permissions which are applied to files. Those flags govern whether
the page can be written to and whether its contents can be executed. Both
attributes are useful to restrict what can happen to those pages in the
event of programming errors or security attacks. A pair of patches that were
merged in the current merge window will further extend the usage of these
flags for the x86 architecture.
The page access flags, unlike file permissions, are enforced by
the memory management hardware. The flags of interest for these patches
are "write" and "execute", both of which imply "read" access, so they are
often specified as follows: RO+X (read-only and execute) or RW+NX (read-write
and no-execute). By restricting the usage of these pages, the scope of
security flaws can be reduced because, for example, a buffer overflow in an
NX page will not be directly useful for code execution.
The memory that is used by the kernel to hold its read-only data (i.e. the
.rodata segment) has been able to be marked read-only since 2.6.16 in early 2006, depending
on CONFIG_DEBUG_RODATA. In 2.6.25, the kernel .rodata
segment was additionally marked NX (i.e. no-execute), but only for the
x86_64 architecture. A patch that was
originally created for 2.6.30 (for both the 32 and 64-bit x86 architectures)
expanded the use of NX for all kernel data pages, including read-write
sections for initialized
data and BSS.
That patch was created by Siarhei Liakh and Xuxian Jiang but had fallen by
the wayside after causing some boot
crashes on one of Ingo Molnar's test systems. When Kees Cook brought
up the idea of doing better page access protection of the kernel's memory,
it was pointed out that Matthieu Castet had "dusted off those patches and submitted two
of them", back in August. After a few iterations, Molnar pulled
them into the -tip tree, and Linus Torvalds pulled that for the mainline in
the 2.6.38 merge window.
The revised patch itself is fairly
straightforward. If CONFIG_DEBUG_RODATA is set, various sections
of the kernel (.text and .rodata) are page aligned for
both their start and end addresses.
The NX bit is set for all pages from the end of the .text
(i.e. code) section to the _end address that marks the end of the
kernel's data section.
There were two other pieces of the puzzle addressed in the patch, the first
of which was presumably the cause of the boot crashes that Molnar
had with the earlier patch. Some older systems that use PCI BIOS require
that some pages in the 640K-1M region be executable. There are also some
ISA mappings that require read-write access to that region. Rather than
try to work all of that out, and potentially run afoul of buggy hardware,
the patch just sets pages in that region to be RW+X on systems where PCI
BIOS is used. The second
change simply modifies free_init_pages() to turn on NX for any
pages that are freed that way, so that those pages have to be explicitly allowed to
store executable code when they are reused.
The second patch adds read-only and no-execute flags to the pages used by
kernel modules. It came from the same developers, and seems to have been
dropped from -tip along with the NX patch. And, like the other patch, Castet pushed it the
last bit to finally get it included in the mainline.
The patch splits the module_core and module_init regions into three parts: code,
read-only data, and read-write data. Each of those parts is page aligned
and the page access permissions are set just before load_module()
returns. For the code pieces, RO+X are set, while the
data parts get NX and either RO or RW depending on
the type of data.
These changes are all governed by the setting of CONFIG_DEBUG_SET_MODULE_RONX.
Beyond setting the page access control flags at module load time, the
kernel must also reset those flags to RW+NX when the module is unloaded.
In addition, the module_init region is freed after
initialization is completed and its pages need to be put back to RW+NX.
There is one further wrinkle: Ftrace needs to be able to modify the code
in modules to enable tracepoints, so the patch provides a means for all
module text pages to be set RW while Ftrace is making those changes, and
then to set them back to RO afterward.
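The life cycle of those permissions can be illustrated with a toy Python model. It is a sketch of the behavior described above and not kernel code: the names `Perm`, `ModuleRegion`, `describe()`, and `text_writable()` are invented for the example, and it assumes that the execute bit on module text is left untouched while Ftrace has the pages writable.

```python
from contextlib import contextmanager
from enum import Flag


class Perm(Flag):
    """Page access flags; read access is always implied."""
    NONE = 0
    WRITE = 1
    EXEC = 2


RO_X = Perm.EXEC     # read-only and executable
RO_NX = Perm.NONE    # read-only, no-execute
RW_NX = Perm.WRITE   # read-write, no-execute


def describe(perm):
    """Render a flag set in the RO+X / RW+NX notation used in the article."""
    access = "RW" if Perm.WRITE in perm else "RO"
    execute = "X" if Perm.EXEC in perm else "NX"
    return access + "+" + execute


class ModuleRegion:
    """Hypothetical model of one module region split into three parts."""

    def __init__(self):
        self.perms = {}

    def load(self):
        # Permissions applied just before the module load completes.
        self.perms = {"text": RO_X, "rodata": RO_NX, "data": RW_NX}

    @contextmanager
    def text_writable(self):
        # Temporarily allow writes to the code, as Ftrace needs; the
        # previous setting is restored even if the patching step fails.
        saved = self.perms["text"]
        self.perms["text"] = saved | Perm.WRITE
        try:
            yield
        finally:
            self.perms["text"] = saved

    def unload(self):
        # Everything goes back to RW+NX before the pages are reused.
        self.perms = {name: RW_NX for name in self.perms}
```

In this model, `describe(region.perms["text"])` gives "RO+X" after `load()`, "RW+X" inside a `with region.text_writable():` block, and "RW+NX" after `unload()`.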
Marking the kernel module pages as RO and/or NX is important not only because
it is consistent with how the rest of the kernel pages are handled, but
also because it makes other kernel protection efforts actually work for
modules. For example, there has been an effort to declare structures of function
pointers as const, so that exploits cannot change the pointers for
their own nefarious purposes, but that only works if the .rodata
pages are actually marked RO.
The main cost of these patches is some bits of wasted memory from page aligning
the various sections. Since that cost is probably not significant for any but
the most resource-constrained embedded systems, it would make sense for
CONFIG_DEBUG_RODATA and CONFIG_DEBUG_SET_MODULE_RONX to
be turned on for most distributions—or to default to "on", though
that is generally frowned upon by Torvalds and others.
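To get a feel for the size of that cost, the padding can be worked out with a few lines of Python. The 4096-byte page size and the section sizes below are made-up values chosen for illustration; real modules will differ.

```python
PAGE_SIZE = 4096  # assumed page size, in bytes


def page_align(size, page_size=PAGE_SIZE):
    """Round size up to the next multiple of the page size."""
    return (size + page_size - 1) // page_size * page_size


def alignment_waste(section_sizes, page_size=PAGE_SIZE):
    """Bytes of padding needed when each section is page aligned on its own."""
    return sum(page_align(size, page_size) - size
               for size in section_sizes.values())


# Hypothetical sizes for the three parts of one module region.
sizes = {"code": 9000, "rodata": 1200, "data": 300}

split_waste = alignment_waste(sizes)
unsplit_waste = page_align(sum(sizes.values())) - sum(sizes.values())
print(split_waste, unsplit_waste)
```

With these numbers, aligning the three parts separately pads the region by 9980 bytes, against 1788 bytes if the region were aligned as a single unit; the difference is exactly two pages, which is the most that splitting one region into three parts can add.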
The fact that these patches have been around for a while, but never quite
made the jump into the mainline is unfortunate. There is no real person or
group that is currently shepherding core kernel security patches along,
though Kees Cook and Dan Rosenberg have recently been making an effort to push these kinds
of changes. Cook's query helped resurrect both of
these patches; they might have languished far longer without that
It is also worth noting that much or all of the protections embodied in
these patches have long been available in the grsecurity/PaX kernels.
While no wholesale import of the features from those kernels is ever going
to happen, piecemeal patches that implement "sane" (at least in Torvalds's eyes) features can be adopted.
That should lead to better kernel security, which is something that is
Page editor: Jonathan Corbet