The current development kernel is 3.1-rc1, released on August 7. According to Linus's announcement:
Notable? It depends on what you look for. VM writeback work? You
got it. And there was some controversy over the iscsi target
code. There's networking changes, there's the rest of the generic
ACL support moving into the VFS layer proper, simplifying the
filesystem code that was often cut-and-paste duplicated boiler
plate. And making us faster at doing it at the same time. And there
are power management interface cleanups.
But there's nothing *huge* here. Looks like a fairly normal release,
as I said. Unless I've forgotten something.
All the details can be found in the full changelog.
Stable updates: 3.0.1 was released
on August 4 with a long list of fixes. Two more stable updates for older
kernels came out on August 8. Greg has let it be known that the maintenance
period for the 2.6.33 kernel may be coming to an end before too long.
Comments (none posted)
We just have to understand, that preempt_disable, bh_disable,
irq_disable are cpu local BKLs with very subtle semantics which are
pretty close to the original BKL horror. All these mechanisms are
per cpu locks in fact and we need to get some annotation in place
which will help understandability and debugability in the first
place. The side effect that it will help RT to deal with that is -
of course desired from our side - but not the primary goal of that
-- Thomas Gleixner
In fact, I'm seriously considering a rather draconian measure for
next merge window: I'll fetch the -next tree when I open the merge
window, and if I get anything but trivial fixes that don't show up
in that "next tree at the point of merge window open", I'll just
ignore that pull request. Because clearly people are just not being
-- Linus Torvalds
We shouldn't do voodoo stuff. Or rather, I'm perfectly ok if you
guys all do your little wax figures of me in the privacy of your
own homes - freedom of religion and all that - but please don't do
it in the kernel.
-- Linus Torvalds
Every time I get frustrated with doing paperwork, I simply imagine
having the job of estimating how much time it takes to do
paperwork, and I feel better immediately.
Comments (none posted)
Kernel hacker Mel Gorman has released a test suite for the Linux memory
management subsystem. He has cleaned up some scripts that he uses and made
them less specific to particular patch sets. While not "comprehensive in
any way", they may be useful to others. He has also published some raw
results from tests that he has run recently. "I know the report structure
looks crude but I was not interested in making them pretty. Due to the fact
that some of the scripts are extremely old, the quality and coding styles
vary considerably. This may get cleaned up over time but in the meantime,
try and keep the contents of your stomach down if you are reading the
scripts."
Full Story (comments: 1)
Kernel development news
The 3.1 kernel will include a number of enhancements to the
ptrace() system call contributed by Tejun Heo. These improvements are meant
to make reliable debugging of programs easier, but Tejun, it seems, is not
one to be satisfied with mundane objectives like that. So he has posted an example program
showing how the new
features can be used to solve a difficult problem faced by
checkpoint/restart implementations: capturing and restoring the state of
network connections. The code is in an early stage of development; it's
audacious and scary, but it may show how interesting things can be done.
The traditional ptrace() API calls for a tracing program to attach
to a target process with the PTRACE_ATTACH command; that command
puts the target into a traced state and stops it in its tracks.
PTRACE_ATTACH has never been perfect; it changes the target's
signal handling and can never be entirely transparent to the target. So
Tejun supplemented it with a new PTRACE_SEIZE command;
PTRACE_SEIZE attaches to the target but does not stop it or change
its signal handling in any way. Stopping a seized process is done with
PTRACE_INTERRUPT which, again, does not send any signals or make
any signal handling changes. The result is a mechanism which enables the
manipulation of processes in a more transparent, less disruptive way.
All of this seems useful, but it does not necessarily seem like part of a
checkpoint/restart implementation. But it can help in an important way.
One of the problems associated with saving the state of a process is that
not all of that state is visible from user space. Getting around this
limitation has tended to involve doing checkpointing from within the kernel
or the addition of new interfaces to expose the required information;
neither approach is seen as ideal. But, in many cases, the required
information can be had by running in the context of the targeted process;
that is where an approach based on ptrace() can have a role to play.
Tejun took on the task of saving and restoring the state of an open TCP
connection for his example implementation. The process starts by using
ptrace() to seize and stop the target thread(s); then it's just a
matter of running some code in that process's context to get the requisite
information. To do so, Tejun's example program digs around in the target's
address space for a nice bit of memory which has execute permission; the
contents of that memory are saved and replaced by his "parasite" code. A
bit of register manipulation allows the target process to be restarted in
the injected code, which does the needed information gathering. Once
that's done, the original code and registers are restored, and the target
process is as it was before all this happened.
The "parasite" code starts by gathering the basic information about open
connections: IP addresses, ports, etc. The state of the receive side of
each connection is saved by (1) copying any buffered incoming data
using the MSG_PEEK option to recvmsg(), and
(2) getting the sequence number to be read next with a new
SIOCGINSEQ ioctl() command. On the transmit side, the
sequence number of each queued outgoing packet - along with the packet data
itself - must be captured with another pair of new ioctl()
commands. With that done, the checkpointing of the network connection is
complete.
Restarting the connection - possibly in a different process on a different
machine entirely - is a bit tricky; the kernel's idea of the connection
must be made to match the situation at checkpoint time without perturbing
or confusing the other side. That requires the restart code to pretend to
be the other side of the connection for as long as it takes to get things
in sync. The kernel already provides most of the machinery needed for this
task: outgoing packets can be intercepted with the "nf_queue" mechanism,
and a raw socket can be used to inject new packets that appear to be coming
from the remote side.
So, at restart time, things start by simply opening a new socket to the
remote end. Another new ioctl() command (SIOCSOUTSEQ) is
used to set the sequence number before connecting to make it match the
number found at checkpoint time. Once the connection process starts, the
outgoing SYN packet will be intercepted - the remote side will certainly
not be prepared to deal with it - and a SYN/ACK reply will be injected
locally. The outgoing ACK must also be intercepted and dropped on the
floor, of course. Once that is done, the kernel thinks it has an open
connection, with sequence numbers matching the pre-checkpoint connection,
to the remote side.
After that, it's a matter of restoring the incoming data that had been
found queued in the kernel at checkpoint time; that is done by injecting
new packets containing that data and intercepting the resulting ACKs from
the network stack. Outgoing data, instead, can be replaced with a series
of simple send() calls, but there is one little twist. Packets in
the outgoing queue may have already been transmitted and received by the
remote side. Retransmitting those packets is not a problem, as long as the
size of those packets remains the same. If, instead, the system uses different
offsets as it divides the outgoing data into packets, it can create
confusion at the remote end. To keep that from happening, Tejun added one
more ioctl() (SIOCFORCEOUTBD) to force the packets to
match those created before the checkpoint operation began.
Once the transmit queue is restored, the connection is back to its original
state. At this point, the interception of outgoing packets can stop.
All of this seems somewhat complex and fragile, but Tejun states that it
"actually works rather reliably." That said, there are a lot
of details that have been ignored; it is, after all, a proof-of-concept
implementation. It's not meant to be a complete solution to the problem of
restarting network connections; the idea is to show that the problem can,
indeed, be solved. If the user-space
checkpoint/restart work proceeds, it may well adopt some variant of
this approach at some point. In the meantime, though, what we have is a
fun hack showing what can be done with the new ptrace() commands.
Those wanting more details on how it works can find them in the
README file found in the example code repository.
Comments (35 posted)
In the beginning was the BIOS.
Actually, that's not true. Depending on where you start from, there
were either some toggle switches used to enter enough code to start
booting from something useful, a ROM that dumped you straight into a
language interpreter or a ROM that was just barely capable of reading
a file from tape or disk and going on from there. CP/M was usually one
of the latter, jumping to media that contained some hardware-specific
code and a relatively hardware-agnostic OS. The hardware-specific code
handled receiving and sending data, resulting in it being called the
"Basic Input/Output System." BIOS was born.
When IBM designed the PC they made a decision that probably seemed
inconsequential at the time but would end up shaping the entire PC
industry. Rather than leaving the BIOS on the boot media, they tied it
to the initial bootstrapping code and put it in ROM. Within a couple
of years vendors were shipping machines with reverse engineered BIOS
reimplementations and the PC clone market had come into existence.
There's very little beauty associated with the BIOS, but what it had
in its favor was functional hardware abstraction. It was possible to
write a fairly functional operating system using only the interfaces
provided by the system and video BIOSes, which meant that vendors
could modify system components and still ship unmodified install
media. Prices nosedived and the PC became almost ubiquitous.
The BIOS grew along with all of this. Various arbitrary limits were
gradually removed or at least papered over. We gained interfaces for
telling us how much RAM the system had above 64MB. We gained support
for increasingly large drives. Network booting became possible. But
the limit that eventually cemented the argument for moving away from the
traditional BIOS turned out
to be a very old problem. Hard drives still typically have 512 byte
sectors, and the MBR partition table used by BIOSes stores sector numbers in
32-bit variables. Partitions above 2TB? Not really happening. And
while in the past this would have been an excuse to standardize on
another BIOS extension, the world had changed. The legacy BIOS had
lasted for around 30 years without ever having a full
specification. The modern world wanted standards, compliance tests and
management capabilities. Something clearly had to be done.
And so for the want of a new partition table standard, EFI arrived in
the PC world.
Expedient Firmware Innovation
 Intel's other stated objection to Open Firmware was that it had
its own device tree which would have duplicated the ACPI device tree
that was going to be present in IA64 systems. One of the outcomes of
the OLPC project was an Open Firmware implementation that glued the
ACPI device tree into the Open Firmware one without anyone dying in
the process, while meanwhile EFI ended up allowing you to specify
devices in either the ACPI device tree or through a runtime enumerated
hardware path. The jokes would write themselves if they weren't too
 To be fair to Intel, choosing to have drivers be written in C
rather than Forth probably did make EFI more attractive to third-party
developers than Open Firmware did.
Intel had at least 99 problems in 1998, and IA64 was certainly one of
them. IA64 was supposed to be a break from the PC compatible market,
and so it made sense for it to have a new firmware implementation. The
90s had already seen several attempts at producing cross-platform
legacy-free firmware designs with the most notable probably being the
ARC standard that appeared on various MIPS and Alpha platforms and
Open Firmware, common on PowerPC and SPARCs. ARC mandated the presence
of certain hardware components and lacked any real process for
extending the specification, so it was passed over. Open Firmware was
more attractive but had a very limited third party developer
community, so the choice was made to start from scratch in the hope
that a third party developer community would be along
eventually. This was the Intel Boot Initiative, something that
would eventually grow into EFI.
EFI is intended to fulfill the same role as the old PC BIOS. It's a
pile of code that initializes the hardware and then provides a
consistent and fairly abstracted view of the hardware to the
operating system. It's enough to get your bootloader running and then for
that bootloader to find the rest of your OS. It's a specification
that's 2,210 pages long and still depends on the additional 727 pages
of the ACPI spec and numerous ancillary EFI specs. It's a standard
for the future that doesn't understand surrogate pairs and so can
never implement full Unicode support. It has a scripting environment
that looks more like DOS than you'd have believed possible. It's built
on top of a platform-independent open source core that's already
something like three times the size of a typical BIOS source
tree. It's the future of getting anything to run on your PC. This is EFI.
Eminently Forgettable Irritant
 The latest versions of EFI allow for a pre-PEI phase that verifies
that the EFI code hasn't been modified. We heard you like layers.
 Those of you paying attention have probably noticed that the PEI
sounds awfully like a BIOS, EFI sounds awfully like an OS and
bootloaders sound awfully like applications. There's nothing standing
between EFI and EMACS except a C library and a port of readline. This
probably just goes to show something, but I'm sure I don't know what.
The theory behind EFI is simple. At the lowest level is the Pre-EFI
Initialization (PEI) code, whose job it is to handle setting up the
low-level hardware such as the memory controller. As the entry point to
the firmware, the PEI layer also handles the first stages of resume
from S3 sleep. PEI then transfers control to the Driver Execution
Environment (DXE) and plays no further part in the running system.
The DXE layer is what's mostly thought of as EFI. It's a hardware-agnostic
core capable of loading drivers from the Firmware Volume
(effectively a filesystem in flash), providing a standardized set of
interfaces to everything that runs on top of it. From here it's a
short step to a bootloader and UI, and then you're off out of EFI and
you don't need to care any more.
The PEI is mostly uninteresting. It's the chipset-level secret sauce
that knows how to turn a system without working RAM into a system with
working RAM, which is a fine and worthy achievement but not typically
something an OS needs to care about. It'll bring your memory out of
self refresh and jump to the resume vector when you're coming out of
S3. Beyond that? It's an implementation detail. Let's ignore it.
The DXE is where things get interesting. This is the layer that
presents the interface embodied in the EFI specification. Devices with
bound drivers are represented by handles, and each handle may
implement any number of protocols. Protocols are uniquely identified
with a GUID. There's a LocateHandle() call that gives you a reference
to all handles that implement a given protocol, but how do you make
the LocateHandle() call in the first place?
This turns out to be far easier than it could be. Each EFI protocol is
represented by a table (ie, a structure) of data and function
pointers. There's a couple of special tables which represent boot
services (ie, calls that can be made while you're still in DXE) and
runtime services (ie, calls that can be made once you've transitioned
to the OS), and in turn these are contained within a global system
table. The system table is passed to the main function of any EFI
application, and walking it to find the boot services table then gives
a pointer to the LocateHandle() function.
So you're an EFI bootloader and you want to print something on the
console? This is made even easier by the presence of basic console I/O
functions in the global EFI system table, avoiding the need to search
for an appropriate protocol. A "Hello World" function would look something
like this:
EFI_STATUS
efi_main (EFI_HANDLE image, EFI_SYSTEM_TABLE *systab)
{
	SIMPLE_TEXT_OUTPUT_INTERFACE *conout = systab->ConOut;

	uefi_call_wrapper(conout->OutputString, 2, conout, L"Hello World!\n\r");
	return EFI_SUCCESS;
}
In comparison, graphics require slightly more effort:

extern EFI_GUID GraphicsOutputProtocol;

EFI_STATUS
efi_main (EFI_HANDLE image, EFI_SYSTEM_TABLE *systab)
{
	EFI_GRAPHICS_OUTPUT_PROTOCOL *gop;
	EFI_GRAPHICS_OUTPUT_MODE_INFORMATION *info;
	UINTN SizeOfInfo;

	uefi_call_wrapper(BS->LocateProtocol, 3, &GraphicsOutputProtocol,
			  NULL, &gop);
	uefi_call_wrapper(gop->QueryMode, 4, gop, 0, &SizeOfInfo, &info);
	Print(L"Mode 0 is running at %dx%d\n", info->HorizontalResolution,
	      info->VerticalResolution);
	return EFI_SUCCESS;
}
 Well, except that things are obviously more complicated. It's
possible for multiple device handles to implement a single protocol,
so you also need to work out whether you're speaking to the right
one. That can end up being trickier than you'd like it to be.
Here we've asked the firmware for the first instance of a device
implementing the Graphics Output Protocol. That gives us a table of
pointers to graphics related functionality, and we're free to call
them as we please.
Extremely Frustrating Issues
So far it all sounds straightforward from the bootloader
perspective. But EFI is full of surprising complexity and frustrating
corner cases, and so (unsurprisingly) attempting to work on any of
this rapidly leads to confusion, anger and a hangover. We'll explore
more of the problems in the next part of this article.
Comments (52 posted)
Network performance depends heavily on buffering at almost every point in
a packet's path. If the system wants to get full performance out of an
interface, it must ensure that the next packet is ready to go as soon as
the device is ready for it. But, as the developers working on bufferbloat
have confirmed, excessive buffering can lead to problems of its own. One
of the most annoying of those problems is latency; if an outgoing packet is
placed at the end of a very long queue, it will not be going anywhere for a
while. A classic example can be reproduced on almost any home network:
start a large outbound file copy operation and listen to the loud
complaints from the World of Warcraft player in the next room; it should be
noted that not all parents see this behavior as a bad thing. But, in
general, latency caused by excessive buffering is indeed worth fixing.
One assumes that the number of Warcraft players on the Google campus is
relatively small, but Google worries about latency anyway. Anything that
slows down response makes Google's services slower and less attractive.
So it is not surprising that we have seen various latency-reducing changes
from Google, including the increase in the
initial congestion window merged for 2.6.38. A more recent patch from Google's Tom Herbert
attacks latency caused by excessive buffering, but its future in its
current form is uncertain.
An outgoing packet may pass through several layers of buffering before it
hits the wire for even the first hop. There may be queues within the
originating application, in the network protocol code, in the traffic
control policy layers, in the device driver, and in the device itself - and
probably in several other places as well. A full solution to the buffering
problem will likely require addressing all of these issues, but each layer
will have its own concerns and will be a unique problem to solve. Tom's
patch is aimed at the last step in the system - buffering within the
device's internal transmit queue.
Any worthwhile network interface will support a ring of descriptors
describing packets which are waiting to be transmitted. If the interface
is busy, there should always be some packets buffered there; once the
transmission of one packet is complete, the interface should be able to
begin the next one without waiting for the kernel to respond. It makes
little sense, though, to buffer more packets in the device than is
necessary to keep the transmitter busy; anything more than that will just
add latency. Thus far, little thought has gone into how big that buffer
should be; the default is often too large. On your editor's system,
ethtool says that the length of the transmit ring is 256 packets;
on a 1G Ethernet, with 1500-byte packets, that ring would take roughly 3ms
to transmit completely. 3ms is a fair amount of latency to add to a local
transmission, and it's only one of several possible sources of latency. It
may well make sense to make that buffer smaller.
The problem, of course, is that the ideal buffer size varies considerably
from one system - and one workload - to the next. A lightly-loaded system
sending large packets can get by with a small number of buffered packets. If
the system is heavily loaded, more time may pass before the transmit queue
can be refilled, so that queue should be larger. If the packets being
transmitted are small, it will be necessary to buffer more of them. A few moments
spent thinking about the problem will make it clear that (1) the
number of packets is the wrong parameter to use for the size of the queue,
and (2) the queue length must be a dynamic parameter that responds to
the current load on the system. Expecting system administrators to tweak
transmit queue lengths manually seems like a losing strategy.
Tom's patch adds a new "dynamic queue limits" (DQL) library that is meant to be a
general-purpose queue length controller; on top of that he builds the "byte
queue limits" mechanism used within the networking layer. One of the key
observations is that the limit should be expressed in bytes rather than
packets, since the number of queued bytes more accurately approximates the
time required to empty the queue. To use this code, drivers must, when
queueing packets to the interface, make a
call to one of:
void netdev_sent_queue(struct net_device *dev, unsigned int pkts, unsigned int bytes);
void netdev_tx_sent_queue(struct netdev_queue *dev_queue, unsigned int pkts,
unsigned int bytes);
Either of these functions will note that the given number of bytes
have been queued to the given device. If the underlying DQL code
determines that the queue is long enough after adding these bytes, it will
tell the upper layers to pass no more data to the device for now.
When a transmission completes, the driver should call one of:
void netdev_completed_queue(struct net_device *dev, unsigned pkts, unsigned bytes);
void netdev_tx_completed_queue(struct netdev_queue *dev_queue, unsigned pkts,
			       unsigned bytes);
The DQL library will respond by reenabling the flow of packets into the
driver if the length of the queue has fallen far enough.
In the completion routine, the DQL code also occasionally tries to adjust
the queue length for optimal performance. If the queue becomes empty while
transmission has been turned off in the networking code, the queue is
clearly too short - there was not time to get more packets into the stream
before the transmitter came up dry. On the other hand, if the queue length
never goes below a given number of bytes, the maximum length can probably
be reduced by up to that many bytes. Over time, it is hoped that this
algorithm will settle on a reasonable length and that it will be able to
respond if the situation changes and a different length is called for.
The idea behind this patch makes sense, so nobody spoke out against it.
Stephen Hemminger did express concerns about the need to add explicit calls
to drivers to make it all work, though. The API for network drivers is
already complex; he would like to avoid making it more so if possible.
Stephen thinks that it should be possible to watch traffic flowing through
the device at the higher levels and control the queue length without any
knowledge or cooperation from the driver at all; Tom is not yet convinced
that this will work. It will probably take some time to figure out what
the best solution is, and the code could end up changing significantly
before we see dynamic transmit queue length control get into the mainline.
Comments (19 posted)
Patches and updates
- Peter Zijlstra: 3.0-rt7 (August 6, 2011)
Core kernel code
Filesystems and block I/O
Page editor: Jonathan Corbet
Next page: Distributions>>