
Kernel development

Brief items

Kernel release status

The current development kernel is 3.1-rc1, released on August 7. According to Linus:

Notable? It depends on what you look for. VM writeback work? You got it. And there was some controversy over the iscsi target code. There's networking changes, there's the rest of the generic ACL support moving into the VFS layer proper, simplifying the filesystem code that was often cut-and-paste duplicated boiler plate. And making us faster at doing it at the same time. And there are power management interface cleanups.

But there's nothing *huge* here. Looks like a fairly normal release, as I said. Unless I've forgotten something.

All the details can be found in the long-format changelog.

Stable updates: 3.0.1 was released on August 4 with a long list of fixes. 2.6.32.44 and 2.6.33.17 came out on August 8. Greg has let it be known that the maintenance period for the 2.6.33 kernel may be coming to an end before too long.

Comments (none posted)

Quotes of the week

We just have to understand, that preempt_disable, bh_disable, irq_disable are cpu local BKLs with very subtle semantics which are pretty close to the original BKL horror. All these mechanisms are per cpu locks in fact and we need to get some annotation in place which will help understandability and debugability in the first place. The side effect that it will help RT to deal with that is - of course desired from our side - but not the primary goal of that exercise.
-- Thomas Gleixner

In fact, I'm seriously considering a rather draconian measure for next merge window: I'll fetch the -next tree when I open the merge window, and if I get anything but trivial fixes that don't show up in that "next tree at the point of merge window open", I'll just ignore that pull request. Because clearly people are just not being careful enough.
-- Linus Torvalds

We shouldn't do voodoo stuff. Or rather, I'm perfectly ok if you guys all do your little wax figures of me in the privacy of your own homes - freedom of religion and all that - but please don't do it in the kernel.
-- Linus Torvalds

Every time I get frustrated with doing paperwork, I simply imagine having the job of estimating how much time it takes to do paperwork, and I feel better immediately.
-- Valerie Aurora

Comments (none posted)

Mel Gorman releases MMTests 0.01

Kernel hacker Mel Gorman has released a test suite for the Linux memory management subsystem. He has cleaned up some scripts that he uses and made them less specific to particular patch sets. While not "comprehensive in any way", they may be useful to others. He has also published some raw results on tests that he has run recently. "I know the report structure looks crude but I was not interested in making them pretty. Due to the fact that some of the scripts are extremely old, the quality and coding styles vary considerably. This may get cleaned up over time but in the meantime, try and keep the contents of your stomach down if you are reading the scripts."

Full Story (comments: 1)

Kernel development news

TCP connection hijacking and parasites - as a good thing

By Jonathan Corbet
August 9, 2011
The 3.1 kernel will include a number of enhancements to the ptrace() system call by Tejun Heo. These improvements are meant to make reliable debugging of programs easier, but Tejun, it seems, is not one to be satisfied with mundane objectives like that. So he has posted an example program showing how the new features can be used to solve a difficult problem faced by checkpoint/restart implementations: capturing and restoring the state of network connections. The code is in an early stage of development; it's audacious and scary, but it may show how interesting things can be done.

The traditional ptrace() API calls for a tracing program to attach to a target process with the PTRACE_ATTACH command; that command puts the target into a traced state and stops it in its tracks. PTRACE_ATTACH has never been perfect; it changes the target's signal handling and can never be entirely transparent to the target. So Tejun supplemented it with a new PTRACE_SEIZE command; PTRACE_SEIZE attaches to the target but does not stop it or change its signal handling in any way. Stopping a seized process is done with PTRACE_INTERRUPT which, again, does not send any signals or make any signal handling changes. The result is a mechanism which enables the manipulation of processes in a more transparent, less disruptive way.

All of this seems useful, but it does not necessarily seem like part of a checkpoint/restart implementation. But it can help in an important way. One of the problems associated with saving the state of a process is that not all of that state is visible from user space. Getting around this limitation has tended to involve doing checkpointing from within the kernel or the addition of new interfaces to expose the required information; neither approach is seen as ideal. But, in many cases, the required information can be had by running in the context of the targeted process; that is where an approach based on ptrace() can have a role to play.

Tejun took on the task of saving and restoring the state of an open TCP connection for his example implementation. The process starts by using ptrace() to seize and stop the target thread(s); then it's just a matter of running some code in that process's context to get the requisite information. To do so, Tejun's example program digs around in the target's address space for a nice bit of memory which has execute permission; the contents of that memory are saved and replaced by his "parasite" code. A bit of register manipulation allows the target process to be restarted in the injected code, which does the needed information gathering. Once that's done, the original code and registers are restored, and the target process is as it was before all this happened.

The "parasite" code starts by gathering the basic information about open connections: IP addresses, ports, etc. The state of the receive side of each connection is saved by (1) copying any buffered incoming data using the MSG_PEEK option to recvmsg(), and (2) getting the sequence number to be read next with a new SIOCGINSEQ ioctl() command. On the transmit side, the sequence number of each queued outgoing packet - along with the packet data itself - must be captured with another pair of new ioctl() commands. With that done, the checkpointing of the network connection is complete.

Restarting the connection - possibly in a different process on a different machine entirely - is a bit tricky; the kernel's idea of the connection must be made to match the situation at checkpoint time without perturbing or confusing the other side. That requires the restart code to pretend to be the other side of the connection for as long as it takes to get things in sync. The kernel already provides most of the machinery needed for this task: outgoing packets can be intercepted with the "nf_queue" mechanism, and a raw socket can be used to inject new packets that appear to be coming from the remote side.

So, at restart time, things start by simply opening a new socket to the remote end. Another new ioctl() command (SIOCSOUTSEQ) is used to set the sequence number before connecting to make it match the number found at checkpoint time. Once the connection process starts, the outgoing SYN packet will be intercepted - the remote side will certainly not be prepared to deal with it - and a SYN/ACK reply will be injected locally. The outgoing ACK must also be intercepted and dropped on the floor, of course. Once that is done, the kernel thinks it has an open connection, with sequence numbers matching the pre-checkpoint connection, to the remote side.

After that, it's a matter of restoring the incoming data that had been found queued in the kernel at checkpoint time; that is done by injecting new packets containing that data and intercepting the resulting ACKs from the network stack. Outgoing data, instead, can be replaced with a series of simple send() calls, but there is one little twist. Packets in the outgoing queue may have already been transmitted and received by the remote side. Retransmitting those packets is not a problem, as long as the size of those packets remains the same. If, instead, the system uses different offsets as it divides the outgoing data into packets, it can create confusion at the remote end. To keep that from happening, Tejun added one more ioctl() (SIOCFORCEOUTBD) to force the packets to match those created before the checkpoint operation began.

Once the transmit queue is restored, the connection is back to its original state. At this point, the interception of outgoing packets can stop.

All of this seems somewhat complex and fragile, but Tejun states that it "actually works rather reliably". That said, there are a lot of details that have been ignored; it is, after all, a proof-of-concept implementation. It's not meant to be a complete solution to the problem of checkpointing and restarting network connections; the idea is to show that the problem can, indeed, be solved. If the user-space checkpoint/restart work proceeds, it may well adopt some variant of this approach at some point. In the meantime, though, what we have is a fun hack showing what can be done with the new ptrace() commands. Those wanting more details on how it works can find them in the README file found in the example code repository.

Comments (35 posted)

The Extensible Firmware Interface - an introduction

August 9, 2011

This article was contributed by Matthew Garrett

In the beginning was the BIOS.

Actually, that's not true. Depending on where you start from, there were either toggle switches used to enter enough code to start booting from something useful, a ROM that dumped you straight into a language interpreter, or a ROM that was just barely capable of reading a file from tape or disk and going on from there. CP/M systems were usually in the last category, jumping to media that contained some hardware-specific code and a relatively hardware-agnostic OS. The hardware-specific code handled receiving and sending data, resulting in it being called the "Basic Input/Output System." BIOS was born.

When IBM designed the PC they made a decision that probably seemed inconsequential at the time but would end up shaping the entire PC industry. Rather than leaving the BIOS on the boot media, they tied it to the initial bootstrapping code and put it in ROM. Within a couple of years vendors were shipping machines with reverse engineered BIOS reimplementations and the PC clone market had come into existence.

There's very little beauty associated with the BIOS, but what it had in its favor was functional hardware abstraction. It was possible to write a fairly functional operating system using only the interfaces provided by the system and video BIOSes, which meant that vendors could modify system components and still ship unmodified install media. Prices nosedived and the PC became almost ubiquitous.

The BIOS grew along with all of this. Various arbitrary limits were gradually removed or at least papered over. We gained interfaces for telling us how much RAM the system had above 64MB. We gained support for increasingly large drives. Network booting became possible. But limits remained.

The one that eventually cemented the argument for moving away from the traditional BIOS turned out to be a very old problem. Hard drives still typically have 512 byte sectors, and the MBR partition table used by BIOSes stores sector numbers in 32-bit fields. Partitions above 2TB? Not really happening. And while in the past this would have been an excuse to standardize on another BIOS extension, the world had changed. The legacy BIOS had lasted for around 30 years without ever having a full specification. The modern world wanted standards, compliance tests and management capabilities. Something clearly had to be done.

And so for the want of a new partition table standard, EFI arrived in the PC world.

Expedient Firmware Innovation

[1] Intel's other stated objection to Open Firmware was that it had its own device tree which would have duplicated the ACPI device tree that was going to be present in IA64 systems. One of the outcomes of the OLPC project was an Open Firmware implementation that glued the ACPI device tree into the Open Firmware one without anyone dying in the process; EFI, meanwhile, ended up allowing you to specify devices either in the ACPI device tree or through a runtime-enumerated hardware path. The jokes would write themselves if they weren't too busy crying.

[2] To be fair to Intel, choosing to have drivers be written in C rather than Forth probably did make EFI more attractive to third-party developers than Open Firmware.

Intel had at least 99 problems in 1998, and IA64 was certainly one of them. IA64 was supposed to be a break from the PC-compatible market, and so it made sense for it to have a new firmware implementation. The 90s had already seen several attempts at producing cross-platform, legacy-free firmware designs, the most notable probably being the ARC standard that appeared on various MIPS and Alpha platforms, and Open Firmware, common on PowerPC and SPARC systems. ARC mandated the presence of certain hardware components and lacked any real process for extending the specification, so it got passed over. Open Firmware was more attractive but had a very limited third-party developer community[1], so the choice was made to start from scratch in the hope that a third-party developer community would come along eventually[2]. This was the Intel Boot Initiative, something that would eventually grow into EFI.

EFI is intended to fulfill the same role as the old PC BIOS. It's a pile of code that initializes the hardware and then provides a consistent and fairly abstracted view of the hardware to the operating system. It's enough to get your bootloader running and, then, for that bootloader to find the rest of your OS. It's a specification that's 2,210 pages long and still depends on the additional 727 pages of the ACPI spec and numerous ancillary EFI specs. It's a standard for the future that doesn't understand surrogate pairs and so can never implement full Unicode support. It has a scripting environment that looks more like DOS than you'd have believed possible. It's built on top of a platform-independent open source core that's already something like three times the size of a typical BIOS source tree. It's the future of getting anything to run on your PC. This is its story.

Eminently Forgettable Irritant

[3] The latest versions of EFI allow for a pre-PEI phase that verifies that the EFI code hasn't been modified. We heard you like layers.

[4] Those of you paying attention have probably noticed that the PEI sounds awfully like a BIOS, EFI sounds awfully like an OS and bootloaders sound awfully like applications. There's nothing standing between EFI and EMACS except a C library and a port of readline. This probably just goes to show something, but I'm sure I don't know what.

The theory behind EFI is simple. At the lowest level[3] is the Pre-EFI Initialization (PEI) code, whose job it is to handle setting up the low-level hardware such as the memory controller. As the entry point to the firmware, the PEI layer also handles the first stages of resume from S3 sleep. PEI then transfers control to the Driver Execution Environment (DXE) and plays no further part in the running system.

The DXE layer is what's mostly thought of as EFI. It's a hardware-agnostic core capable of loading drivers from the Firmware Volume (effectively a filesystem in flash), providing a standardized set of interfaces to everything that runs on top of it. From here it's a short step to a bootloader and UI, and then you're off out of EFI and you don't need to care any more[4].

The PEI is mostly uninteresting. It's the chipset-level secret sauce that knows how to turn a system without working RAM into a system with working RAM, which is a fine and worthy achievement but not typically something an OS needs to care about. It'll bring your memory out of self refresh and jump to the resume vector when you're coming out of S3. Beyond that? It's an implementation detail. Let's ignore it.

The DXE is where things get interesting. This is the layer that presents the interface embodied in the EFI specification. Devices with bound drivers are represented by handles, and each handle may implement any number of protocols. Protocols are uniquely identified with a GUID. There's a LocateHandle() call that gives you a reference to all handles that implement a given protocol, but how do you make the LocateHandle() call in the first place?

This turns out to be far easier than it could be. Each EFI protocol is represented by a table (i.e., a structure) of data and function pointers. There are a couple of special tables which represent boot services (i.e., calls that can be made while you're still in DXE) and runtime services (i.e., calls that can be made once you've transitioned to the OS), and in turn these are contained within a global system table. The system table is passed to the main function of any EFI application; walking it to find the boot services table then gives a pointer to the LocateHandle() function. Voilà.

So you're an EFI bootloader and you want to print something on the screen. This is made even easier by the presence of basic console I/O functions in the global EFI system table, avoiding the need to search for an appropriate protocol. A "Hello World" function would look something like this:

    #include <efi.h>
    #include <efilib.h>

    EFI_STATUS
    efi_main (EFI_HANDLE image, EFI_SYSTEM_TABLE *systab)
    {
        SIMPLE_TEXT_OUTPUT_INTERFACE *conout;

        conout = systab->ConOut;
        uefi_call_wrapper(conout->OutputString, 2, conout, L"Hello World!\n\r");

        return EFI_SUCCESS;
    }

In comparison, graphics require slightly more effort:

    #include <efi.h>
    #include <efilib.h>

    extern EFI_GUID GraphicsOutputProtocol;

    EFI_STATUS
    efi_main (EFI_HANDLE image, EFI_SYSTEM_TABLE *systab)
    {
        EFI_GRAPHICS_OUTPUT_PROTOCOL *gop;
        EFI_GRAPHICS_OUTPUT_MODE_INFORMATION *info;
        UINTN SizeOfInfo;

        uefi_call_wrapper(BS->LocateProtocol, 3, &GraphicsOutputProtocol,
                          NULL, &gop);
        uefi_call_wrapper(gop->QueryMode, 4, gop, 0, &SizeOfInfo, &info);

        Print(L"Mode 0 is running at %dx%d\n", info->HorizontalResolution,
              info->VerticalResolution);

        return EFI_SUCCESS;
    }

[5] Well, except that things are obviously more complicated. It's possible for multiple device handles to implement a single protocol, so you also need to work out whether you're speaking to the right one. That can end up being trickier than you'd like it to be.

Here we've asked the firmware for the first instance of a device implementing the Graphics Output Protocol. That gives us a table of pointers to graphics-related functionality, and we're free to call them as we please.[5]

Extremely Frustrating Issues

So far it all sounds straightforward from the bootloader perspective. But EFI is full of surprising complexity and frustrating corner cases, and so (unsurprisingly) attempting to work on any of this rapidly leads to confusion, anger and a hangover. We'll explore more of the problems in the next part of this article.

Comments (52 posted)

Network transmit queue limits

By Jonathan Corbet
August 9, 2011
Network performance depends heavily on buffering at almost every point in a packet's path. If the system wants to get full performance out of an interface, it must ensure that the next packet is ready to go as soon as the device is ready for it. But, as the developers working on bufferbloat have confirmed, excessive buffering can lead to problems of its own. One of the most annoying of those problems is latency; if an outgoing packet is placed at the end of a very long queue, it will not be going anywhere for a while. A classic example can be reproduced on almost any home network: start a large outbound file copy operation and listen to the loud complaints from the World of Warcraft player in the next room; it should be noted that not all parents see this behavior as a bad thing. But, in general, latency caused by excessive buffering is indeed worth fixing.

One assumes that the number of Warcraft players on the Google campus is relatively small, but Google worries about latency anyway. Anything that slows down response makes Google's services slower and less attractive. So it is not surprising that we have seen various latency-reducing changes from Google, including the increase in the initial congestion window merged for 2.6.38. A more recent patch from Google's Tom Herbert attacks latency caused by excessive buffering, but its future in its current form is uncertain.

An outgoing packet may pass through several layers of buffering before it hits the wire for even the first hop. There may be queues within the originating application, in the network protocol code, in the traffic control policy layers, in the device driver, and in the device itself - and probably in several other places as well. A full solution to the buffering problem will likely require addressing all of these issues, but each layer will have its own concerns and will be a unique problem to solve. Tom's patch is aimed at the last step in the system - buffering within the device's internal transmit queue.

Any worthwhile network interface will support a ring of descriptors describing packets which are waiting to be transmitted. If the interface is busy, there should always be some packets buffered there; once the transmission of one packet is complete, the interface should be able to begin the next one without waiting for the kernel to respond. It makes little sense, though, to buffer more packets in the device than is necessary to keep the transmitter busy; anything more than that will just add latency. Thus far, little thought has gone into how big that buffer should be; the default is often too large. On your editor's system, ethtool says that the length of the transmit ring is 256 packets; on a 1G Ethernet, with 1500-byte packets, that ring would take roughly 3ms to transmit completely. 3ms is a fair amount of latency to add to a local transmission, and it's only one of several possible sources of latency. It may well make sense to make that buffer smaller.

The problem, of course, is that the ideal buffer size varies considerably from one system - and one workload - to the next. A lightly-loaded system sending large packets can get by with a small number of buffered packets. If the system is heavily loaded, more time may pass before the transmit queue can be refilled, so that queue should be larger. If the packets being transmitted are small, it will be necessary to buffer more of them. A few moments spent thinking about the problem will make it clear that (1) the number of packets is the wrong parameter to use for the size of the queue, and (2) the queue length must be a dynamic parameter that responds to the current load on the system. Expecting system administrators to tweak transmit queue lengths manually seems like a losing strategy.

Tom's patch adds a new "dynamic queue limits" (DQL) library that is meant to be a general-purpose queue length controller; on top of that he builds the "byte queue limits" mechanism used within the networking layer. One of the key observations is that the limit should be expressed in bytes rather than packets, since the number of queued bytes more accurately approximates the time required to empty the queue. To use this code, drivers must, when queueing packets to the interface, make a call to one of:

    void netdev_sent_queue(struct net_device *dev, unsigned int pkts,
                           unsigned int bytes);
    void netdev_tx_sent_queue(struct netdev_queue *dev_queue, unsigned int pkts,
                              unsigned int bytes);

Either of these functions will note that the given number of bytes has been queued to the given device. If the underlying DQL code determines that the queue is long enough after adding these bytes, it will tell the upper layers to pass no more data to the device for now.

When a transmission completes, the driver should call one of:

    void netdev_completed_queue(struct net_device *dev, unsigned pkts,
                                unsigned bytes);
    void netdev_tx_completed_queue(struct netdev_queue *dev_queue, unsigned pkts,
                                   unsigned bytes);

The DQL library will respond by reenabling the flow of packets into the driver if the length of the queue has fallen far enough.

In the completion routine, the DQL code also occasionally tries to adjust the queue length for optimal performance. If the queue becomes empty while transmission has been turned off in the networking code, the queue is clearly too short - there was not time to get more packets into the stream before the transmitter came up dry. On the other hand, if the queue length never goes below a given number of bytes, the maximum length can probably be reduced by up to that many bytes. Over time, it is hoped that this algorithm will settle on a reasonable length and that it will be able to respond if the situation changes and a different length is called for.

The idea behind this patch makes sense, so nobody spoke out against it. Stephen Hemminger did express concerns about the need to add explicit calls to drivers to make it all work, though. The API for network drivers is already complex; he would like to avoid making it more so if possible. Stephen thinks that it should be possible to watch traffic flowing through the device at the higher levels and control the queue length without any knowledge or cooperation from the driver at all; Tom is not yet convinced that this will work. It will probably take some time to figure out what the best solution is, and the code could end up changing significantly before we see dynamic transmit queue length control get into the mainline.

Comments (20 posted)

Patches and updates

Kernel trees

Linus Torvalds: Linux 3.1-rc1
Greg KH: Linux 3.0.1
Peter Zijlstra: 3.0-rt7
Peter Zijlstra: 3.0.1-rt8
Greg KH: Linux 2.6.33.17
Greg KH: Linux 2.6.32.44

Architecture-specific

Core kernel code

Oleg Nesterov: utrace for 3.1 kernel

Development tools

Device drivers

Filesystems and block I/O

Memory management

Networking

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds