
Kernel development

Brief items

Kernel release status

The current 2.6 prepatch remains 2.6.23-rc1; no prepatches have been released over the last week. Well over 500 changesets have been merged into the mainline git repository since -rc1, though, and the -rc2 release is overdue. The changes are mostly fixes, but there is also the addition of the "literate" Lguest documentation, a mechanism by which kernel-space code can request notification when it is about to be preempted from the CPU, new configuration options for software suspend and hibernation, the removal of support for the SuperH sh73180 and sh7300 CPUs, AMD Geode LX framebuffer support, the removal of the arm26 port, and a TCP congestion control API change (pkts_acked() now receives the round-trip time in microseconds).

The current -mm tree is 2.6.23-rc1-mm2. Recent changes to -mm include support for multiple netconsole targets, a Sonics Silicon Backplane subsystem, and a bunch of reiser4 fixes.

For older kernels: a stable update was released on July 25 with about a dozen fixes. 2.4.35 was released on July 26 with a number of backported drivers and fixes.


Kernel development news

Quotes of the week

The tty layer is one of the very few pieces of kernel code that scares the hell out of me.
-- Ingo Molnar

I wish people would focus less on who wrote the actual code that got merged in the end, and more on the problem that got solved.... People who care about the desktop should be happy that the scheduler improved a lot due to the competition where the two new schedulers were hair-close in most aspects.
-- Arjan van de Ven

This spec says that systems which can not automatically go into suspend within 15 minutes of idle can _not_ earn a sticker. No sticker, no client computer sales to governments. If Linux can't get STR [suspend-to-RAM] working, broadly deployed, and enabled by default, then our plans for world domination are going to take a significant hit.
-- Len Brown


Suspend and hibernation status report

Rafael Wysocki, the current maintainer of the suspend and hibernation code in the kernel, has put together a lengthy document describing the current state of the art. "this document is intended as an introductory presentation of the current (ie. as in the 2.6.23-rc1 kernel) design of the suspend (ie. suspend-to-RAM and standby) and hibernation code, the status of it, known problems with it and the future development plans." It's a long read, but a worthwhile one for anybody with an interest in this subsystem.


Controlling memory use in containers

By Jonathan Corbet
July 31, 2007
"Containers" is the term normally applied to a lightweight virtualization approach where all guest systems run on the host system's kernel (as opposed to running their own kernel on a special virtual machine). The container technique tends to be more efficient at run time, but it poses challenges of its own; since every container runs on the same kernel, a whole series of internal barriers must be created to give each container the illusion of having a machine to itself. The addition of these barriers to the Linux kernel has been a multi-year process as the various projects working in this area work out a set of interfaces that works for everybody.

An important part of a complete container implementation is resource control; it is hard to maintain the fiction of a separate machine for each container if one of those containers is hogging the entire system. Extensive resource management patches have received a chilly reception in the past, but a properly done implementation based on the process containers framework might just make it in. The CFS group scheduling patch can be seen as one type of container-based resource management. But there is far more than just the CPU to worry about.

One of the most contended resources on many systems is core memory. A container which grows without bound and forces other containers out to swap will lead to widespread grumbling on the linux-kernel list. In an effort to avoid this unfortunate occurrence, Balbir Singh and Pavel Emelianov have been working on a container-based memory controller implementation. This patch is now in its fourth iteration.

The patch starts with a simple "resource counter" abstraction which is meant to be used with containers. It will work with any resource which can be described with simple, integer values for the maximum allowed and current usage. Methods are provided to enable hooking these counters into container objects and allowing them to be queried and set from user space.
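Such a counter might look something like the following sketch; the names and locking details here are illustrative guesses based on the description above, not necessarily the patch's exact interface:

    /* Illustrative sketch of a simple resource counter; names and
     * details are guesses based on the description in the text. */
    struct res_counter {
        unsigned long long usage;    /* current resource consumption */
        unsigned long long limit;    /* maximum allowed consumption */
        unsigned long long failcnt;  /* number of failed charge attempts */
        spinlock_t lock;             /* protects the fields above */
    };

    /* Charge "val" units against a counter, failing if the limit
     * would be exceeded. */
    static inline int res_counter_charge(struct res_counter *cnt,
                                         unsigned long val)
    {
        unsigned long flags;
        int ret = 0;

        spin_lock_irqsave(&cnt->lock, flags);
        if (cnt->usage + val > cnt->limit) {
            cnt->failcnt++;
            ret = -ENOMEM;
        } else {
            cnt->usage += val;
        }
        spin_unlock_irqrestore(&cnt->lock, flags);
        return ret;
    }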

These counters are pressed into service to monitor the memory use by each container. Memory use can be thought of as the current resident set: the number of resident pages which processes within the container have mapped into their virtual address spaces. Unlike some previous patches, though, the current memory controller also tries to track page cache usage. So a program which is very small, but which brings a lot of data from the filesystem into the page cache, will be seen as using a lot of memory.

To track per-container page usage, the memory controller must hook into the low-level page cache and reclaim code. It must also have a place to store information about which container each page is charged to. To that end, a new structure is created:

    struct meta_page {
        struct list_head lru;                 /* entry in the per-container LRU list */
        struct page *page;                    /* the page being tracked */
        struct mem_container *mem_container;  /* container charged for this page */
        atomic_t ref_cnt;                     /* reference count */
    };
Global memory management is handled by way of two least-recently-used (LRU) lists, the hope being that the pages which have been unused for the longest are the safest ones to evict when memory gets tight. Once containers are added to the mix, though, global management is not enough. So the meta_page structure allows each page to be put onto a separate, container-specific LRU list. When a process within a container brings in a page and causes the container to bump up against its memory limit, the kernel must, if it is to enforce that limit, push some of the container's other pages out. When that situation arises, the container-specific LRU list is traversed to find reclaimable pages belonging to the container without having to pass over unrelated memory.
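In rough terms, the charge-on-fault path described above might look like this sketch (the function names here are hypothetical, not the patch's actual interface):

    /* Sketch of the charging logic: account a newly-faulted page to
     * its container, reclaiming from that container's own LRU list
     * if the limit has been hit.  Names are hypothetical. */
    int mem_container_charge(struct page *page, struct mem_container *mem)
    {
        while (res_counter_charge(&mem->counter, PAGE_SIZE)) {
            /* Over the limit: reclaim from this container's LRU
             * list only, leaving other containers' pages alone. */
            if (!try_to_free_container_pages(mem))
                return -ENOMEM;  /* reclaim failed; the OOM killer looms */
        }
        add_page_to_container_lru(mem, page);  /* track on container LRU */
        return 0;
    }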

The page structure in the global memory map gains a pointer to the associated meta_page structure. There is also a new page flag allocated for locking that structure. There is no meta_page structure for kernel-specific pages, but one is created for every user-space or page cache page - even for processes which have not explicitly been assigned to a container (those processes are implicitly placed in a default, global container). There is, thus, a significant cost associated with the memory controller - the addition of five pointers (one in struct page, four in struct meta_page) and one atomic_t for every active page in the system can only hurt.
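To put a rough number on that cost: on a 64-bit system, five pointers and an atomic_t come to something like 44 bytes per page which, with 4KB pages, means on the order of 1% of memory going to the tracking structures alone.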

With this mechanism in place, the kernel is able to implement basic memory usage control for containers. One little issue remains: what happens when the kernel is unable to force a container's memory usage below its limit? In that case, the dreaded out-of-memory killer comes into play; there is a special version of the OOM killer which restricts its predations to a single container. So, in this situation, some process will die, but other containers should be unaffected.

One interesting aspect of the problem which appears to not have been completely addressed is pages which are used by processes in more than one container. Many shared libraries will fall into this category, but much page cache usage could as well. The current code charges a page to the first container which makes use of it. Should the page be chosen to be evicted, it will be unmapped from all containers; if a different container then faults the page in, that container will be charged for it going forward. So, over time, the reclaim mechanism may well cause the charging of shared pages to be spread across the containers on the system - or to accumulate in a single, unlimited container, should one exist. Determining whether real problems could result from this mechanism will require extensive testing with a variety of workloads, and, one suspects, that effort has barely begun.

For now we have a memory controller framework which appears capable of doing the core job - a good start. It is clearly applicable to the general container problem, but might just prove useful in other situations as well. A system administrator might not want to implement full-blown containers, but might be interested in, for example, keeping filesystem-intensive background jobs (updatedb, backups, etc.) within certain memory bounds. Users could put a boundary around a memory-hungry application, say, to keep it from pushing everything else of interest out of memory. There would seem to be enough value here to justify the inclusion of this patch - though a bit of work may be required first.


Large receive offload

By Jonathan Corbet
August 1, 2007
High-performance networking is continually faced with a challenge: network speeds are increasing more quickly than processor and memory speeds. So every time the venerable Ethernet technology provides another speed increment, networking developers must find ways to enable the rest of the system to keep up - even on fast contemporary hardware.

One recurring idea is to push more of the work into the networking hardware itself. TCP offload engines have been around since the days when systems were having trouble keeping up with 10Mb Ethernet, but that technology has always been limited in its acceptance; see this 2005 LWN article for some discussion of why. But some more restrained hardware assist techniques have been more successful; for example, TCP segmentation offload (TSO), where network adapters turn a stream of data into packets for transmission, is well supported under Linux.

Use of TSO can boost networking performance considerably. When one is dealing with thousands of packets every second, even a slight per-packet assist will add up. TSO reduces the amount of work needed to build headers and checksum the data, and it cuts down on the number of times that the driver must program operations into the network adapter. There is, however, no analogous assistance for incoming data. So, if you have two identical Linux servers with one sending a high-bandwidth stream to the other, the receiving side may be barely keeping up with the load while the transmitting side barely breaks a sweat.

Proposals for assistance for packet reception often go under the name "large receive offload" (LRO); the idea was first proposed for Linux in this OLS 2005 talk [PDF]. The initial LRO implementation used hardware features found in Neterion adapters; it never made it into the mainline and little has been heard from that direction since. The LRO idea has recently returned, though, in the form of this patch by Jan-Bernd Themann. Interestingly, the new LRO code does not require any hardware assistance at all.

With Jan-Bernd's patch, a driver which wants to support LRO must fill in an LRO manager structure, which looks like this:

    #include <linux/inet_lro.h>

    struct net_lro_mgr {
        struct net_device *dev;
        struct net_lro_stats stats;

        unsigned long features;

        u32 ip_summed; /* Options to be set in generated SKB in page mode */
        int max_desc;  /* Max number of LRO descriptors */
        int max_aggr;  /* Max number of LRO packets to be aggregated */

        struct net_lro_desc *lro_arr; /* Array of LRO descriptors */

        /*
         * Optimized driver functions
         *
         * get_skb_header: returns tcp and ip header for packet in SKB
         */
        int (*get_skb_header)(struct sk_buff *skb, void **ip_hdr,
                              void **tcpudp_hdr, u64 *hdr_flags, void *priv);

        /*
         * get_frag_header: returns mac, tcp and ip header for packet in SKB
         *
         * @hdr_flags: Indicate what kind of LRO has to be done
         *             (IPv4/IPv6/TCP/UDP)
         */
        int (*get_frag_header)(struct skb_frag_struct *frag, void **mac_hdr,
                               void **ip_hdr, void **tcpudp_hdr, u64 *hdr_flags,
                               void *priv);
    };

In this structure, dev is the network interface for which LRO is to be implemented; stats contains some statistics which can be queried to see how well LRO is working. The features field controls how the LRO code should feed packets into the networking stack; it has two flags defined currently:

  • LRO_F_NAPI says that the driver is NAPI compliant, and that, in particular, packets should be passed upward with netif_receive_skb().

  • LRO_F_EXTRACT_VLAN_ID is for drivers with VLAN support. This article won't go further into VLAN support for the simple reason that your editor does not understand it.

Checksum information for the final packets should go into ip_summed. The maximum number of "LRO descriptors" should be stored in max_desc. Each descriptor describes one TCP stream, so the maximum limits the number of streams for which LRO can be done simultaneously. Increasing the maximum requires more memory and will slow things a bit, since packets are matched to streams by way of a linear search. max_aggr is the maximum number of incoming packets which will be aggregated into a single, larger packet. The lro_arr array contains the descriptors for tracking streams; the driver should provide it with enough memory for at least max_desc structures or very unpleasant things are likely to happen.
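As an illustration, LRO setup in a hypothetical driver might look something like this (the my_adapter structure and the specific limits are invented for the example):

    #include <linux/inet_lro.h>

    #define MY_LRO_MAX_DESC 8          /* track up to eight TCP streams */

    /* Hypothetical driver-private structure */
    struct my_adapter {
        struct net_device *netdev;
        struct net_lro_mgr lro_mgr;
        struct net_lro_desc lro_desc[MY_LRO_MAX_DESC];
    };

    static void my_setup_lro(struct my_adapter *adapter)
    {
        struct net_lro_mgr *mgr = &adapter->lro_mgr;

        mgr->dev = adapter->netdev;
        mgr->features = LRO_F_NAPI;              /* NAPI-compliant driver */
        mgr->ip_summed = CHECKSUM_UNNECESSARY;   /* hardware checked the checksums */
        mgr->max_desc = MY_LRO_MAX_DESC;
        mgr->max_aggr = 32;                      /* merge at most 32 packets */
        mgr->lro_arr = adapter->lro_desc;        /* room for max_desc descriptors */
        mgr->get_skb_header = my_get_skb_header; /* shown below */
    }

Note that the lro_desc array here provides exactly the max_desc descriptors that the warning above calls for.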

Finally, there are the get_skb_header() and get_frag_header() methods. Their job is to locate the IP and TCP headers in a packet as quickly as possible. Typically a driver will only provide one of the two functions, depending on how it feeds packets into the LRO aggregation code.
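For a driver which aggregates IPv4 TCP packets from complete sk_buffs, get_skb_header() might look like this sketch (simplified; a production driver would validate the headers more carefully):

    /* Locate the IP and TCP headers in a received skb and tell the
     * LRO code what kind of packet this is. */
    static int my_get_skb_header(struct sk_buff *skb, void **ip_hdr,
                                 void **tcpudp_hdr, u64 *hdr_flags, void *priv)
    {
        struct ethhdr *eth = (struct ethhdr *)skb->data;
        struct iphdr *iph;

        if (eth->h_proto != htons(ETH_P_IP))
            return -1;                  /* not IPv4; don't aggregate */

        iph = (struct iphdr *)(skb->data + ETH_HLEN);
        if (iph->protocol != IPPROTO_TCP)
            return -1;                  /* not TCP; don't aggregate */

        *ip_hdr = iph;
        *tcpudp_hdr = (u8 *)iph + (iph->ihl << 2);
        *hdr_flags = LRO_IPV4 | LRO_TCP;
        return 0;
    }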

A driver which receives packets in fully-completed sk_buff structures would normally pass them up directly to the network stack with netif_rx() or netif_receive_skb(). If LRO is being done, instead, the packets should be handed to:

    void lro_receive_skb(struct net_lro_mgr *lro_mgr,
                         struct sk_buff *skb,
                         void *priv);

This function will attempt to identify an LRO descriptor for the given packet, creating one if need be. Then it will try to join that packet with any others in the stream, making one large, fragmented packet. In the process, it will call the driver's get_skb_header() method, passing through the pointer given as priv. If the packet cannot be aggregated with others (it may not be a TCP packet, for example, or it could have TCP options which require it to be processed separately) it will be passed directly to the network stack. Either way, the driver can consider it delivered and move on to its next task.

Some drivers receive packets directly into memory represented by page structures, constructing the full sk_buff structure after reception. For such drivers, the interface is:

    void lro_receive_frags(struct net_lro_mgr *lro_mgr,
                           struct skb_frag_struct *frags,
                           int len, int true_size,
                           void *priv, __wsum sum);

The LRO code will build the necessary sk_buff structure, perhaps aggregating fragments from several packets, and (sooner or later) feed the results to the network stack. It will call the driver's get_frag_header() method to locate the headers in the first element of the frags array; that method should also ensure that the packet is an IPv4 TCP packet and set LRO_IPV4 and LRO_TCP in the flags argument if so.
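A page-mode receive path might hand fragments over in a manner like this sketch (my_rx_desc and its fields are hypothetical driver internals):

    /* Sketch of a page-mode receive path: the hardware has placed a
     * packet directly into a page; describe it as a fragment and let
     * the LRO code build the sk_buff. */
    static void my_receive_page(struct my_adapter *adapter,
                                struct my_rx_desc *desc)
    {
        struct skb_frag_struct frag = {
            .page        = desc->page,    /* where the data landed */
            .page_offset = 0,
            .size        = desc->length,
        };

        lro_receive_frags(&adapter->lro_mgr, &frag,
                          desc->length,   /* packet length */
                          PAGE_SIZE,      /* true (allocated) size */
                          NULL,           /* priv for get_frag_header() */
                          desc->csum);    /* hardware-computed checksum */
    }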

Combined packets will be pushed up into the network stack whenever max_aggr individual packets have been merged into them. Delaying data for too long while waiting for additional packets is not a good idea, though; occasionally packets should be sent on even if they are not as large as they could be. The function for this job is:

    void lro_flush_all(struct net_lro_mgr *lro_mgr);

It will cause all accumulated packets to be sent on. A logical place for such a call might be at the end of a NAPI driver's poll() method. An individual stream can be flushed with:

    void lro_flush_pkt(struct net_lro_mgr *lro_mgr,
		       struct iphdr *iph, 
		       struct tcphdr *tcph);

This call will locate the stream associated with the given IP and TCP headers and send its accumulated data onward. It will not add any data associated with the given headers; the packet associated with those headers should have already been added with one of the receive functions if need be.
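Putting the pieces together, a NAPI driver's poll() method might use this interface in a manner something like the following sketch (the my_*() helpers are hypothetical, and the NAPI completion logic is omitted for brevity):

    /* Sketch of a (2.6.23-era) NAPI poll() routine using LRO. */
    static int my_poll(struct net_device *netdev, int *budget)
    {
        struct my_adapter *adapter = netdev_priv(netdev);
        int limit = min(*budget, netdev->quota);
        struct sk_buff *skb;
        int received = 0;

        while (received < limit && (skb = my_next_rx_packet(adapter))) {
            /* Hand each completed packet to the LRO code rather
             * than calling netif_receive_skb() directly. */
            lro_receive_skb(&adapter->lro_mgr, skb, NULL);
            received++;
        }

        /* Don't sit on partially aggregated data; push whatever has
         * accumulated up into the network stack now. */
        lro_flush_all(&adapter->lro_mgr);

        *budget -= received;
        netdev->quota -= received;
        return received >= limit;  /* nonzero: more packets await */
    }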

That is, for all practical purposes, the entire interface. One might well wonder how this code can improve performance, given that it is just aggregating packets which have already been received in the usual way by the driver. The answer is that it is reducing the number of packets that the network stack has to work with, cutting the per-packet overhead at higher levels in the stack. A clever driver can, using the struct page approach, also reduce the number of memory allocations required for each packet, which can be a big win. So LRO appears to be worth having, and current plans call for it to be merged in 2.6.24.


i386 and x86_64: back together?

By Jonathan Corbet
July 31, 2007
The arch directory in the kernel source tree contains all of the architecture-specific code. There is a lot of code there, despite years of work by the development community to make things generic whenever possible. There are currently 26 different top-level architectures supported by Linux, many of which contain a number of sub-architectures. Two of those top-level architectures are i386 (the original Linux architecture) and x86_64, which is the 64-bit big brother to i386. There is quite a bit of commonality between those two architectures, and some efforts have been made to share code between them whenever possible. Even so, the source trees for the two architectures remain distinct from each other.

In the view of some developers, at least, the separation of the two architecture trees is a problem. A bug fix applied to one is often applicable to the other as well, but it is not clear that all fixes are being made in both places. New features, too, must be added twice. It is relatively easy to break one architecture while working on the other. Developers working on architecture-specific projects - virtualization is mentioned often - end up having to do a lot of work to keep up with two strongly related trees. In response to this kind of pressure, the 32-bit and 64-bit PowerPC architectures were merged into a single architecture tree in 2.6.15, and the general consensus seems to be that it was a good move. But no such merger has happened for the x86 variants.

That may be about to change, though: Thomas Gleixner and Ingo Molnar recently posted a patch to merge the two architectures with a request for comments. This patch is huge: it weighs in at over 9MB and touches 1764 files. It is so tied to the current state of the kernel tree that it can only be reasonably applied to one specific commit point in the git repository. This is not the patch which is meant to be applied, though; its purpose is to show what the final result would look like. If and when the time comes to actually merge this patch, it will be done differently:

As a next step we plan to generate a gradual, fully bisectable, fully working switchover from the current code to the fully populated arch/x86 tree. It will result in about 1000-2000 commits.

That is a little intimidating as well. Knowing this, the developers of this patch have gone out of their way to make it possible to apply the change with a high level of confidence. So there will be no code changes associated with the merger: it will be possible to build the exact same kernel image from the source tree before and after the change.

The patch creates a new architecture called x86 and moves everything from the two existing architectures over. In the small number of cases where each architecture has an identical copy of the same file, only a single file is created in the new tree. More often, though, the two architectures have a file by the same name in the same place, but their contents differ. In such cases, both files are moved into the new tree with a _32 or _64 suffix, depending on where it came from. So, for example, both architectures contain kernel/ioport.c; the new x86 architecture has ioport_32.c and ioport_64.c. Some simple trickery is then employed to ensure that the correct files for the target architecture are built.

In many (if not most) cases, there is a great deal of common code in the two files, and that common code is left there. The idea at this stage of the game is to get the two architecture trees together without affecting the resulting kernel; that is probably the only way that such a big change would ever be accepted. Once things have been merged, the opportunities for eliminating duplicated code between individual files will become more apparent - the files will usually be right next to each other. One imagines that an army of code janitors would swoop in to do this work, much of which would be relatively straightforward. Once it's done, we would have a shiny new, merged architecture with duplicated code squeezed out, and everybody would be happy.

Or maybe not. Andi Kleen has expressed his opposition to this change:

I think it's a bad idea because it means we can never get rid of any old junk. IMNSHO arch/x86_64 is significantly cleaner and simpler in many ways than arch/i386 and I would like to preserve that. Also in general arch/x86_64 is much easier to hack than arch/i386 because it's easier to regression test and in general has to care about much less junk. And I don't know of any way to ever fix that for i386 besides splitting the old stuff off completely.

Andi, by virtue of being the maintainer of the i386 and x86_64 architectures, has a relatively strong voice in this discussion. His core argument - that splitting the architectures allows lots of legacy issues to be confined to the i386 tree - reflects a common practice in kernel code management. Code which only supports relatively new hardware tends to be a lot cleaner than code which handles older devices as well, but removal of support for hardware which is still in use is frowned upon. So, instead, a new subsystem is created for the newer stuff, with the idea that the legacy code can be supported separately until it withers away. A classic example is the way that serial ATA support was implemented within its own subsystem instead of being an addition to the IDE code. Andi, along with a few others, argues that x86-family processor support should be handled in the same way.

Most of the participants in the early discussion would appear to disagree with Andi, though. Unlike legacy IDE devices, it is argued, the 32-bit architecture is not going to disappear anytime soon. The number of quirks which are truly unique to the i386 architecture is seen as being relatively small. Linus argues that it's easier to carry forward legacy code when it's part of a shared tree than when it's shoved off into a corner. Judging from the conversation which followed the initial posting, there is a near-consensus that the unified tree is the right way to go.

There were a couple of suggestions that the patch could go directly into 2.6.23, but it is probably just as well that things did not happen that way. 2.6.23 has a lot of new stuff already, and this patch is new. Allowing a cycle for the work to age can only be helpful. Besides, we have not yet seen a repository with those 1000 or so separate commits in it.

More to the point, though: the real discussion on the merger has not yet happened. To rework two architectures into one over the objections of the maintainer would be an extraordinary step verging on a hostile takeover of the code. Maintainers do not have absolute veto power over patches, but overriding a maintainer on a patch this big is not something which is done lightly. So the developers of the unified x86 architecture patch have one big challenge remaining: they have solved the technical issues nicely, and they have convinced much of the development community that this change should be made. But it would be in the best interests of everybody involved if they could find a way to convince the maintainer of the code they are working with as well.


Page editor: Jonathan Corbet

Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds