Brief items
The current 2.6 prepatch remains 2.6.23-rc1; no prepatches have been
released over the last week. Well over 500 changesets have been merged
into the mainline git repository since -rc1, though, and the -rc2 release
is overdue. The changes are mostly fixes, but they also include the
addition of the "literate" Lguest documentation; a mechanism by which
kernel-space code can request notification when it is about to be
preempted from the CPU; new configuration options for software suspend and
hibernation; the removal of support for the SuperH sh73180 and sh7300
CPUs; AMD Geode LX framebuffer support; the removal of the arm26 port; and
a TCP congestion control API change (pkts_acked() now receives the
round-trip time in microseconds).
The current -mm tree is 2.6.23-rc1-mm2. Recent changes
to -mm include support for multiple netconsole targets, the Sonics Silicon
Backplane subsystem, and a bunch of reiser4 fixes.
For older kernels: 2.6.16.53 was released on
July 25 with about a dozen fixes. 2.4.35 was released on
July 26 with a number of backported drivers and fixes.
Comments (none posted)
Kernel development news
The tty layer is one of the very few pieces of kernel code that
scares the hell out of me.
--
Ingo Molnar
I wish people would focus less on who wrote the actual code that
got merged in the end, and more on the problem that got
solved.... People who care about the desktop should be happy that
the scheduler improved a lot due to the competition where the two
new schedulers were hair-close in most aspects.
--
Arjan van de Ven
This spec says that systems which can not automatically go into
suspend within 15 minutes of idle can _not_ earn a sticker. No
sticker, no client computer sales to governments. If Linux can't
get STR [suspend-to-RAM] working, broadly deployed, and enabled by default, then
our plans for world domination are going to take a significant
hit.
--
Len Brown
Comments (3 posted)
Rafael Wysocki, the current maintainer of the suspend and hibernation code
in the kernel, has put together a lengthy document describing the current
state of the art. "
this document is intended as an
introductory presentation of the current (ie. as in the 2.6.23-rc1 kernel)
design of the suspend (ie. suspend-to-RAM and standby) and hibernation code,
the status of it, known problems with it and the future development
plans." It's a long read but interesting for those who are
interested in this subsystem.
Full Story (comments: 5)
By Jonathan Corbet
July 31, 2007
"Containers" is the term normally applied to a lightweight virtualization
approach where all guest systems run on the host system's kernel (as
opposed to running their own kernel on a special virtual machine). The
container technique tends to be more efficient at run time, but it poses
challenges of its own; since every container runs on the same kernel, a
whole series of internal barriers must be created to give each container
the illusion of having a machine to itself. The addition of these barriers
to the Linux kernel has been a multi-year process as the various projects
working in this area hammer out a set of interfaces that works for everybody.
An important part of a complete container implementation is resource
control; it is hard to maintain the fiction of a separate machine for each
container if one of those containers is hogging the entire system.
Extensive resource management patches have received a chilly reception in the past,
but a properly done implementation based on the process containers framework
might just make it in. The CFS
group scheduling patch can be seen as one type of container-based
resource management. But there is far more than just the CPU to worry
about.
One of the most contended resources on many systems is core memory. A
container which grows without bound and forces other containers out to swap
will lead to widespread grumbling on the linux-kernel list. In an effort
to avoid this unfortunate occurrence, Balbir Singh and Pavel Emelianov have
been working on a container-based
memory controller implementation. This patch is now in its fourth
iteration.
The patch starts with a simple "resource counter" abstraction which is
meant to be used with containers. It will work with any resource which can
be described with simple, integer values for the maximum allowed and
current usage. Methods are provided to enable hooking these counters into
container objects and allowing them to be queried and set from user space.
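As an illustration, such a counter might look something like the sketch
below; the field and function names here are guesses made for
illustration, not the patch's actual interface:

    struct res_counter {
        unsigned long usage;   /* current usage of the resource */
        unsigned long limit;   /* maximum allowed usage */
        spinlock_t lock;       /* protects the fields above */
    };

    /* Charge "val" units to the counter, failing if the limit would
     * be exceeded; again, an illustrative sketch only. */
    static inline int res_counter_charge(struct res_counter *cnt,
                                         unsigned long val)
    {
        int ret = 0;

        spin_lock(&cnt->lock);
        if (cnt->usage + val > cnt->limit)
            ret = -ENOMEM;
        else
            cnt->usage += val;
        spin_unlock(&cnt->lock);
        return ret;
    }

The user-space query and set operations would then be simple reads and
writes of the limit and usage fields, performed under the lock.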
These counters are pressed into service to monitor the memory use by each
container. Memory use can be thought of as the current resident set: the
number of resident pages which processes within the container have mapped
into their virtual address spaces. Unlike some previous patches, though,
the current memory controller also tries to track page cache usage. So a
program which is very small, but which brings a lot of data from the
filesystem into the page cache, will be seen as using a lot of memory.
To track per-container page usage, the memory controller must hook into the
low-level page cache and reclaim code. It must also have a place to store
information about which container each page is charged to. To that end, a
new structure is created:
    struct meta_page {
        struct list_head lru;                 /* links into the container's LRU list */
        struct page *page;                    /* the page being tracked */
        struct mem_container *mem_container;  /* container charged for this page */
        atomic_t ref_cnt;                     /* references to this meta_page */
    };
Global memory management is handled by way of two least-recently-used (LRU)
lists, the hope being that the pages which have been unused for the longest
are the safest ones to evict when memory gets tight. Once containers are
added to the mix, though, global management is not enough. So the
meta_page structure allows each page to be put onto a separate,
container-specific LRU list. When a process within a container brings in a
page and causes the container to bump up against its memory limit, the
kernel must, if it is to enforce that limit, push some of the container's
other pages out. When that situation arises, the container-specific LRU
list is traversed to find reclaimable pages belonging to the container
without having to pass over unrelated memory.
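Expressed very roughly in code, the control flow just described might look
like the following; every name in this sketch (beyond the res_counter idea
above) is hypothetical rather than taken from the actual patch:

    /* Hypothetical charge path; function names are invented for
     * illustration and do not come from the patch itself. */
    static int mem_container_charge(struct page *page,
                                    struct mem_container *mem)
    {
        /* Try to charge one page to the container's counter... */
        while (res_counter_charge(&mem->res, PAGE_SIZE) != 0) {
            /* ...reclaiming from this container's own LRU on failure */
            if (!reclaim_pages_from_container(mem))
                return -ENOMEM;  /* may end in the per-container OOM killer */
        }
        /* Success: attach a meta_page and add it to the container's LRU */
        add_page_to_container_lru(mem, page);
        return 0;
    }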
The page structure in the global memory map gains a pointer to the
associated meta_page structure. There is also a new page flag
allocated for locking that structure. There is no meta_page
structure for kernel-specific pages, but one is created for every
user-space or page cache page - even for processes which have not
explicitly been assigned to a container (those processes are implicitly
placed in a default, global container). There is, thus, a significant
cost associated with the memory controller - the addition of five pointers
(one in struct page, four in struct meta_page) and one
atomic_t for every active page in the system can only hurt.
With this mechanism in place, the kernel is able to implement basic memory
usage control for containers. One little issue remains: what happens when
the kernel is unable to force a container's memory usage below its limit?
In that case, the dreaded out-of-memory killer comes into play; there is a
special version of the OOM killer which restricts its predations to a
single container. So, in this situation, some process will die, but other
containers should be unaffected.
One interesting aspect of the problem which appears not to have been
completely addressed is pages which are used by processes in more than one
container.
Many shared libraries will fall into this category, but much page cache
usage could as well. The current code charges a page to the
first container which makes use of it. Should the page be chosen to be
evicted, it will be unmapped from all containers; if a different container
then faults the page in, that container will be charged for it going
forward. So, over time, the reclaim mechanism
may well cause the charging of shared pages to be spread across the
containers on the system - or to accumulate in a single, unlimited
container, should one exist.
Determining whether real problems could result from this mechanism will
require extensive testing with a variety of workloads, and, one suspects,
that effort has barely begun.
For now we have a memory controller framework which appears to be capable
of doing the core job, which is a good start. It is clearly applicable to
the general container problem, but might just prove useful in other
situations as well. A system administrator might not want to implement
full-blown containers, but might be interested in, for example, keeping
filesystem-intensive background jobs (updatedb, backups, etc.)
within certain memory bounds. Users could put a boundary around, say,
OpenOffice.org to keep it from pushing everything else of interest out of
memory. There would seem to be enough value here to justify the inclusion
of this patch - though a bit of work may be required first.
Comments (15 posted)
By Jonathan Corbet
August 1, 2007
High-performance networking is continually faced with a challenge: local
network speeds increase more quickly than processor and
memory speeds. So every time that the venerable Ethernet technology
provides another speed increment, networking developers must find ways to
enable the rest of the system to keep up - even on fast contemporary
hardware.
One recurring idea is to push more of the work into the
networking hardware itself. TCP offload engines have been around since the
days when systems were having trouble keeping up with 10Mb Ethernet, but
that technology has always been limited in its acceptance; see this 2005 LWN article for some
discussion of why. But some more restrained hardware assist techniques
have been more successful; for example, TCP segmentation offload (TSO), where
network adapters turn a stream of data into packets for transmission, is
well supported under Linux.
Use of TSO can boost networking performance considerably. When one is
dealing with thousands of packets every second, even a slight per-packet
assist will add up. TSO reduces the amount of work needed to build headers
and checksum the data, and it cuts down on the number of times that the
driver must program operations into the network adapter. There is,
however, no analogous assistance for incoming data. So, if you have two
identical Linux servers with one sending a high-bandwidth stream to the
other, the receiving side may be barely keeping up with the load while the
transmitting side barely breaks a sweat.
Proposals for assistance for packet reception often go under the name
"large receive offload" (LRO); the idea was first proposed for Linux in this
OLS 2005 talk [PDF]. The initial LRO implementation used hardware
features found in Neterion adapters; it never made it into the mainline and
little has been heard from that direction since. The LRO idea has recently
returned, though, in the form of this patch by Jan-Bernd
Themann. Interestingly, the new LRO code does not require any hardware
assistance at all.
With Jan-Bernd's patch, a driver must, to support LRO, fill in an LRO
manager structure which looks like this:
#include <linux/inet_lro.h>
    struct net_lro_mgr {
        struct net_device *dev;
        struct net_lro_stats stats;
        unsigned long features;
        u32 ip_summed;      /* Options to be set in generated SKB in page mode */
        int max_desc;       /* Max number of LRO descriptors */
        int max_aggr;       /* Max number of LRO packets to be aggregated */
        struct net_lro_desc *lro_arr;  /* Array of LRO descriptors */

        /*
         * Optimized driver functions
         *
         * get_skb_header: returns tcp and ip header for packet in SKB
         */
        int (*get_skb_header)(struct sk_buff *skb, void **ip_hdr,
                              void **tcpudp_hdr, u64 *hdr_flags, void *priv);

        /*
         * get_frag_header: returns mac, tcp and ip header for packet in SKB
         *
         * @hdr_flags: Indicate what kind of LRO has to be done
         *             (IPv4/IPv6/TCP/UDP)
         */
        int (*get_frag_header)(struct skb_frag_struct *frag, void **mac_hdr,
                               void **ip_hdr, void **tcpudp_hdr, u64 *hdr_flags,
                               void *priv);
    };
In this structure, dev is the network interface for which LRO is
to be implemented; stats contains some statistics which can be
queried to see how well LRO is working. The features field
controls how the LRO code should feed packets into the networking stack; it
has two flags defined currently:
- LRO_F_NAPI says that the driver is NAPI compliant, and that, in
particular, packets should be passed upward with
netif_receive_skb().
- LRO_F_EXTRACT_VLAN_ID is for drivers with VLAN support. This
article won't go further into VLAN support for the simple reason that
your editor does not understand it.
Checksum information for the final packets should go into
ip_summed. The maximum number of "LRO descriptors" should be
stored in max_desc. Each descriptor describes one TCP stream, so
the maximum limits the number of streams for which LRO can be done
simultaneously. Increasing the maximum requires more memory and will slow
things a bit, since packets are matched to streams by way of a linear
search. max_aggr is the maximum number of incoming packets which
will be aggregated into a single, larger packet. The lro_arr
array contains the descriptors for tracking streams; the driver should
provide it with enough memory for at least max_desc structures or
very unpleasant things are likely to happen.
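Pulling those fields together, a driver's initialization code might fill
in the manager structure along these lines; the my_*() names and the
specific numbers are invented for this example, and the
my_get_skb_header() method is sketched below:

    #include <linux/inet_lro.h>

    #define MY_LRO_MAX_DESC   8   /* TCP streams tracked simultaneously */
    #define MY_LRO_MAX_AGGR  32   /* packets aggregated into one super-packet */

    static struct net_lro_desc my_lro_descs[MY_LRO_MAX_DESC];
    static struct net_lro_mgr my_lro_mgr;

    static void my_lro_setup(struct net_device *dev)
    {
        my_lro_mgr.dev = dev;
        my_lro_mgr.features = LRO_F_NAPI;   /* this driver is NAPI compliant */
        my_lro_mgr.ip_summed = CHECKSUM_UNNECESSARY;
        my_lro_mgr.max_desc = MY_LRO_MAX_DESC;
        my_lro_mgr.max_aggr = MY_LRO_MAX_AGGR;
        /* lro_arr must have room for at least max_desc descriptors */
        my_lro_mgr.lro_arr = my_lro_descs;
        my_lro_mgr.get_skb_header = my_get_skb_header;
    }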
Finally, there are the get_skb_header() and
get_frag_header() methods. Their job is to locate the IP and TCP
headers in a packet as quickly as possible. Typically a driver will only
provide one of the two functions, depending on how it feeds packets into
the LRO aggregation code.
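For instance, a get_skb_header() implementation for a simple driver might
look roughly like this sketch, which assumes untagged IPv4 frames with the
network header offset already set and is not taken from any real driver:

    static int my_get_skb_header(struct sk_buff *skb, void **iphdr,
                                 void **tcpudp_hdr, u64 *hdr_flags, void *priv)
    {
        struct iphdr *iph = ip_hdr(skb);

        /* Only IPv4 TCP packets are candidates for aggregation */
        if (skb->protocol != htons(ETH_P_IP) || iph->protocol != IPPROTO_TCP)
            return -1;

        *iphdr = iph;
        *tcpudp_hdr = (u8 *)iph + iph->ihl * 4;  /* TCP header follows IP header */
        *hdr_flags = LRO_IPV4 | LRO_TCP;
        return 0;
    }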
A driver which receives packets in fully-completed sk_buff
structures would normally pass them up directly to the network stack with
netif_rx() or netif_receive_skb(). When LRO is in use, the packets
should instead be handed to:
    void lro_receive_skb(struct net_lro_mgr *lro_mgr,
                         struct sk_buff *skb,
                         void *priv);
This function will attempt to identify an LRO descriptor for the given
packet, creating one if need be. Then it will try to join that packet with
any others in the stream, making one large, fragmented packet. In the
process, it will call the driver's get_skb_header() method,
passing through the pointer given as priv. If the packet cannot
be aggregated with others (it may not be a TCP packet, for example, or it
could have TCP options which require it to be processed separately) it will
be passed directly to the network stack. Either way, the driver can
consider it delivered and move on to its next task.
Some drivers receive packets directly into memory represented by
page structures, constructing the full sk_buff structure
after reception. For such drivers, the interface is:
    void lro_receive_frags(struct net_lro_mgr *lro_mgr,
                           struct skb_frag_struct *frags,
                           int len, int true_size,
                           void *priv, __wsum sum);
The LRO code will build the necessary sk_buff structure, perhaps
aggregating fragments from several packets, and (sooner or later) feed the
results to the network stack. It will call the driver's
get_frag_header() method to locate the headers in the first
element of the frags array; that method should also ensure that
the packet is an IPv4 TCP packet and set LRO_IPV4 and
LRO_TCP in the flags argument if so.
Combined packets will be pushed up into the network stack whenever
max_aggr individual packets have been merged into them. Delaying
data for too long while waiting for additional packets is not a good idea,
though; occasionally packets should be sent on even if they are not as
large as they could be. The function for this job is:
void lro_flush_all(struct net_lro_mgr *lro_mgr);
It will cause all accumulated packets to be sent on. A logical place for such a call
might be at the end of a NAPI driver's poll() method. An
individual stream can be flushed with:
    void lro_flush_pkt(struct net_lro_mgr *lro_mgr,
                       struct iphdr *iph,
                       struct tcphdr *tcph);
This call will locate the stream associated with the given IP and TCP
headers and send its accumulated data onward. It will not add any
data associated with the given headers; the packet associated with those
headers should have already been added with one of the receive functions if
need be.
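Putting it all together, an old-style NAPI poll() method using this
interface might look something like the following sketch; the my_*()
helpers stand in for driver internals and the return-value handling is
simplified:

    static int my_poll(struct net_device *dev, int *budget)
    {
        struct sk_buff *skb;
        int done = 0;

        /* Feed completed packets to LRO rather than netif_receive_skb() */
        while (done < *budget && (skb = my_next_rx_packet(dev)) != NULL) {
            lro_receive_skb(&my_lro_mgr, skb, NULL);
            done++;
        }

        /* Push out any partially aggregated packets before returning */
        lro_flush_all(&my_lro_mgr);

        *budget -= done;
        dev->quota -= done;
        return done ? 1 : 0;  /* simplified; a real driver does more here */
    }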
That is, for all practical purposes, the entire interface. One might well
wonder how this code can improve performance, given that it is just
aggregating packets which have already been received in the usual way by
the driver. The answer is that it is reducing the number of packets that
the network stack has to work with, cutting the per-packet overhead at
higher levels in the stack. A clever driver can, using the struct
page approach, also reduce the number of memory allocations required
for each packet, which can be a big win. So LRO appears to be worth
having, and current plans call for it to be merged in 2.6.24.
Comments (1 posted)
By Jonathan Corbet
July 31, 2007
The
arch directory in the kernel source tree contains all of the
architecture-specific code. There is a lot of code there, despite years of
work by the development community to make things generic whenever
possible. There are currently 26 different top-level architectures
supported by Linux, many of which contain a number of sub-architectures.
Two of those top-level architectures are i386 (the original Linux architecture)
and x86_64, which is the
64-bit big brother to i386. There is quite a bit of commonality between
those two architectures, and some efforts have been made to share code
between them whenever possible. Even so, the source trees for the two
architectures remain distinct from each other.
In the view of some developers, at least, the separation of the two
architecture trees is a problem. A bug fix which must be applied to one
often is applicable to the other, but it's not clear that all fixes are
being made in both places. New features, too, must be added twice. It is
relatively easy to break one architecture while working on the other.
Developers working on architecture-specific projects - virtualization is
mentioned often - end up having to do a lot of work to keep up with two
strongly related trees. In response to this kind of pressure, the 32-bit
and 64-bit PowerPC architectures were merged into a single architecture
tree in 2.6.15, and the general consensus seems to be that it was a good
move. But no such merger has happened for the x86 variants.
That may be about to change, though: Thomas Gleixner and Ingo Molnar
recently posted a patch to merge the two
architectures with a request for comments. This patch is huge: it
weighs in at over 9MB and touches 1764 files. It is so tied to the current
state of the kernel tree that it can only be reasonably applied to one
specific commit point in the git repository. This is not the patch which
is meant to be applied, though; its purpose is to show what the final
result would look like. If and when the time comes to actually merge this
patch, it will be done differently:
As a next step we plan to generate a gradual, fully bisectable,
fully working switchover from the current code to the fully
populated arch/x86 tree. It will result in about 1000-2000 commits.
That is a little intimidating as well. Knowing this, the developers of
this patch have gone out of their way to make it possible to apply the
change with a high level of confidence. So there will be no code changes
associated with the merger: it will be possible to build the exact same
kernel image from the source tree before and after the change.
The patch creates a new architecture called x86 and moves
everything from the two existing architectures over. In the small number
of cases where each architecture has an identical copy of the same file,
only a single file is created in the new tree. More often, though, the two
architectures have a file by the same name in the same place, but their
contents differ. In such cases, both files are moved into the new tree
with a _32 or _64 suffix, depending on where each file came
from. So, for example, both architectures contain
kernel/ioport.c; the new x86 architecture has
ioport_32.c and ioport_64.c. Some simple trickery is
then employed to ensure that the correct files for the target architecture
are built.
In many (if not most) cases, there is a great deal of common code in the
two files, and that common code is left there. The idea at this stage of
the game is to get the two architecture trees together without affecting
the resulting kernel; that is probably the only way that such a big change
would ever be accepted. Once things have been merged, the opportunities
for eliminating duplicated code between individual files will become more
apparent - the files will usually be right next to each other. One
imagines that an army of code janitors would swoop in to do this work, much
of which would be relatively straightforward. Once it's done, we would
have a shiny new, merged architecture with duplicated code squeezed out,
and everybody would be happy.
Or maybe not. Andi Kleen has expressed his
opposition to this change:
I think it's a bad idea because it means we can never get rid of
any old junk. IMNSHO arch/x86_64 is significantly cleaner and
simpler in many ways than arch/i386 and I would like to preserve
that. Also in general arch/x86_64 is much easier to hack than
arch/i386 because it's easier to regression test and in general has
to care about much less junk. And I don't know of any way to ever
fix that for i386 besides splitting the old stuff off completely.
Andi, by virtue of being the maintainer of the i386 and x86_64
architectures, has a relatively strong voice in this discussion. His core
argument - that splitting the architectures allows lots of legacy issues to
be confined to the i386 tree - reflects a common practice in kernel code
management. Code which only supports relatively new hardware tends to be a
lot cleaner than code which handles older devices as well, but removal of
support for hardware which is still in use is frowned upon. So, instead, a
new subsystem is created for the newer stuff, with the idea that the legacy
code can be supported separately until it withers away. A classic example
is the way that serial ATA support was implemented within its own subsystem
instead of being an addition to the IDE code. Andi, along with a few
others, argues that x86-family processor support should be handled in the
same way.
Most of the participants in the early discussion would appear to disagree
with Andi, though. Unlike legacy IDE devices, it is argued, the 32-bit
architecture is not going to disappear anytime soon. The number of quirks
which are truly unique to the i386 architecture is seen as being relatively
small. Linus argues that it's easier to
carry forward legacy code when it's part of a shared tree than when it's
shoved off into a corner. Judging from the conversation which followed the
initial posting, there is a near-consensus that the unified tree is the
right way to go.
There were a couple of suggestions that the patch could go directly into
2.6.23, but it is probably just as well that things did not happen that
way. 2.6.23 has a lot of new stuff already, and this patch is itself brand new.
Allowing a cycle for the work to age can only be helpful. Besides, we have
not yet seen a repository with those 1000 or so separate commits in it.
More to the point, though: the real
discussion on the merger has not yet happened. To rework two architectures
into one over the objections of the maintainer would be an extraordinary
step verging on a hostile takeover of the code. Maintainers do not have
absolute veto power over patches, but overriding a maintainer on a patch
this big is not something which is done lightly. So the developers of the
unified x86 architecture patch have one big challenge remaining: they have
solved the technical issues nicely, and they have convinced much of the
development community that this change should be made. But it would be in
the best interests of everybody involved if they could find a way to
convince the maintainer of the code they are working with as well.
Comments (9 posted)
Page editor: Jonathan Corbet