Kernel release status
The current 2.6 prepatch is 2.6.20-rc7,
released on January 30.
Says Linus: "Yes, I know I said I would only do -rc6 and then the
final 2.6.20, but the thing is, the known regressions list didn't get
whittled down as quickly as I hoped, and as a result we now have a
-rc7." There's a fair number of fixes in this release, but not much
else.
Previously, 2.6.20-rc6 was
released on January 24. It includes quite a few fixes and a couple of
new memory technology device (flash) drivers.
As of this writing, no patches have been added to the mainline git
repository since the -rc7 release.
The current -mm tree is 2.6.20-rc6-mm3. Recent changes
to -mm include a big ACPI update, a new set of dynamic tick and
high-resolution timer patches, sysfs shadow directory support, a rework of
page cache accounting, preemptible RCU,
and a massive set of sysctl() cleanup patches.
For older kernels: 2.6.16.39 was released on
January 31. It fixes a relatively small number of problems, none of
which have immediately obvious security implications.
Kernel development news
Quotes of the week
[T]he time taken to do a community graphics driver for any GPU where
specs have been available approaches infinity, unless the vendor
actually does the driver or pays someone to do the driver the hope
of a community supported driver reaching maturity while the product
is still available is slim.
-- Dave Airlie
So yes, if a user reports a bug that's attributable to a single bit
memory error that's otherwise unreproduced and unexplained, it's
totally reasonable to chalk it up to cosmic rays until some sort of
pattern of reports emerges.
-- Matt Mackall
Free Linux driver development offered
Greg Kroah-Hartman has sent out an offer to the hardware industry: the
kernel development community will write its device drivers for free.
"No longer do you have to suffer through
all of the different examples in the Linux Device Driver Kit, or pick
through the thousands of example drivers in the Linux kernel source
tree trying to determine which one is the closest to what you need to
do." There is nothing new here, of course, but it is a clear
description of the benefits of providing hardware information.
Full Story (comments: 11)
A report from the Linux wireless developers meeting
The Linux Foundation ran a meeting of wireless
networking developers in London in mid-January. Attendee/organizer Stephen
Hemminger has written up a report of the event; click below for the full
text. "Overall, the summit was very productive despite (or because of) the lack of
Internet access. The main new items coming out of it were: a commitment to
make an experimental wireless tarball (and driver) packages available; progress
on the new cfg80211 API; and an understanding of the regulatory environment
that vendors have to operate in."
Full Story (comments: 14)
A summary of 2.6.20 API changes
As of this writing the final 2.6.20 kernel has not yet happened. It is
close, however. Since any internal API changes meant for 2.6.20 should
have happened at least a month ago, it should be safe to put together a summary of
the most significant changes. There have been a few of them in this kernel
cycle, some of which caused widespread churn through the code base.
- The workqueue API has seen a
major rework which requires changes in almost any code using
workqueues. In short: there are now two different types of
workqueues, depending on whether the delay feature is to be used or
not. The work function no longer gets an arbitrary data pointer; its
argument, instead, is a pointer to the work_struct structure
describing the job. If you have code which is broken by these
changes, this set of
instructions by David Howells is likely to be helpful.
- Some additional workqueue changes have been merged as well. There is
a new "freezable" workqueue type, indicating a workqueue which can be
safely frozen during the software suspend process. The new function
create_freezeable_workqueue() will create one. Another new
function, run_scheduled_work(), will cause a
previously-scheduled workqueue entry to be run synchronously. Note
that run_scheduled_work() cannot be used with delayed
workqueues.
- Much of the sysfs-related code has been changed to use struct
device in place of struct class_device. The latter
structure will eventually go away as the class and device mechanisms
are merged.
- There is a new function:
int device_move(struct device *dev, struct device *new_parent);
This function will reparent the given device to new_parent,
making the requisite sysfs changes and generating a special
KOBJ_MOVE event for user space.
- A number of kernel header files which included other headers no longer
do so. For example, <linux/fs.h> no longer includes
<linux/sched.h>. These changes should speed kernel
build times by getting rid of a large number of unneeded includes, but
might break some out-of-tree modules which do not explicitly include
all the headers they need.
- The internal __alloc_skb() function has a new parameter,
being the number of the NUMA node on which the structure should be
allocated.
- The slab allocator API has been cleaned up somewhat. The old
kmem_cache_t typedef is gone;
struct kmem_cache should be used instead. The various
slab flags (SLAB_ATOMIC, SLAB_KERNEL, ...) were all
just aliases for the equivalent GFP_ flags, so they have been
removed.
- A new boot-time parameter (prof=sleep) causes the kernel to
profile the amount of time spent in uninterruptible sleeps.
- dma_cache_sync() has a new argument: the device
structure for the device doing DMA.
- The paravirt_ops code
has gone in, making it easier for the kernel to support multiple
hypervisors. Anybody wanting to port a hypervisor to this code should
note that it is somewhat volatile and likely to remain that way for
some time.
- The struct path
changes have been merged, with changes rippling through the
filesystem and device driver subsystems. In short, code accessing the
dentry pointer from a struct file pointer, which used to read
file->f_dentry, should now read
file->f_path.dentry. There are defines making the older
style of code work - for now.
- There is now a generic layer for human input devices; the USB HID code
has been switched over to this new layer.
- A new function, round_jiffies(), rounds a jiffies value up to
the next full second (plus a per-CPU offset). Its purpose is to
encourage timeouts to occur together, with the result that the CPU
wakes up less frequently.
- The block "activity function," a callback intended for the
implementation of disk activity lights in software, has been removed;
nobody was actually using it.
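The workqueue change at the top of this list is easiest to see in code. What follows is a user-space sketch, not kernel code: the work_struct here is a cut-down stand-in, and my_device, my_device_work(), and run_work() are hypothetical names invented for illustration. It shows the usual way a handler recovers its own data now that it receives a pointer to the work_struct rather than an arbitrary data pointer: embed the work item in the larger structure and step back to it with container_of().

```c
#include <assert.h>
#include <stddef.h>

/* User-space model of the 2.6.20 convention: the handler receives the
 * work_struct itself, not an arbitrary data pointer. */
struct work_struct {
    void (*func)(struct work_struct *work);
};

/* Same idea as the kernel's container_of(): step back from a member
 * to the structure that embeds it. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical driver state with the work item embedded in it. */
struct my_device {
    int pending_events;
    struct work_struct work;
};

/* The handler finds its device from the work_struct it was handed. */
static void my_device_work(struct work_struct *work)
{
    struct my_device *dev = container_of(work, struct my_device, work);

    dev->pending_events = 0;    /* "handle" everything that was queued */
}

/* Stand-in for the workqueue core running a queued item. */
static void run_work(struct work_struct *work)
{
    assert(work->func != NULL);
    work->func(work);
}
```

Code that previously passed its device pointer as the data argument would, under this scheme, embed the work item in the device structure instead.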
For those looking forward to what might happen in 2.6.21, a couple of
significant changes can be predicted. The old SA_* flags used
with request_irq() are likely to go away; the newer
IRQF_* flags should be used instead. There is also a timer API change waiting for
the next development cycle. Beyond that, a surprise or two is guaranteed;
watch LWN for the details as the patches get merged.
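To make the round_jiffies() entry in the list above more concrete, here is a small user-space model of the behavior as described there: round up to the next full second, then add a per-CPU offset. The HZ value, the CPU_SKEW constant, and the function name model_round_jiffies() are all assumptions made for this sketch; the in-kernel implementation may well differ in its details.

```c
#define HZ       1000UL   /* assumed tick rate */
#define CPU_SKEW 3UL      /* assumed per-CPU offset, in jiffies */

/* Simplified model: round j up to the next whole second, then add a
 * small per-CPU offset so that CPUs do not all wake on the same tick. */
static unsigned long model_round_jiffies(unsigned long j, unsigned int cpu)
{
    unsigned long rem = j % HZ;

    if (rem != 0)
        j += HZ - rem;          /* move up to the second boundary */
    return j + cpu * CPU_SKEW;
}
```

Two timeouts which would otherwise expire at jiffies 1234 and 1900 both land on 2000 in this model, so a single wakeup can serve them both.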
Network namespaces
In recent times there has been quite a bit of attention paid to hypervisors
and full virtualization (or paravirtualization) solutions. The proponents
of the container approach - where all virtualized systems run in
well-contained sandboxes on the host's kernel - have been relatively quiet.
They have not been idle, however, as can be seen in the large amount of
work going into network namespaces.
For the container approach to work, every global resource in the system
must be wrapped in some sort of namespace. This wrapping has been done for
some relatively simple resources, such as the utsname information or
process IDs; some of the resulting code has already found its way into the
mainline. There is not a whole lot of use, however, for containers which
are completely isolated from the rest of the world; usually some sort of
networking capability is needed. For example, containers can usefully
contain a web browser (keeping it from exposing the rest of the system
should it prove vulnerable) or a web server - but only if networking
works. But containers should not be able to see each others' packet
streams, and, ideally, should be able to bind to the same ports without
interfering with each other.
Making that work requires network namespaces. These namespaces virtualize
all access to network resources - interfaces, port numbers, etc. -
allowing each container the network access it needs (but no more). As with
all other problems in computer science, the network namespace issue can be
addressed with another layer of indirection. There is a small problem with
this approach, however: the networking code is a vast pile of complex,
highly-tuned code overseen by developers who have little tolerance for
changes which introduce performance overhead or potential bugs. Getting
any sort of network namespace implementation merged is going to require
quite a bit of very careful work.
One approach can be seen in the L2 network namespace patch set
posted recently by Dmitry Mishin. These patches concentrate on the lower
levels of the network stack, trying to get proper namespaces established
for network devices and the IPv4 layer. In an attempt to minimize churn in
the networking code, the L2 namespace patch introduces the idea of the
"current network namespace," kept in a per-CPU variable. The current
namespace is implemented as a stack, with push and pop operations; in
theory, it allows all network operations to happen within the proper
namespace. Your editor was unable to convince himself that this scheme
would work properly in the face of any sort of kernel preemption, but that
may just be a matter of not having looked hard enough.
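The "current namespace" stack might look something like the following user-space sketch. It is not taken from the patch set: the structure names, the fixed depth, and the function names push_net_ns(), pop_net_ns(), and current_net_ns() are all assumptions made for illustration. In the scheme described above, one such stack would exist for each CPU.

```c
#include <assert.h>
#include <stddef.h>

#define NS_STACK_DEPTH 8    /* assumed limit, for illustration only */

struct net_namespace { int id; };   /* hypothetical stand-in */

/* One of these would exist per CPU in the scheme described above. */
struct ns_stack {
    struct net_namespace *slots[NS_STACK_DEPTH];
    int top;                /* number of entries in use */
};

/* Enter a namespace: it becomes the implicit target of network calls. */
static void push_net_ns(struct ns_stack *s, struct net_namespace *ns)
{
    assert(s->top < NS_STACK_DEPTH);
    s->slots[s->top++] = ns;
}

/* Leave the namespace entered most recently. */
static void pop_net_ns(struct ns_stack *s)
{
    assert(s->top > 0);
    s->top--;
}

/* The namespace that every lookup would implicitly use. */
static struct net_namespace *current_net_ns(const struct ns_stack *s)
{
    return s->top > 0 ? s->slots[s->top - 1] : NULL;
}
```

The appeal is that existing functions need no new parameters; the cost is hidden state, which is where the preemption worry comes from.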
The net_device structure gains a net_ns field, providing
the namespace to which the device belongs. It is set to whatever namespace
is current when the device is created. The device lookup functions have
become namespace-aware; if a device does not belong to the current
namespace, it becomes invisible. A different version of the loopback
device is created for each namespace. Then, the IPv4 routing code has been
extended so that each namespace gets its own set of routing tables. The
code which matches incoming packets to sockets has also been made
namespace-aware; there is still a single hash table, but the namespace has
been made part of the match criteria.
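The single-table, namespace-in-the-key idea can be sketched in a few lines of user-space C. Everything here is hypothetical (the sock_entry structure, the tiny table, the port-only hash, and the function names); it is meant only to show how two namespaces can bind the same port in one shared hash table without seeing each other's sockets.

```c
#include <stddef.h>

#define TABLE_SIZE 16       /* assumed; a real table would be far larger */

struct net_namespace { int id; };   /* hypothetical stand-in */

struct sock_entry {
    unsigned short port;
    struct net_namespace *ns;       /* namespace which owns this socket */
    struct sock_entry *next;        /* hash chain */
};

static struct sock_entry *table[TABLE_SIZE];

/* Add a socket to the shared table; the namespace plays no part in the hash. */
static void bind_sock(struct sock_entry *sk)
{
    unsigned int h = sk->port % TABLE_SIZE;

    sk->next = table[h];
    table[h] = sk;
}

/* The namespace is part of the match criteria, not of the table layout. */
static struct sock_entry *lookup_sock(unsigned short port,
                                      const struct net_namespace *ns)
{
    struct sock_entry *sk;

    for (sk = table[port % TABLE_SIZE]; sk != NULL; sk = sk->next)
        if (sk->port == port && sk->ns == ns)
            return sk;
    return NULL;
}
```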
Network interfaces made up of real hardware will normally remain in the
root namespace. Communication with other namespaces is made possible by
way of a "virtual Ethernet" device, included with the patch set. A virtual
device can be thought of as a wire into a restricted namespace; it presents
one device within that namespace and one in the parent (normally root)
namespace. Packets written to one end show up at the other. With the
addition of a few routing rules in the root namespace, packets meeting the
right criteria can be directed into (and out of) specific namespaces.
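As a rough model of the "wire" idea, consider the following user-space sketch. The veth_end structure and the veth_connect() and veth_xmit() functions are invented for this example and do not reflect the actual device in the patch set; packets are reduced to strings to keep things short.

```c
#include <stddef.h>

/* One end of a hypothetical virtual Ethernet pair. */
struct veth_end {
    struct veth_end *peer;      /* the other end of the wire */
    const char *rx_last;        /* most recent packet received */
    unsigned int rx_count;      /* number of packets received */
};

/* Join two ends; one would live in the child namespace, one in the parent. */
static void veth_connect(struct veth_end *a, struct veth_end *b)
{
    a->peer = b;
    b->peer = a;
    a->rx_last = b->rx_last = NULL;
    a->rx_count = b->rx_count = 0;
}

/* Transmitting on one end is simply reception on the other. */
static void veth_xmit(struct veth_end *from, const char *packet)
{
    from->peer->rx_last = packet;
    from->peer->rx_count++;
}
```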
The L2 namespace patch provides the plumbing for the creation of little
virtualized Internets within a single system, but it does not yet provide
complete isolation. A process within its namespace can reconfigure its
interfaces, perhaps creating problems for the system as a whole.
Tightening things down is left to the L3 namespace patch, posted by
Daniel Lezcano. An L3 namespace is always the child of an L2 namespace; it
is the end of the line, however, being unable to have child namespaces of
its own. There are also no network admin capabilities in an L3 namespace;
once an L3 namespace is created, it is stuck with whatever network
configuration its parent gave it.
The end result is that a contained system can be put within an L3 namespace
and it should be able to perform networking without interfering with (or
even seeing) other systems in other namespaces.
A somewhat different approach can be seen in the network namespace patches
posted by Eric W. Biederman. Eric, aware of the challenges involved in
getting network namespaces merged, is far more concerned with the process
than the specific namespace implementation. So his patches focus mostly on
getting the internal APIs right.
The first step is to figure out how network namespaces are to be
represented. Rather than use a structure, Eric has opted for a mechanism
which marks all network-related global resources in a special way. These
resources get linked into a special section of the kernel which can be
cloned when a new namespace is created. Each global variable becomes an
offset into the per-namespace section; it must be accessed by way of a
special macro. This approach appears cumbersome, but it has a couple of
advantages. If a module with per-namespace variables is loaded, those
variables can be added to each existing namespace on the fly. And, if
namespaces are not in use, the overhead of the whole mechanism drops to
zero. This is an important feature: to have a hope of being merged, a
network namespace implementation will have to have no impact on systems
which are not using it.
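A user-space approximation of the cloned-section idea follows. It is a sketch under several assumptions: a plain structure (ns_template) stands in for the special linker section, the per_net() macro and the net_ns_init()/net_ns_free() functions are invented names, and the two variables are arbitrary examples. The point is only that each namespace holds a private copy of the initial data, and that a "global" is reached by its fixed position within that copy.

```c
#include <stdlib.h>
#include <string.h>

/* Initial values of every per-namespace variable, laid out back to back.
 * In the scheme described above, the linker would gather these into a
 * special section; a structure stands in for that section here. */
struct ns_template {
    int ip_forward;
    int max_backlog;
};

static const struct ns_template ns_initial = {
    .ip_forward = 0,
    .max_backlog = 300,
};

/* A namespace is little more than a private copy of the template. */
struct net_ns {
    unsigned char *base;
};

/* Create a namespace by cloning the initial data; returns 0 on success. */
static int net_ns_init(struct net_ns *ns)
{
    ns->base = malloc(sizeof(ns_initial));
    if (ns->base == NULL)
        return -1;
    memcpy(ns->base, &ns_initial, sizeof(ns_initial));
    return 0;
}

static void net_ns_free(struct net_ns *ns)
{
    free(ns->base);
    ns->base = NULL;
}

/* Hypothetical accessor: the variable lives at a fixed offset from the
 * start of whichever namespace copy is supplied. */
#define per_net(ns, field) (((struct ns_template *)(ns)->base)->field)
```

With this arrangement, per_net(&a, ip_forward) and per_net(&b, ip_forward) name different storage, while code which only ever uses one namespace could have the macro collapse to an ordinary global.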
The patch set (31 parts strong) then works through various parts of the
networking API, adding a namespace parameter to functions which need it.
There is no global "current namespace" concept in Eric's patches; it is,
instead, an explicit parameter everywhere. Thus, for example, every
function which creates a socket (they exist in every protocol
implementation) gets a namespace parameter. The sk_buff structure
(which represents a packet) has a namespace field assigned from either the
process creating it (for outbound packets) or the device it was received
from; the various protocol-specific functions are expected to take that
namespace into account. Functions dealing with netlink sockets get
namespace parameters, as do those which implement network device lookup, event
generation, and Unix-domain sockets. Like the L2 patches, Eric's
implementation includes a virtual network device (called "etun") which can
be used to route packets between namespaces.
Unlike the L2/L3 patches, Eric's work deals with the virtualization of the
networking-related /proc, sysctl, and sysfs interfaces. Doing so
requires adding shadow directory
support to sysfs. Shadow directories loosen the connection between
sysfs and the internal kobject hierarchy, allowing different namespaces to
see different contents in the same locations.
A key aspect of Eric's patch is that it implements little namespace
mechanism. Instead, much of the networking stack is made to test the
namespace it is given and fail if the root namespace is not in use. The
idea is to get the interfaces right first, then to start to fill in the
mechanism in relatively small pieces. The tests ensure that the network
stack will not surprise users by doing the wrong thing if it is not yet
fully prepared to handle non-root namespaces.
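The "fail unless root" pattern is simple enough to show in a few lines. This is a hypothetical sketch, not code from the patches: the names init_net_ns and example_create_socket() are invented, and the choice of -EINVAL as the error code is an assumption.

```c
#include <errno.h>

struct net_ns { int id; };          /* hypothetical stand-in */

static struct net_ns init_net_ns;   /* assumed name for the root namespace */

/* A socket-creation hook which has gained a namespace parameter but
 * which does not yet know how to work outside of the root namespace. */
static int example_create_socket(struct net_ns *ns, int type)
{
    if (ns != &init_net_ns)
        return -EINVAL;             /* assumed error code */

    (void)type;                     /* the existing logic would go here */
    return 0;
}
```

Once a given piece of the stack has been taught to handle other namespaces, its test can simply be removed.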
Despite the posting of all these patches, the amount of discussion has been
quite low. One gets the sense that the network developers have not yet
started to take these patches seriously. This issue seems unlikely to go
away, however; there remains a great deal of interest in getting container
features into the mainline kernel. Sooner or later, this discussion is
likely to take off.
Fibrils and asynchronous system calls
The kernel's support for asynchronous I/O is incomplete, and it always has
been. While certain types of operations (direct filesystem I/O, for
example) work well in an asynchronous mode, many others do not. Often
implementing asynchronous operation is hard, and nobody has ever gotten
around to making it work. In other cases, patches have been around for
some time, but they have not made it into the mainline; AIO patches can be
fairly intrusive and hard to merge. Regardless of the reason, things tend
to move very slowly in the AIO area.
Zach Brown has decided to stir things up by asking a basic question: could
it be that the way the kernel implements AIO is all wrong? The current
approach adds a fair amount of complexity, requiring explicit AIO handling
in every subsystem which supports it. IOCB structures have to be passed
around, and kernel code must always check whether it is supposed to block
on a given operation or return one of two "it's in the works" codes. It
would be much nicer if most kernel operations could simply be invoked
asynchronously without having to clutter them up with explicit support.
To that end, Zach has posted a
preliminary patch set which simplifies asynchronous I/O support
considerably, but doesn't stop there: it also makes any system call
invokable in an asynchronous mode. The key is a new type of in-kernel
lightweight thread known as a "fibril."
A fibril is an execution thread which only runs in kernel space. A process
can have any number of fibrils active, but only one of them can actually
execute in the processor(s) at any given time. Fibrils have their own
stack, but otherwise they share all of the resources of their parent
process. They are kept in a linked list attached to the task structure.
When a process makes an asynchronous system call, the kernel creates a new
fibril and executes the call in that context. If the system call completes
immediately, the fibril is destroyed and the result goes back to the
calling process in the usual way. Should the fibril block, however, it
gets queued and control returns to the submitting code, which can then
return the "it's in progress" status code. The "main" process can then run
in user space, submit more asynchronous operations, or do just about
anything else.
Sooner or later, the operation upon which the fibril blocked will
complete. The wait queue entry structure has been extended to include
information on which fibril was blocked; the wakeup code will find that
fibril and make it runnable by adding it to a special "run queue" linked
list in the parent task structure. The kernel will then schedule the
fibril for execution, perhaps displacing the "main" process. That fibril
might make some progress and block
again, or it may complete its work. In the latter case, the final exit
code is saved and the fibril is destroyed.
By moving asynchronous operations into a separate thread, Zach's patch
simplifies their implementation considerably - with few exceptions, kernel
code need not be changed at all to support asynchronous calls. The
creation of fibrils is intended to make it all happen quickly - fibrils are
intended to be less costly than kernel threads or ordinary processes. Their
one-at-a-time semantics help to minimize the concurrency issues which might
otherwise come up.
The user-space interface starts with a structure like this:
    struct asys_input {
        int syscall_nr;
        unsigned long cookie;
        unsigned long nr_args;
        unsigned long *args;
    };
The application is expected to put the desired system call number in
syscall_nr; the arguments to that system call are described by
args and nr_args. The cookie value will be
given back to the process when the operation completes. User space can
create an array of these structures and pass them to:
    long asys_submit(struct asys_input *requests, unsigned long nr_requests);
The kernel will then start each of the requests in a fibril and return to
user space. When the process develops an interest in the outcome of its
requests, it uses this interface:
    struct asys_completion {
        long return_code;
        unsigned long cookie;
    };

    long asys_await_completion(struct asys_completion *comp);
A call to asys_await_completion() will block until at least one
asynchronous operation has completed, then return the result in the
structure pointed to by comp. The cookie value given at
submission time is returned as well.
Your editor notes that the current asys_await_completion()
implementation does not check to see if any asynchronous operations are
outstanding; if none are, the call is liable to wait for a long time.
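Since these system calls exist only in Zach's patch set, the interface cannot be tried on a stock kernel. The following user-space mock may still help to show how the pieces fit together. The two structures are copied from above; everything else is an assumption made for this sketch. In particular, the mock runs each request synchronously through the C library's syscall() wrapper (so failures show up as -1 rather than as a kernel error code), assumes a return value of zero for success, limits itself to six arguments and sixteen queued completions, and returns -1 instead of blocking when nothing is outstanding.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>

struct asys_input {
    int syscall_nr;
    unsigned long cookie;
    unsigned long nr_args;
    unsigned long *args;
};

struct asys_completion {
    long return_code;
    unsigned long cookie;
};

#define MOCK_MAX 16     /* assumed completion queue size */

static struct asys_completion done[MOCK_MAX];
static unsigned long done_head, done_tail;

/* Mock: run each request immediately and queue its completion. The
 * real call would start a fibril for each request instead. */
static long asys_submit(struct asys_input *requests, unsigned long nr_requests)
{
    unsigned long i, j;

    for (i = 0; i < nr_requests; i++) {
        struct asys_input *r = &requests[i];
        unsigned long a[6] = { 0 };
        struct asys_completion *c;

        if (r->nr_args > 6 || done_tail - done_head == MOCK_MAX)
            return -1;
        for (j = 0; j < r->nr_args; j++)
            a[j] = r->args[j];

        c = &done[done_tail++ % MOCK_MAX];
        c->return_code = syscall(r->syscall_nr,
                                 a[0], a[1], a[2], a[3], a[4], a[5]);
        c->cookie = r->cookie;
    }
    return 0;
}

/* Mock: hand back the oldest completion; the real call would block. */
static long asys_await_completion(struct asys_completion *comp)
{
    if (done_head == done_tail)
        return -1;
    *comp = done[done_head++ % MOCK_MAX];
    return 0;
}
```

An application would fill in one asys_input per operation, using the cookie (an index or a pointer, perhaps) to match each completion to the request which produced it, since completions need not arrive in submission order.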
There are a number of other issues with the patch set, all acknowledged by
their author. For example, little thought has been given to how fibrils
should respond to signals. Zach's purpose was not to present a completed
work; instead, he wants to get the idea out there and see what people think
of it.
Linus likes the idea:
Yee-haa! [...]
I heartily approve, although I only gave the actual patches a very cursory
glance. I think the approach is the proper one, but the devil is in the
details. It might be that the stack allocation overhead or some other
subtle fundamental problem ends up making this impractical in the end, but
I would _really_ like for this to basically go in.
There are a lot of details - Linus noted that there is no limit on how many
fibrils a process can create, for example - but this seems to be the way that he would
like to see AIO implemented. He suggests that fibrils might be useful in
the kevent code as well.
On the other hand, Ingo Molnar is opposed
to the fibril approach; his argument is long but worth reading. In Ingo's
view, there are only two solutions to any operating system problem which
are of interest: (1) the one which is easiest to program with, and
(2) the one that performs the best. In the I/O space, he claims, the
easiest approach is synchronous I/O calls and user-space processes. The
fastest approach will be "a pure, minimal state machine" optimized for the
specific task; his Tux web server is given as an example.
According to Ingo, the fibril approach serves neither goal:
Now where do all these LWP, fibre, firbril, micro-thread or N:M
concepts fit? Most of the time they are just a /weakening/ of the
#1 concept. And that's why they will lose out, because #1 is all
about programmability and they don't offer anything new: because
they cannot. Either you go for programmability or you go for
performance. There is /no/ middle ground for us in the kernel!
Ingo makes the claim that Linux is sufficiently fast at switching between
ordinary processes that the advantages offered by fibrils are minimal at
best, and not worth their cost. Anybody wanting performance will still
have to face the full kernel AIO state machine. So, he says, there is no
real advantage to fibrils at this time that is worth the cost of
complicating the scheduler and moving away from the 1:1 thread model.
These patches are in an early stage, and this story will clearly take some
time to play out. Even if a consensus develops in favor of the fibril
idea, the process of turning them into a proper, robust kernel feature
could make them too expensive to be worthwhile. But it's an interesting
idea which brings a much-needed fresh look at how the kernel does AIO; it's
hard to complain too much about that.
Page editor: Jonathan Corbet