Kernel release status
The current stable 2.6 release remains 2.6.11.7; it dates back to
April 7. A set of patches has been proposed for the .8 release, but
there is some debate over a couple of them.
The current 2.6 prepatch remains 2.6.12-rc3; Linus has released no
prepatches over the last week. About 100 patches have found their way into
his git repository, however; they include a tg3 driver update, a "simple
action" capability for the packet scheduler, and various fixes.
There has not been a -mm release since 2.6.12-rc2-mm3 on April 11. Andrew is
still getting caught up from his travels and the SCM changes.
The current 2.4 prepatch is 2.4.31-pre1, released by Marcelo on April 25. It
consists of a very small set of patches, most of which are x86-64 fixes.
Kernel development news
Andrew Morton at linux.conf.au
The Friday morning linux.conf.au keynote was delivered by Australian
expatriate Andrew Morton; his wide-ranging talk touched on many aspects of
the kernel development process.
Andrew has brought a different approach to kernel development, and it
showed early in the talk. He noted that Linus has often characterized his
job as rejecting patches, rather than accepting them. Andrew
disagrees with that approach. If somebody has gone to the trouble to put
together a patch, even a really poorly-done one, there was probably some
sort of underlying need which motivated that work. A patch identifies a
problem, at least for some users; you can't just reject it or the kernel as
a whole will lose out. So Andrew sees his role as helping to get patches
into the kernel, rather than taking pride in rejecting them.
According to Andrew, anybody who goes to the trouble of submitting a patch
deserves a response.
If the patch is not merged, the developer is entitled to an explanation of
why.
He does not want to have to understand all of those patches himself,
however. It's up to the subsystem maintainers to evaluate patches and,
eventually, merge them. Andrew's job is to get the maintainers
involved. Techniques he can employ include the "troll merge," simply
adding the patch to -mm to force the maintainer to react. Asking "dumb
questions" on the mailing lists can also help. One way or another, Andrew
works to get a response from the relevant maintainers.
Andrew's goal is to bring more professionalism to the kernel development
process. He believes that is happening; among other things, he notes,
patch traffic now slows down significantly on weekends - that was not
always the case. He'd like to settle down the process, and, eventually,
hand off pieces of it to others. One such piece, most likely, would be bug
tracking. He cautioned, however, that these kernel maintenance tasks are
not part-time jobs.
The new development model was revisited; much of what was said will be
familiar to LWN readers. He noted that the older process failed one of the
kernel's most important customers: the distributors. By getting features
merged, tested, and ready for deployment quickly, the new process serves
the distributors better. There has, perhaps, been some cost to another set
of customers: those who run the mainline kernel on their systems. Andrew
will be working hard to increase the stability of the mainline releases to
make life easier for that group of users.
Meanwhile, he notes, the developers are shoveling about 10MB of patches
into the kernel every month.
The stable 2.6 series (currently at 2.6.11.7) is, according to Andrew, not
sure to succeed. He believes that it does not get enough developer
attention, and that the bar for patches has been set too high. And it does
not address the real problem: that mainline releases have regressions that
cause breakage for some users. Really fixing the problems, he says,
requires getting the developers to be more careful and more focused on fixing
known bugs. He says the process might yet move to an even/odd release
scheme, where even-numbered releases (2.6.14, say) would be limited to bug
fixes.
On testing: Andrew notes that, while the development process is highly
dependent on a large community of testers, it has no real way of rewarding
them for their work. He will look into acknowledging testers in the kernel
changelogs; if you helped to find a bug, your name can appear alongside
that of the developer who fixed it.
On the BitKeeper front, Andrew stated that he was never entirely happy with
the decision to use that tool. It imposed an opportunity cost: had the
kernel hackers gone off three years ago to build the source code management
system they really needed, they would have something quite nice by now. He
noted that version control appears to be one of those problems which drives
developers crazy, and that's a problem. If you depend on a tool with
insane developers, things will "end in tears." Now he's keeping his head
down and waiting to see how the whole thing settles out.
Finally, he noted that many developers who think they need a source code
management system really don't. If your real purpose is to keep a set of
patches in sync with an evolving mainline kernel - which is the case for
many developers - then a tool like quilt makes more sense.
Supporting RDMA on Linux
RDMA (remote direct memory access) is an attempt to extend the DMA
mechanism to a networked environment. Using RDMA, an application can
quickly transfer the contents of a memory buffer to a buffer on a remote
system. On high-speed, local-area networks, RDMA transfers are intended to
be significantly faster than transfers done with the regular socket
interface. Not everybody likes the RDMA way of doing things, but it exists
regardless, and some users expect to see it supported by Linux.
Implementations exist for InfiniBand and a number of high-speed Ethernet
adaptors.
Since the goals of RDMA include speed and low CPU overhead, implementations
attempt to bypass as much kernel processing as possible. Typically, they
simply pass the address of a user-space buffer directly to the hardware,
and expect that hardware to do the rest. Drivers which need to make
user-space memory available to their hardware will call
get_user_pages(), which achieves two useful things: it pins the
pages into physical memory, and fills in an array of page structure
pointers from which the driver can obtain the physical addresses it
needs. The current RDMA implementations use this approach,
but they have run into a problem: get_user_pages() was never
designed for the usage patterns seen with RDMA.
The typical driver which calls get_user_pages() keeps the pages
pinned for a very short period of time. Often, the pages will be released
before the driver returns to user space. Sometimes, usually when
asynchronous I/O is used, the release of the pages will be delayed for a
short period, but only as long as it takes the I/O operation to complete.
The problem is that RDMA operations do not "complete" in this manner. An
RDMA user can reasonably set up a buffer, pass a descriptor to a remote system, and
expect data to show up in the buffer sometime next week. The whole idea is
to do the relatively expensive buffer setup once, then be able to transfer
the (changing) contents of that buffer an arbitrary number of times. So
pages pinned by the driver can remain pinned for a very long time.
Several problems come up in this scenario. get_user_pages() does
not do any sort of privilege checking or resource accounting for the pages
it pins; it's supposed to be a short-term operation. So a hostile
application could use an RDMA interface to lock down large amounts of
memory indefinitely, effectively shutting down the system. There is no
mechanism for notifying the driver if the process owning the pages exits,
so cleanup can be a problem. There are also interactions with the virtual
memory system to worry about: if the process forks (causing its data pages
to be marked copy-on-write) and writes to a pinned page, it will get a new
copy of that page and will become disconnected from its pinned buffer.
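The copy-on-write split can be seen from user space without any RDMA hardware at all. What follows is a minimal sketch, not driver code: the function name value_seen_by_child() is made up for this illustration, and the child process merely stands in for whoever still references the original physical page (the RDMA hardware, in the scenario above). With a private mapping, a write after fork() leaves the writer with a fresh copy while the other side keeps seeing the old contents.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Map one anonymous page with the given sharing flag (MAP_PRIVATE or
 * MAP_SHARED), fork, let the parent overwrite the first byte, and return
 * the value the child then reads at the same virtual address.
 * Returns -1 if the setup fails.
 */
int value_seen_by_child(int sharing_flag)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *buf;
    int go[2], status = 0, result = -1;
    char token = 'x';
    pid_t pid;

    assert(sharing_flag == MAP_PRIVATE || sharing_flag == MAP_SHARED);

    buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
               sharing_flag | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    if (pipe(go) != 0) {
        munmap(buf, page);
        return -1;
    }

    buf[0] = 1;                     /* contents at fork time */
    pid = fork();
    if (pid == 0) {
        /* Child: wait for the parent's write, then report what we see. */
        close(go[1]);
        if (read(go[0], &token, 1) != 1)
            _exit(255);
        _exit(buf[0]);
    }

    close(go[0]);
    if (pid > 0) {
        buf[0] = 2;                 /* MAP_PRIVATE: parent gets its own copy */
        if (write(go[1], &token, 1) != 1) {
            /* Nothing to do: closing the pipe below makes the child exit. */
        }
        close(go[1]);
        if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
            result = WEXITSTATUS(status);
    } else {
        close(go[1]);               /* fork() failed */
    }
    munmap(buf, page);
    return result;
}
```

Called with MAP_PRIVATE, the child still reads the pre-fork value; called with MAP_SHARED, it sees the parent's write. The private case is the one that disconnects a process from a buffer the hardware is still using.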
Various approaches to solving these problems have been discussed. The
resource accounting issues can be partially solved by requiring the process
to lock the pages itself (using mlock()) before setting them up
for RDMA; that will bring the normal kernel resource limits into play.
There are still potential problems if the process is allowed to unlock the
pages while the RDMA buffer still exists, however, so some changes would
have to be made to prevent that case. Current implementations have dealt
with the process exit issue by setting up a char device as the control
interface for the RDMA buffer; when the device is closed, all RDMA
structures are torn down. The copy-on-write problem can be addressed by
forcing RDMA buffers to be in their own virtual memory area (VMA) and
setting the VM_DONTCOPY flag on that VMA, preventing the pages
from being made available to any child processes. This approach would
require that RDMA buffers occupy whole pages by themselves.
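From the application's side, the mlock() requirement might look something like the following. This is a sketch under stated assumptions rather than any existing RDMA library's interface: struct locked_buf and the locked_buf_*() helpers are invented for the example. The buffer comes from mmap(), so it occupies whole pages by itself, and the mlock() call is what brings the normal locked-memory resource limit into play.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical descriptor for a buffer prepared for RDMA registration. */
struct locked_buf {
    void  *addr;    /* page-aligned start of the buffer, NULL if unused */
    size_t len;     /* length, rounded up to whole pages */
    int    locked;  /* nonzero if mlock() succeeded */
};

/* Round len up to a multiple of the system page size. */
size_t round_up_to_pages(size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    return (len + page - 1) / page * page;
}

/*
 * Allocate a buffer made of whole pages and try to lock it in memory.
 * Returns 0 if the mapping was created, -1 otherwise; the caller must
 * check b->locked before handing the buffer to an RDMA interface,
 * since mlock() fails once the process hits its locked-memory limit.
 */
int locked_buf_alloc(struct locked_buf *b, size_t len)
{
    assert(b != NULL && len > 0);

    b->len = round_up_to_pages(len);
    b->locked = 0;
    b->addr = mmap(NULL, b->len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (b->addr == MAP_FAILED) {
        b->addr = NULL;
        return -1;
    }
    if (mlock(b->addr, b->len) == 0)
        b->locked = 1;
    return 0;
}

/* Unlock (if needed) and unmap the buffer. */
void locked_buf_free(struct locked_buf *b)
{
    if (b->addr != NULL) {
        if (b->locked)
            munlock(b->addr, b->len);
        munmap(b->addr, b->len);
        b->addr = NULL;
        b->locked = 0;
    }
}
```

Nothing here stops the process from calling munlock() while the hardware still holds the buffer; that is the gap which, as noted above, would need kernel changes to close.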
Then there are little issues like what happens when the process creates
overlapping RDMA buffers.
All of this can clearly be patched together, but the result is inelegant
at best, and it grows more complicated with each fix.
So an entirely different approach has been
proposed by David Addison. This technique does away with the need to pin
RDMA buffers entirely, but would, instead, require network drivers to
become rather more aware of how the virtual memory subsystem works.
David's patch assumes that the network interface device contains a simple
memory management unit of its own, and can deal with its own paging
details. This assumption turns out to be true for a number of contemporary
high-speed cards. These cards can translate addresses and properly ask for
help if they need to access a page which is not currently resident in
memory. Thus, when using this sort of card, RDMA buffers can be set up
without the need to pin them in memory; the hardware will cause them to be
faulted in when the time comes.
Needless to say, the hardware will need a considerable amount of help in
this process; it cannot be expected to work with the host system's page
tables, cause page faults to happen on its own, etc. So the card's MMU
must be loaded with a minimal set of page mappings which describe the RDMA
buffer(s), and those mappings must be kept in sync as things change on the
system. With that in place, the card can perform DMA to resident pages,
and ask the driver for help with the rest.
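The division of labor can be modeled in a few lines of ordinary C. This is a toy model only: the card_mmu structure, its eight-entry table, and the function names are all invented here and do not describe any real adapter. It does show the essential behavior, though: a translation either hits a loaded mapping and DMA proceeds, or it misses and the card must ask the driver for help.

```c
#include <assert.h>
#include <stddef.h>

#define CARD_SLOTS        8     /* size of the toy translation table */
#define MODEL_PAGE_SHIFT  12    /* assume 4KB pages in this model */
#define MODEL_PAGE_MASK   ((1UL << MODEL_PAGE_SHIFT) - 1)

struct card_mapping {
    unsigned long vpage;    /* virtual page number in the process */
    unsigned long pframe;   /* physical frame number backing it */
    int valid;
};

struct card_mmu {
    struct card_mapping slot[CARD_SLOTS];
};

enum card_result { CARD_DMA_OK, CARD_NEEDS_DRIVER };

/* Driver side: load or refresh one mapping. Returns -1 if the table is full. */
int card_mmu_load(struct card_mmu *mmu, unsigned long vaddr,
                  unsigned long pframe)
{
    unsigned long vpage = vaddr >> MODEL_PAGE_SHIFT;
    struct card_mapping *free_slot = NULL;
    size_t i;

    assert(mmu != NULL);
    for (i = 0; i < CARD_SLOTS; i++) {
        struct card_mapping *m = &mmu->slot[i];

        if (m->valid && m->vpage == vpage) {
            m->pframe = pframe;     /* refresh an existing entry */
            return 0;
        }
        if (!m->valid && free_slot == NULL)
            free_slot = m;
    }
    if (free_slot == NULL)
        return -1;
    free_slot->vpage = vpage;
    free_slot->pframe = pframe;
    free_slot->valid = 1;
    return 0;
}

/* Driver side: drop the mapping for a page that is no longer resident. */
void card_mmu_drop(struct card_mmu *mmu, unsigned long vaddr)
{
    unsigned long vpage = vaddr >> MODEL_PAGE_SHIFT;
    size_t i;

    for (i = 0; i < CARD_SLOTS; i++)
        if (mmu->slot[i].valid && mmu->slot[i].vpage == vpage)
            mmu->slot[i].valid = 0;
}

/* Card side: translate an address, or report that the driver must step in. */
enum card_result card_translate(const struct card_mmu *mmu,
                                unsigned long vaddr, unsigned long *paddr)
{
    unsigned long vpage = vaddr >> MODEL_PAGE_SHIFT;
    size_t i;

    for (i = 0; i < CARD_SLOTS; i++) {
        const struct card_mapping *m = &mmu->slot[i];

        if (m->valid && m->vpage == vpage) {
            *paddr = (m->pframe << MODEL_PAGE_SHIFT) |
                     (vaddr & MODEL_PAGE_MASK);
            return CARD_DMA_OK;
        }
    }
    return CARD_NEEDS_DRIVER;
}
```

The hard part, which this model leaves out, is knowing when to call the load and drop functions; that is the information the kernel has to supply.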
The device driver can load the initial page tables, but it will need help
from the kernel to know when the host system's page tables change. To that
end, David's patch defines a structure with a new set of hooks into the
virtual memory subsystem:
    typedef struct ioproc_ops {
        struct ioproc_ops *next;
        void *arg;

        void (*release)(void *arg, struct mm_struct *mm);
        void (*sync_range)(void *arg, struct vm_area_struct *vma,
                           unsigned long start, unsigned long end);
        void (*invalidate_range)(void *arg, struct vm_area_struct *vma,
                                 unsigned long start, unsigned long end);
        void (*update_range)(void *arg, struct vm_area_struct *vma,
                             unsigned long start, unsigned long end);
        void (*change_protection)(void *arg, struct vm_area_struct *vma,
                                  unsigned long start, unsigned long end,
                                  pgprot_t newprot);
        void (*sync_page)(void *arg, struct vm_area_struct *vma,
                          unsigned long address);
        void (*invalidate_page)(void *arg, struct vm_area_struct *vma,
                                unsigned long address);
        void (*update_page)(void *arg, struct vm_area_struct *vma,
                            unsigned long address);
    } ioproc_ops_t;
An interested driver can fill in one of these structures with its methods,
then attach it to a given process's mm_struct structure with a
call to ioproc_register_ops(). Thereafter, calls to those
functions will be made whenever things change.
The release() method will be called when the process exits; it
allows the driver to perform a full cleanup. The sync_range() and
sync_page() methods indicate that the given page(s) have been
flushed to disk; this tells the driver that, should the interface modify
those pages, they must be marked dirty again. invalidate_range()
and invalidate_page() inform the driver that the given page(s) are
no longer valid - they have been swapped out or unmapped. Calls to
update_range() and update_page() happen when a valid page
table entry is written, as when a page is brought in or mapped. The
change_protection() function is called when page protections are
changed.
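Since the patch comes without a user of these hooks, here is a guess at the general shape of one, written as a user-space model rather than kernel code. Everything prefixed model_ or demo_ is hypothetical: the structures are cut-down stand-ins for mm_struct and ioproc_ops, the VMA argument is dropped, model_register_ops() only imitates what ioproc_register_ops() presumably does, and the "driver" simply counts how much address space its card may currently touch.

```c
#include <assert.h>
#include <stddef.h>

struct model_mm;                        /* stand-in for struct mm_struct */

/* Cut-down stand-in for struct ioproc_ops: three of the eight hooks. */
struct model_ioproc_ops {
    struct model_ioproc_ops *next;
    void *arg;
    void (*release)(void *arg, struct model_mm *mm);
    void (*invalidate_range)(void *arg, unsigned long start,
                             unsigned long end);
    void (*update_range)(void *arg, unsigned long start, unsigned long end);
};

struct model_mm {
    struct model_ioproc_ops *ioproc_ops;    /* registered hook sets */
};

/* Imitation of registration: push the hook set onto the per-mm list. */
void model_register_ops(struct model_mm *mm, struct model_ioproc_ops *ops)
{
    assert(mm != NULL && ops != NULL);
    ops->next = mm->ioproc_ops;
    mm->ioproc_ops = ops;
}

/* What the VM side would do after writing valid entries for [start, end). */
void model_vm_update(struct model_mm *mm, unsigned long start,
                     unsigned long end)
{
    struct model_ioproc_ops *ops;

    for (ops = mm->ioproc_ops; ops != NULL; ops = ops->next)
        if (ops->update_range)
            ops->update_range(ops->arg, start, end);
}

/* What the VM side would do after unmapping or swapping out [start, end). */
void model_vm_invalidate(struct model_mm *mm, unsigned long start,
                         unsigned long end)
{
    struct model_ioproc_ops *ops;

    for (ops = mm->ioproc_ops; ops != NULL; ops = ops->next)
        if (ops->invalidate_range)
            ops->invalidate_range(ops->arg, start, end);
}

/* Process exit: give every registered driver a chance to clean up. */
void model_vm_exit(struct model_mm *mm)
{
    struct model_ioproc_ops *ops;

    for (ops = mm->ioproc_ops; ops != NULL; ops = ops->next)
        if (ops->release)
            ops->release(ops->arg, mm);
    mm->ioproc_ops = NULL;
}

/* Toy driver state: bytes of address space the card may currently touch. */
struct demo_driver {
    unsigned long mapped_bytes;
    int released;
};

static void demo_update(void *arg, unsigned long start, unsigned long end)
{
    struct demo_driver *d = arg;

    d->mapped_bytes += end - start;
}

static void demo_invalidate(void *arg, unsigned long start, unsigned long end)
{
    struct demo_driver *d = arg;
    unsigned long len = end - start;

    d->mapped_bytes = (len > d->mapped_bytes) ? 0 : d->mapped_bytes - len;
}

static void demo_release(void *arg, struct model_mm *mm)
{
    struct demo_driver *d = arg;

    (void)mm;
    d->mapped_bytes = 0;
    d->released = 1;
}

/* Fill in the hook structure and attach it to the given address space. */
void demo_driver_attach(struct demo_driver *d, struct model_ioproc_ops *ops,
                        struct model_mm *mm)
{
    d->mapped_bytes = 0;
    d->released = 0;
    ops->arg = d;
    ops->release = demo_release;
    ops->invalidate_range = demo_invalidate;
    ops->update_range = demo_update;
    model_register_ops(mm, ops);
}
```

A real driver would reload or clear entries in the card's MMU from these callbacks instead of adjusting a counter, and would need to worry about locking, which this model ignores entirely.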
The patch has already, apparently, been looked over by Andrew Morton and
Andrea Arcangeli, so one might assume that there would not be a great many
show stoppers there. The comments posted so far have had to do mostly with
coding style, though one poster noted that
it might make more sense to attach the hooks to the VMA structure, rather
than the top-level memory management structure. Unfortunately, the patch
does not include any code which actually uses the proposed hooks,
making it harder to see how a driver might employ them.
Meanwhile, conversations
continue on how an interface using page pinning could be made to work. A
real solution may be some time yet in coming.
FUSE and private namespaces
Two weeks ago, we looked at the
opposition to FUSE, or, more specifically, to the strange filesystem
semantics it implements. FUSE overrides the VFS permission checking code
to establish its own set of rules; the intent is to keep users (even root)
from accessing each other's private filesystems. Few people dispute the
goal, but the approach that was used failed to please.
FUSE hacker Miklos Szeredi has tried to address the concerns with a new patch implementing "private mounts."
The patch creates a new mount flag (MNT_PRIVATE); if that flag is
set, then only processes belonging to the owner of the mount can see the
mounted filesystem at all. To all other processes on the system, these
private mounts would be entirely invisible. With this change in place, the
permission checking change is no longer needed.
Unfortunately, nobody likes this idea either. This patch creates a
different set of filesystem semantics; in this case, setuid programs run by
a user who has private mounts will see a different filesystem than any
other process. The filesystem hackers do not wish to see namespaces which
change in surprising ways.
So what is the solution here? Linux does allow for different
processes to have different views of the filesystem ("namespaces"). The
namespace mechanism could be brought into play to hide FUSE mounts. The
problem is that namespaces were never really meant to be shared across the
system. A namespace is a process attribute, like the controlling terminal;
it is inherited by child processes, but there is no mechanism for passing a
namespace to a process which has not inherited it. Users would like to
mount their private filesystems and have them available to all of their
processes on the system, so having those filesystems in a namespace which
is only available to one process tree does not solve the problem.
As it turns out, there is one way to access namespaces outside of the
creating process tree. Jamie Lokier noticed that each process's root directory is
accessible via /proc/pid/root. A new process can be put
into another process's namespace simply by setting its root with
chroot(). If all works as it seems it should, a user-space
solution can be envisioned: write a privileged daemon process which can
create namespaces and, using file descriptor passing, hand them to
interested processes. Those processes can then chroot() into that
namespace. chroot() is a privileged operation, but the code to
handle the user side of this operation could be hidden within a PAM module
and made completely invisible.
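The client side of such a scheme could be quite small. The sketch below assumes the simpler variant in which the process is told a PID and goes through /proc directly; the helper names proc_root_path() and enter_root_of() are invented for the example, and whether a chroot() into that directory really lands the caller in the other namespace is exactly the open question noted above.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Build "/proc/<pid>/root" into buf. Returns 0, or -1 if it does not fit. */
int proc_root_path(pid_t pid, char *buf, size_t size)
{
    int n;

    assert(buf != NULL && pid > 0);
    n = snprintf(buf, size, "/proc/%ld/root", (long)pid);
    return (n > 0 && (size_t)n < size) ? 0 : -1;
}

/*
 * Adopt the root directory of process pid as our own.  chroot() is a
 * privileged operation, so this fails with EPERM for ordinary users;
 * a PAM module or similar helper would have to make the call.
 * Returns 0 on success, -1 on failure with errno set by the failing call.
 */
int enter_root_of(pid_t pid)
{
    char path[64];

    if (proc_root_path(pid, path, sizeof(path)) != 0)
        return -1;
    if (chroot(path) != 0)
        return -1;
    return chdir("/");      /* do not keep a working directory outside */
}
```

In the daemon-based design the path lookup would be replaced by a file descriptor received over a Unix-domain socket, followed by fchdir() and chroot(".").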
All that's left is for somebody to actually code this solution. At that
point, a glitch or two could come up, but they should be easily fixed with
small patches. So there might just be an answer to the FUSE problem after all.
Page editor: Jonathan Corbet