Brief items
The current stable 2.6 release is 2.6.11.9, which was released on May 11.
It contains a fix for the ELF loader vulnerability and a couple of other
fixes as well.
The current 2.6 prepatch is 2.6.12-rc4, announced by Linus on
May 6. Changes this time around include more "sparse" annotations, a
CIFS update, various architecture updates, resource limits for niceness and
realtime scheduling (covered in last week's
Kernel Page), a JFS update, some networking tweaks, and more. See the long-format changelog for the details.
Linus is currently on vacation, so no new patches have been added to his
git repository since -rc4.
The latest -mm release is 2.6.12-rc3-mm3.
Recent changes to -mm include a rework of the huge page code, a bunch of
UML updates, a device mapper update, and more fixes.
Kernel development news
The coding style document packaged with the kernel source contains a number
of clear rules; here's one of them:
Don't put multiple statements on a single line unless you have
something to hide:
if (condition) do_this;
do_something_everytime;
Jesper Juhl recently found some code which evidently had something to hide,
and submitted a patch to break the offending if statements onto two
lines. Andrew Morton rejected it:
There are about 88 squillion of these in the kernel. I think it
would be a mistake for me to start taking such patches, sorry.
In further discussion, however, Andrew seemed to agree that, perhaps,
cleaning up the kernel source to be more generally compliant with the
coding style documentation might be a good thing. He just doesn't want to
cope with hundreds of little patches to that end. He will, however,
consider a small number of very large patches.
So a major coding style cleanup seems likely to happen, perhaps before
2.6.12 comes out. Applying this sort of patch so late in the cycle
should be safe; the intent is to change the formatting, but to make
no actual code changes. Andrew also plans
to drop any changes which do not apply against the -mm tree, in the hopes
of minimizing the effects of the changes on patches maintained by other
developers.
If all goes according to this plan, the final 2.6.12 patch could be large
indeed.
Markus Klotzbuecher recently announced the release of mini_fo 0.6.0.
Mini_fo provides what other systems have called a "translucent" or
"copy on write" filesystem. A read-only base
filesystem (possibly from a remote system or CD-ROM) can be made to appear,
via mini_fo, as a local, writable filesystem. This functionality is useful
for sharing filesystems with local overrides, live CD systems, sandboxing
applications, and more.
At its core, mini_fo performs a simple fan-out operation. Each inode,
dentry, and file structure associated with a mini_fo filesystem contains
(via its private data) pointers to two other structures of the same type.
One of them refers to the file or directory on the base filesystem; the
other, instead, is for a local version of the file or directory on a local
"storage filesystem." Both are hidden from user space, which thinks it is
dealing directly with a file stored in the mini_fo filesystem.
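The fan-out arrangement can be pictured with a small user-space model. This is only a sketch: struct fs_object, struct fanout, and fanout_target() are hypothetical names invented for illustration, not mini_fo's actual data structures, and the rule of preferring the storage object when one exists is an assumption drawn from the description here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a filesystem object (inode, dentry, or file). */
struct fs_object {
    const char *name;
    void *private_data;         /* filesystem-specific data */
};

/* Hypothetical fan-out record hung off a mini_fo object's private data. */
struct fanout {
    struct fs_object *base;     /* object on the read-only base filesystem */
    struct fs_object *storage;  /* local object on the storage filesystem, or NULL */
};

/* Pick the hidden object that an operation should be forwarded to. */
struct fs_object *fanout_target(const struct fs_object *obj)
{
    const struct fanout *f = obj->private_data;

    assert(f != NULL);
    /* Assumption: a local version, once it exists, takes precedence. */
    return f->storage ? f->storage : f->base;
}
```

User space only ever holds the top-level object; the two lower objects are reachable solely through the private data.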
When a mini_fo filesystem is first created, it appears as an exact copy of
the underlying base filesystem. Any operation which reads files or
directories is simply passed through to the base filesystem, with almost no
additional overhead. In this mode, mini_fo functions as a sort of loopback
filesystem.
Things change, however, when a file is opened for writing. In this case,
mini_fo will create a copy of the file on the storage filesystem, with all
of the data copied over. Any subsequent operations on that file will use
the locally-stored version rather than the base version. So any changes
made will appear locally, but they will not be propagated back to the
base. Changes will be persistent across mounts as long as the storage
directory used by mini_fo is not modified by anything except mini_fo.
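The copy-on-open-for-write behavior might look like the following in miniature. This is a sketch under stated assumptions: files are modeled as in-memory strings, and struct toy_file, toy_read(), and toy_open_for_write() are hypothetical names, not anything found in mini_fo itself.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of one file as seen through the overlay. */
struct toy_file {
    const char *base_data;  /* contents on the read-only base filesystem */
    char *local_data;       /* private copy on the storage filesystem, or NULL */
};

/* Readers see the local copy if there is one, the base contents otherwise. */
const char *toy_read(const struct toy_file *f)
{
    assert(f != NULL && f->base_data != NULL);
    return f->local_data ? f->local_data : f->base_data;
}

/*
 * Open for writing: on first use, copy the base contents to local storage.
 * Returns the writable copy, or NULL if the copy could not be allocated.
 */
char *toy_open_for_write(struct toy_file *f)
{
    assert(f != NULL && f->base_data != NULL);
    if (!f->local_data) {
        size_t len = strlen(f->base_data) + 1;

        f->local_data = malloc(len);
        if (!f->local_data)
            return NULL;
        memcpy(f->local_data, f->base_data, len);
    }
    return f->local_data;
}
```

Writes go only to the buffer returned by toy_open_for_write(), so the base contents are never touched.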
Modified files are not the full story, of course; mini_fo must also cope
with operations like deletes and renames. To that end, it maintains a set
of lists of files which it knows about locally; there is one list for
modified files, one for deleted files, one for files created locally, etc.
These lists are stored in-kernel as standard linked lists. They are also
written to the storage filesystem in a magic file (named
META_dAfFgHE39ktF3HD2sr, for what it's worth) and reloaded from
that file when the filesystem is mounted.
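As an illustration of how a list of deleted files could hide entries that still exist on the base filesystem, here is a user-space sketch. The kernel code would use the standard list primitives; this model uses a plain singly-linked list, and every name in it (struct name_entry, name_list_add(), base_file_visible()) is hypothetical. How mini_fo actually consults its lists during lookup is an assumption here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical list node recording one file name. */
struct name_entry {
    struct name_entry *next;
    char name[64];
};

/* Prepend a name to the list; returns 0 on success, -1 on failure. */
int name_list_add(struct name_entry **head, const char *name)
{
    struct name_entry *e;

    assert(head != NULL && name != NULL);
    if (strlen(name) >= sizeof(e->name))
        return -1;              /* name too long for this toy model */
    e = malloc(sizeof(*e));
    if (!e)
        return -1;
    strcpy(e->name, name);
    e->next = *head;
    *head = e;
    return 0;
}

/* Returns 1 if the name is on the list, 0 otherwise. */
int name_list_contains(const struct name_entry *head, const char *name)
{
    for (; head; head = head->next)
        if (strcmp(head->name, name) == 0)
            return 1;
    return 0;
}

/* Assumption: a base file stays visible unless it is recorded as deleted. */
int base_file_visible(const struct name_entry *deleted, const char *name)
{
    return !name_list_contains(deleted, name);
}
```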
This release of mini_fo works with both the 2.4 and 2.6 kernels. Its
author claims that it is intended for use with embedded systems, and thus
has a small memory footprint. See the mini_fo web
page for more information.
When a new process is created with the
clone() system call, a set
of flags is provided which tells the kernel which resources, if any, should
be shared between that process and its parent. Potentially shareable
resources include virtual memory, open files, signal handlers, and more.
New processes also share, by default, the filesystem namespace seen by
their parent (and, usually, by the system as a whole).
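The effect of a sharing flag can be seen with a short experiment using the glibc clone() wrapper. This is a minimal sketch, assuming Linux with glibc; counter_after_child() and child_fn() are names made up for the example. The child writes to a global variable, and the parent sees the write only if CLONE_VM was passed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

static volatile int counter;    /* written by the child, read by the parent */

static int child_fn(void *arg)
{
    (void)arg;
    counter = 42;
    return 0;
}

/*
 * Run child_fn() in a process created with the given sharing flags.
 * Returns the parent's view of counter afterward, or -1 on failure.
 */
int counter_after_child(int flags)
{
    const size_t stack_size = 64 * 1024;
    char *stack;
    pid_t pid;
    int status, result;

    /* The low byte of the flags is reserved for the exit signal. */
    assert((flags & 0xff) == 0);
    stack = malloc(stack_size);
    if (!stack)
        return -1;
    counter = 0;
    /* The stack grows downward on most architectures, so pass its top. */
    pid = clone(child_fn, stack + stack_size, flags | SIGCHLD, NULL);
    if (pid == -1) {
        free(stack);
        return -1;
    }
    if (waitpid(pid, &status, 0) == -1 ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        result = -1;
    else
        result = counter;
    free(stack);
    return result;
}
```

Calling counter_after_child(CLONE_VM) lets the parent observe the child's write, while counter_after_child(0) behaves like fork() and leaves the parent's copy untouched.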
In the current Linux kernel, the sharing decisions made at clone()
time last for the lifetime of the processes involved. There is not usually
a reason to change resource sharing, but recent discussions on supporting
private mounts (with the filesystems in user space patch, or otherwise)
have suggested that it would actually be useful for a process to be able to
"unshare" resources after its creation. In particular, if a process could
detach itself from the global filesystem namespace and create its own, it
would be possible to set up that new namespace with whatever private mounts
that process needs. If this functionality were
used within a PAM module, it would be relatively easy for administrators to
set up per-user views of the filesystem, complete with private mounts.
To that end, Janak Desai has posted a patch
adding a new unshare() system call. The interface is simple
enough:
long unshare(unsigned long flags);
The flags argument can be CLONE_NEWNS (to create a new
filesystem namespace), CLONE_VM (to establish a private virtual
address space) or CLONE_SIGHAND (to unshare signal handlers). If
all goes well, when the call returns, the designated resource(s) will now
be private to the calling process; otherwise the situation is unchanged.
This patch has not yet made it to the linux-kernel mailing list, and may
see some changes before it is considered for inclusion.
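A caller of such an interface might look like the sketch below. It assumes a system whose C library already provides an unshare() wrapper in <sched.h>; glibc declares that wrapper as int unshare(int flags), which differs from the prototype in the patch described here. The function name detach_from_namespace() is hypothetical, and creating a new filesystem namespace normally requires privilege, so a failure return should be expected for ordinary users.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>

/*
 * Try to give the calling process its own filesystem namespace.
 * Returns 0 on success or a negative errno value; on failure the
 * process keeps sharing the namespace it had before the call.
 */
int detach_from_namespace(void)
{
    if (unshare(CLONE_NEWNS) == -1)
        return -errno;
    return 0;
}
```

After a successful call, mounts performed by the process would be visible only within its own namespace, which is the property a PAM module would rely upon.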
Execute-in-place (XIP) support for the Linux kernel has been on the
embedded systems wishlist for some time. Such systems usually have the
kernel and relevant application images stored in a directly-accessible ROM
or flash memory. This memory generally contains a filesystem, and is
treated as a disk drive. This mechanism works, but it can be inefficient:
running a program from this memory requires that said program first be
copied into (usually scarce) RAM. It would be much better if this code
could be executed directly out of the flash-based memory.
Carsten Otte (of IBM) has posted a set of
patches adding XIP support to the 2.6 kernel. These patches, in
addition, enable fast memory-to-memory block I/O for such devices, bypassing
the page cache and most of the block layer. As a result, the XIP
patches are useful in other situations as well; Carsten notes, for example,
that they work for shared-memory block devices used to communicate between
(virtual) systems.
The first step is to add support at the block driver level. To that end, a
new method is added to the block_device_operations structure:
int (*direct_access) (struct inode *inode, sector_t sector,
unsigned long *data);
This method, if implemented, should come up with a kernel virtual address
corresponding to the given sector on the block device represented
by inode. That address, which must remain valid until the device is
closed, is returned in *data. The return value is zero on
success or a negative error code in case of problems.
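For a device that is simply a region of directly addressable memory, the calculation behind such a method is short. The following is a user-space model only, not driver code: struct mem_disk stands in for the device that the real method would find through its inode argument, sector_t is redefined locally, and the 512-byte sector size and the choice of -EINVAL for an out-of-range sector are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512             /* assumed sector size */
typedef unsigned long sector_t;     /* local stand-in for the kernel type */

/* Hypothetical memory-backed device. */
struct mem_disk {
    unsigned char *base;            /* start of the directly addressable memory */
    sector_t nr_sectors;            /* device size in sectors */
};

/*
 * Same contract as direct_access(): on success, store the address of the
 * sector's data in *data and return 0; otherwise return a negative error.
 */
int mem_disk_direct_access(struct mem_disk *disk, sector_t sector,
                           unsigned long *data)
{
    assert(disk != NULL && data != NULL);
    if (sector >= disk->nr_sectors)
        return -EINVAL;             /* error code chosen arbitrarily */
    *data = (unsigned long)(uintptr_t)(disk->base + sector * SECTOR_SIZE);
    return 0;
}
```

No data is copied; the caller receives an address inside the device's own memory.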
The next step is a new method in the address_space_operations
structure:
struct page *(*get_xip_page)(struct address_space *space,
sector_t blockno, int create);
This method's job is to translate a specific block number within a
filesystem to a page structure pointing to its directly-mapped
data. It is a filesystem-specific function which will translate
blockno to a sector number on the underlying device, then use that
device's direct_access() method to get an address. Carsten has
posted an implementation for ext2 which
shows how this method can be put together.
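The two-stage translation (file block to device sector, then sector to address) can be modeled in user space as follows. This sketch is not the ext2 implementation: the block map is a plain array, the function returns a data pointer rather than a page structure, it ignores the create argument, and all names beginning with toy_ are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512             /* assumed sector size */
typedef unsigned long sector_t;     /* local stand-in for the kernel type */

/* Hypothetical block device exposing a direct_access-style method. */
struct toy_bdev {
    int (*direct_access)(struct toy_bdev *bdev, sector_t sector,
                         unsigned long *data);
    unsigned char *mem;             /* directly addressable device memory */
};

/* Trivial device method: no bounds checking in this toy version. */
int toy_direct_access(struct toy_bdev *bdev, sector_t sector,
                      unsigned long *data)
{
    *data = (unsigned long)(uintptr_t)(bdev->mem + sector * SECTOR_SIZE);
    return 0;
}

/* Hypothetical file: a table mapping file blocks to device sectors. */
struct toy_file {
    struct toy_bdev *bdev;
    const sector_t *block_map;      /* block number -> first device sector */
    sector_t nr_blocks;
};

/* Resolve a file block to the address of its data, or NULL on failure. */
void *toy_get_xip_block(const struct toy_file *file, sector_t blockno)
{
    unsigned long addr;

    assert(file != NULL && file->bdev != NULL);
    if (blockno >= file->nr_blocks)
        return NULL;
    /* Filesystem-specific step: block number to device sector. */
    if (file->bdev->direct_access(file->bdev, file->block_map[blockno],
                                  &addr) != 0)
        return NULL;
    return (void *)(uintptr_t)addr;
}
```

Only the block-to-sector lookup is filesystem-specific; the address itself always comes from the device's method.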
So far, the XIP patches enable fast, memory-to-memory device access, but
they do not yet implement true execute-in-place operation. The last step
is to replace the usual nopage() VMA operation
(filemap_nopage()) with a new version
(filemap_xip_nopage()) when the underlying device and filesystem
support XIP. The new nopage() method will (using
get_xip_page()) handle page faults by causing a process's page
tables to point directly to
the on-"disk" pages, rather than reading those pages into RAM. Some other
technique will be needed to run the kernel itself in an XIP mode, but
anything that is invoked thereafter can be run directly from the memory
device.
Put the above pieces together, and Linux has a complete execute-in-place
implementation. Supporting XIP at the block level is not the only way it
could be implemented; David Woodhouse pointed
out that an alternative approach is to use a special-purpose
filesystem. Carsten's patches, however, show a way in which any
filesystem could be made to work in an XIP mode.
Page editor: Jonathan Corbet