Kernel development [LWN.net]

Kernel release status

The current stable 2.6 release is 2.6.11.9, which was released on May 11. It contains a fix for the ELF loader vulnerability and a couple of other fixes as well.

The current 2.6 prepatch is 2.6.12-rc4, announced by Linus on May 6. Changes this time around include more "sparse" annotations, a CIFS update, various architecture updates, resource limits for niceness and realtime scheduling (covered in last week's Kernel Page), a JFS update, some networking tweaks, and more. See the long-format changelog for the details.

Linus is currently on vacation, so no new patches have been added to his git repository since -rc4.

The latest -mm release is 2.6.12-rc3-mm3. Recent changes to -mm include a rework of the huge page code, a bunch of UML updates, a device mapper update, and more fixes.

The coding style enforcer

The coding style document packaged with the kernel source contains a number of clear rules; here's one of them:

Don't put multiple statements on a single line unless you have something to hide:

        if (condition) do_this;
          do_something_everytime;

Jesper Juhl recently found some code which evidently had something to hide, and submitted a patch to break the offending if statements onto two lines. Andrew Morton rejected it:

There are about 88 squillion of these in the kernel. I think it would be a mistake for me to start taking such patches, sorry.

In further discussion, however, Andrew seemed to agree that, perhaps, cleaning up the kernel source to be more generally compliant with the coding style documentation might be a good thing. He just doesn't want to cope with hundreds of little patches to that end. He will, however, consider a small number of very large patches.

So a major coding style cleanup seems likely to happen, perhaps before 2.6.12 comes out. Applying this sort of patch so late in the cycle should be safe; the intent is to change the formatting, but to make no actual code changes. Andrew also plans to drop any changes which do not apply against the -mm tree, in the hopes of minimizing the effects of the changes on patches maintained by other developers.

If all goes according to this plan, the final 2.6.12 patch could be large indeed.

The mini_fo filesystem

Markus Klotzbuecher recently announced the release of mini_fo 0.6.0. Mini_fo provides what other systems have called a "translucent" or "copy-on-write" filesystem: a read-only base filesystem (possibly on a remote system or a CD-ROM) can be made to appear, via mini_fo, as a local, writable filesystem. This functionality is useful for sharing filesystems with local overrides, live CD systems, sandboxing applications, and more.

At its core, mini_fo performs a simple fan-out operation. Each inode, dentry, and file structure associated with a mini_fo filesystem contains (via its private data) pointers to two other structures of the same type. One of them refers to the file or directory on the base filesystem; the other, instead, is for a local version of the file or directory on a local "storage filesystem." Both are hidden from user space, which thinks it is dealing directly with a file stored in the mini_fo filesystem.

When a mini_fo filesystem is first created, it appears as an exact copy of the underlying base filesystem. Any operation which reads files or directories is simply passed through to the base filesystem, with almost no additional overhead. In this mode, mini_fo functions as a sort of loopback filesystem.

Things change, however, when a file is opened for writing. In this case, mini_fo will create a copy of the file on the storage filesystem, with all of the data copied over. Any subsequent operations on that file will use the locally-stored version rather than the base version. So any changes made will appear locally, but they will not be propagated back to the base. Changes will be persistent across mounts as long as the storage directory used by mini_fo is not modified by anything except mini_fo.

Modified files are not the full story, of course; mini_fo must also cope with operations like deletes and renames. To that end, it maintains a set of lists of files which it knows about locally; there is one list for modified files, one for deleted files, one for files created locally, etc. These lists are stored in-kernel as standard linked lists. They are also written to the storage filesystem in a magic file (named META_dAfFgHE39ktF3HD2sr, for what it's worth) and reloaded from that file when the filesystem is mounted.

This release of mini_fo works with both the 2.4 and 2.6 kernels. Its author claims that it is intended for use with embedded systems, and thus has a small memory footprint. See the mini_fo web page for more information.

A system call for unsharing

When a new process is created with the clone() system call, a set of flags is provided which tells the kernel which resources, if any, should be shared between that process and its parent. Potentially shareable resources include virtual memory, open files, signal handlers, and more. New processes also share, by default, the filesystem namespace seen by their parent (and, usually, by the system as a whole).

In the current Linux kernel, the sharing decisions made at clone() time last for the lifetime of the processes involved. There is not usually a reason to change resource sharing, but recent discussions on supporting private mounts (with the filesystems in user space patch, or otherwise) have suggested that it would actually be useful for a process to be able to "unshare" resources after its creation. In particular, if a process could detach itself from the global filesystem namespace and create its own, it would be possible to set up that new namespace with whatever private mounts that process needs. If this functionality were used within a PAM module, it would be relatively easy for administrators to set up per-user views of the filesystem, complete with private mounts.

To that end, Janak Desai has posted a patch adding a new unshare() system call. The interface is simple enough:

    long unshare(unsigned long flags);

The flags argument can be CLONE_NEWNS (to create a new filesystem namespace), CLONE_VM (to establish a private virtual address space) or CLONE_SIGHAND (to unshare signal handlers). If all goes well, when the call returns, the designated resource(s) will now be private to the calling process; otherwise the situation is unchanged.

This patch has not yet made it to the linux-kernel mailing list, and may see some changes before it is considered for inclusion.

Execute-in-place

Execute-in-place (XIP) support for the Linux kernel has been on the embedded systems wishlist for some time. Such systems usually have the kernel and relevant application images stored in a directly-accessible ROM or flash memory. This memory generally contains a filesystem, and is treated as a disk drive. This mechanism works, but it can be inefficient: running a program from this memory requires that said program first be copied into (usually scarce) RAM. It would be much better if this code could be executed directly out of the flash-based memory.

Carsten Otte (of IBM) has posted a set of patches adding XIP support to the 2.6 kernel. These patches also enable fast memory-to-memory block I/O for such devices, bypassing the page cache and most of the block layer. As a result, the XIP patches are useful in other situations as well; Carsten notes, for example, shared-memory block devices used to communicate between (virtual) systems.

The first step is to add support at the block driver level. To that end, a new method is added to the block_device_operations structure:

    int (*direct_access) (struct inode *inode, sector_t sector, 
                          unsigned long *data);

This method, if implemented, should come up with a kernel virtual address corresponding to the given sector on the block device represented by inode. That address, which must remain valid until the device is closed, is returned in *data. The return value is zero on success or a negative error code in case of problems.

The next step is a new method in the address_space_operations structure:

    struct page *(*get_xip_page)(struct address_space *space, 
                                 sector_t blockno, int create);

This method's job is to translate a specific block number within a filesystem to a page structure pointing to its directly-mapped data. It is a filesystem-specific function which will translate blockno to a sector number on the underlying device, then use that device's direct_access() method to get an address. Carsten has posted an implementation for ext2 which shows how this method can be put together.

So far, the XIP patches enable fast, memory-to-memory device access, but they do not yet implement true execute-in-place operation. The last step is to replace the usual nopage() VMA operation (filemap_nopage()) with a new version (filemap_xip_nopage()) when the underlying device and filesystem support XIP. The new nopage() method will (using get_xip_page()) handle page faults by causing a process's page tables to point directly to the on-"disk" pages, rather than reading those pages into RAM. Some other technique will be needed to run the kernel itself in an XIP mode, but anything that is invoked thereafter can be run directly from the memory device.

Put the above pieces together, and Linux has a complete execute-in-place implementation. Supporting XIP at the block level is not the only way it could be implemented; David Woodhouse pointed out that an alternative approach is to use a special-purpose filesystem. Carsten's patches, however, point out a way in which any filesystem could be made to work in an XIP mode.

Patches and updates

Linus Torvalds Linux v2.6.12-rc4

Andrew Morton 2.6.12-rc3-mm3

Willy Tarreau Linux 2.4.30-hf1

Willy Tarreau Linux 2.4.29-hf8

gh@us.ibm.com CKRM: Core patch set with Classification Engine, basic controllers

Con Kolivas implement nice support across physical cpus on SMP

Ingo Molnar Real-Time Preemption, -RT-2.6.12-rc4-V0.7.47-00

Paul E. McKenney RCU and CONFIG_PREEMPT_RT progress

Guillaume Thouvenin connector: add a fork connector

Haoqiang Zheng swap-sched: schedule with dynamic dependency detection (2.6.12-rc3)

Janak Desai new system call, unshare

Carl Spalletta Linux-tracecalls for kernel 2.6.11.6

Hien Nguyen kprobes: function-return probes

Petr Baudis Cogito-0.10

Thomas Gleixner git tracker online

David Greaves Git Documentation online

Ian Wienand Automated Kernel Build Regression Testing

David S. Miller tg3: Add tagged status support

Duncan Sands USB ATM: new usbatm core

Duncan Sands USB ATM: port speedtch to new usbatm core

Duncan Sands [PATCH 3/5] USB ATM: driver for the Conexant AccessRunner chipset cxacru

Duncan Sands USB ATM: generic DSL modem driver xusbatm

David Greaves core-git documentation update

Yani Ioannou dynamic sysfs callbacks

Robert Love inotify.

Markus Klotzbuecher mini_fo-0.6.0 overlay file system

Carsten Otte add execute in place support

Carsten Otte bdev: add execute in place support

Carsten Otte loop: add execute in place support

Carsten Otte madvice/fadvice: add execute in place support

Carsten Otte ext2: add execute in place support

Carsten Otte mm/fs: add execute in place support

Andy Whitcroft SPARSEMEM memory model patches

Andy Whitcroft generify early_pfn_to_nid

Andy Whitcroft generify memory present

Andy Whitcroft sparsemem memory model

Andy Whitcroft sparsemem memory model for i386

Olof Johansson sparsemem memory model for ppc64

Andy Whitcroft sparsemem swiss cheese numa layouts

Andy Whitcroft sparsemem hotplug base

A.M. Fradley An attempt to improve the swap tokening:

Christoph Lameter NUMA aware slab allocator V2

Ray Bryant mm: manual page migration-rc2 -- overview

Ray Bryant mm: manual page migration-rc2 -- add-sys_migrate_pages-rc2.patch

Ray Bryant mm: manual page migration-rc2 -- sys_migrate_pages-xattr-support-rc2.patch

Ray Bryant mm: manual page migration-rc2 -- xfs-extended-attributes-rc2.patch

Ray Bryant mm: manual page migration-rc2 -- xfs-migrate-page-rc2.patch

Ray Bryant mm: manual page migration-rc2 -- add-node_map-arg-to-try_to_migrate_pages-rc2.patch

Ray Bryant mm: manual page migration-rc2 -- sys_migrate_pages-cpuset-support-rc2.patch

Ray Bryant mm: manual page migration-rc2 -- sys_migrate_pages-mempolicy-migration-rc2.patch

Ray Bryant mm: manual page migration-rc2 -- sys_migrate_pages-permissions-check-rc2.patch

Thomas Graf textsearch infrastructure + skb_find_text()

David S. Miller TSO Reloaded

Stephen Hemminger TCP congestion infrastructure

Max Kellermann H.323: implement a "real" H.245 parser

Max Kellermann add ip_nat_h245()

Max Kellermann H.323: minor code style fixes

Max Kellermann H.323: ASN.1/PER parser

Max Kellermann [PATCH pom-ng 4/6] H.323: splitted ip_conntrack_h323.c into 3 sources

Max Kellermann H.323: H.245/ASN.1 parser

Max Kellermann H.323: remove struct ip_ct_h225_master

Greg KH Fix kernel ELF core dump privilege elevation

Douglas Gilbert sg3_utils-1.14 available

Douglas Gilbert sdparm 0.91

Greg KH hotplug-ng 002 release

Anthony Awtrey Hotplug-Perl

Oleg Nesterov alternative implementation of Priority Lists

Erik van Konijnenburg yaird 0.0.7, a mkinitrd based on hotplug concepts

Derbey Nadia Automatic Kernel Tunables

