Kernel development [LWN.net]

Current release status

The current development kernel is 2.5.31, released by Linus on August 10. It includes an ISDN update, more driverfs work, a JFS update, a lot of ethernet driver updates, a number of ARM, Alpha, and SPARC64 updates, and more. This tree also includes the "User-mode Linux preparation" patches, which make various changes to core code needed by UML - but UML itself has not yet been merged. The long format changelog is available for people wanting the details.

Linus's BitKeeper tree - which will become 2.5.32 - currently contains Andrew Morton's controversial "printk from userspace" patch (to support boot-time message logging), the pthreads-support patches from Ingo Molnar (see below), more device model/driverfs work, a new realtime clock driver, some USB updates, and the usual pile of fixes.

The latest 2.5 kernel status summary from Guillaume Boissiere is dated August 14.

The current stable kernel is still 2.4.19. Marcelo released 2.4.20-pre2 on August 12; it includes a big S/390 update, a ReiserFS update, a number of small VM tweaks, some new netfilter modules, the "block I/O from high memory" patch, a set of NFS updates, and a very long list of other fixes and updates.

The current prepatch from Alan Cox is 2.4.20-pre2-ac2; the main item of interest in this patch is the merging of LVM2, the new Linux volume manager implementation.

Making Linux safe for pthreads

The Linux kernel has long been criticized for its thread support. This criticism is surprising to some, since the Linux clone() system call provides a great deal of flexibility in the creation of threads that share resources with their parent process. But clone() is not enough to allow Linux to fully support the POSIX thread (pthreads) standard with good performance - especially for applications which create thousands of threads.

And such applications do exist. A lot of kernel hackers dismiss highly threaded applications as being poorly written - having more threads than processors on the system is almost always a loss from a performance point of view, and truly robust thread programming is difficult. But Linux must support what users want to do, or they will use a different system. This week has seen the culmination of quite a bit of work aimed at improving the kernel's basic thread support.

The push to improve thread support began some months ago with Rusty Russell's "Futex" (fast user-space mutex) patch. Futexes allow pthread mutexes and condition variables to be implemented so that the common, uncontended case is handled entirely in user space; a system call is required only when there is contention. This patch was merged in 2.5.7 and has been refined since then.
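
The fast-path idea can be pictured with a toy model. What follows is Python used as executable pseudocode, not the kernel interface or Rusty's code: the class name, the three-state encoding of the lock word and the kernel_calls counter are all assumptions made for illustration, and a condition variable plays the part of the kernel's wait queue.

```python
import threading


class ToyFutexMutex:
    """Toy model of a futex-style mutex; not the real kernel interface.

    State word: 0 = free, 1 = locked, 2 = locked and somebody may be waiting.
    """

    def __init__(self):
        self._state = 0
        # A plain lock stands in for the CPU's atomic instructions.
        self._atomic = threading.Lock()
        # A condition variable stands in for the kernel's wait queue.
        self._queue = threading.Condition(self._atomic)
        # Counts every modeled trip into the kernel.
        self.kernel_calls = 0

    def _cmpxchg(self, expected, new):
        """Modeled atomic compare-and-exchange; returns the old value."""
        with self._atomic:
            old = self._state
            if old == expected:
                self._state = new
            return old

    def _xchg(self, new):
        """Modeled atomic exchange; returns the old value."""
        with self._atomic:
            old, self._state = self._state, new
            return old

    def _futex_wait(self, expected):
        # Modeled system call: sleep only if the word still holds `expected`.
        with self._queue:
            self.kernel_calls += 1
            if self._state == expected:
                self._queue.wait()

    def _futex_wake(self):
        # Modeled system call: wake one sleeper, if there is one.
        with self._queue:
            self.kernel_calls += 1
            self._queue.notify(1)

    def acquire(self):
        if self._cmpxchg(0, 1) == 0:
            return  # fast path: uncontended, no kernel involvement
        # Slow path: mark the lock contended and sleep until it is handed over.
        while self._xchg(2) != 0:
            self._futex_wait(2)

    def release(self):
        if self._xchg(0) == 2:
            self._futex_wake()  # only needed when somebody may be waiting
```

A single thread that takes and drops this lock never touches the modeled kernel at all; kernel_calls starts to climb only once several threads fight over it.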

More recently, Ingo Molnar has been working on thread support issues. His first thread-local storage (TLS) patch was posted on July 25; it was merged in 2.5.29 and is still being hacked upon. The purpose of TLS, of course, is to give each thread access to a region of memory which is not shared with all other threads. Ingo's patch, which is implemented only for the x86 architecture, supports TLS with the following changes:

Doing thread-local storage right on the x86 requires using the segment mechanism. The patch sets aside a few entries in the processor's global descriptor table (GDT) to implement the TLS segments. The most recent patch as of this writing (tls-2.5.31-D9) creates three segments: one for glibc (and, thus, pthreads), one for Wine, and one unassigned.
A new set_thread_area() system call allows library code to set up thread-local storage using one of the TLS segments.
At every context switch, the kernel copies the new process's TLS entries into the appropriate part of the GDT.

With these changes, each thread can have its own, transparent, local storage area. There was just one last complication: the x86 GDT was global and shared on SMP systems. So Ingo had to create a separate GDT for each processor, with the interesting result that context switches got a little faster.
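
The pieces described above (reserved GDT slots, set_thread_area(), the copy at context switch, and one GDT per processor) can be pictured with a small data-structure model. This is Python used as executable pseudocode, not kernel code: the slot index, the table size and the class names are hypothetical, and the real patch deals in segment descriptors rather than bare addresses.

```python
from dataclasses import dataclass, field

TLS_ENTRIES = 3     # glibc, Wine, one unassigned (as in tls-2.5.31-D9)
GDT_TLS_BASE = 6    # hypothetical index of the first TLS slot
GDT_SIZE = 16       # hypothetical table size


@dataclass
class Task:
    """A thread, carrying its own copy of the TLS entries."""
    name: str
    tls: list = field(default_factory=lambda: [None] * TLS_ENTRIES)

    def set_thread_area(self, slot, base_address):
        """Model of the new system call: describe one TLS segment."""
        self.tls[slot] = base_address


class Cpu:
    """A processor with a private GDT, so CPUs do not share TLS slots."""

    def __init__(self):
        self.gdt = [None] * GDT_SIZE
        self.current = None

    def switch_to(self, task):
        # At every context switch the incoming task's TLS entries are
        # copied into the reserved part of this CPU's GDT.
        self.gdt[GDT_TLS_BASE:GDT_TLS_BASE + TLS_ENTRIES] = task.tls
        self.current = task

    def tls_base(self, slot):
        """What code running on this CPU sees through TLS segment `slot`."""
        return self.gdt[GDT_TLS_BASE + slot]
```

With two Cpu objects running two different tasks, each processor resolves the same TLS slot to a different address, which is exactly what a single shared GDT could not offer on SMP.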

Next problem: what if you want to create lots of threads in a quick and safe manner? The classic Unix fork() system call has a problem in that the newly-created child process could exit before the process ID is ever returned to the parent; if the parent loses this race, it can be left in a position where it no longer knows what is going on with its children. This problem can be worked around, but the workaround involves more system calls, which slow down thread creation.

Ingo's solution comes in the form of a couple of new flags to the clone() system call. The pthread library can throw in CLONE_SETTID, which causes the process ID of the new thread to be written back to a variable in the parent's address space before the new thread begins running. There is also a CLONE_SETTLS flag which causes the equivalent of a set_thread_area() call to happen as well. The result is a robust way of creating new threads with a single system call.
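
The difference between the two orderings can be shown with a toy model. This is Python used as executable pseudocode, not the clone() interface: toy_clone(), the children table, the lost list and the ID allocator are all hypothetical names, and Python threads stand in for kernel threads.

```python
import itertools
import threading

_tid_source = itertools.count(1000)  # hypothetical stand-in for the kernel's ID allocator


def toy_clone(body, children, lost, settid):
    """Start a thread running body(tid, children, lost); return (tid, thread).

    settid=True models CLONE_SETTID: the ID is stored in the parent's table
    before the child runs. settid=False models the classic interface, where
    the parent learns the ID only from the return value.
    """
    tid = next(_tid_source)
    if settid:
        children[tid] = "running"
    child = threading.Thread(target=body, args=(tid, children, lost))
    child.start()
    if not settid:
        # Too late if the child has already exited: its exit report found
        # no entry and was dropped, and the table now claims it still runs.
        children[tid] = "running"
    return tid, child


def short_lived(tid, children, lost):
    """A child that exits at once and reports the fact to its parent."""
    if tid in children:
        children[tid] = "exited"
    else:
        lost.append(tid)
```

With settid=True every exit report finds its entry, however quickly the child dies. With settid=False the outcome depends on scheduling, which is the race the new flag is meant to close.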

Finally, the pthreads code has a couple of issues to deal with when threads die. The stack used by the thread must be deallocated - and the dying thread can not do that itself. With enough system calls, pthreads handles that now, but thread exit should really be a lightweight event, and a system call-heavy solution defeats that purpose.

Much of the overhead can be eliminated if the thread library can be told about thread exit without the usual SIGCHLD signal - signals are expensive. The new pthreads code can do that with the futex mechanism - almost. It is still difficult to know, without a signal, when the thread has truly finished using its stack, so that said stack can be freed. If the stack gets freed before the thread is done with it, the result is a big mess and a new interest on the developer's part in Windows threading packages; this outcome needs to be avoided.

Ingo's first attempt to solve this problem was through the addition of an exit_free() system call, which would simply write a special value in the parent's address space to indicate that the stack could be freed. Linus, however, called this patch "too ugly to live." After some discussion, the solution that emerged was to add another clone() flag: CLONE_RELEASE_VM. If a thread is created with that flag, a word is set aside at the top of the thread's stack. When the thread releases its current virtual memory - by exiting, or by execing another program - that word is written with a flag value. The parent can see that value and know that the stack can be freed.
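
The handshake can be modeled in a few lines. This is Python used as executable pseudocode, not the kernel mechanism: the class, the flag value and the method names are hypothetical (the article does not say what value the kernel writes), and the finally clause stands in for the kernel acting after the thread has let go of its memory.

```python
import threading

STACK_RELEASED = 1  # hypothetical flag value


class ToyThread:
    """Toy model of a thread created with CLONE_RELEASE_VM."""

    def __init__(self, body, stack_words=8):
        # The last word models the one set aside at the top of the stack.
        self.stack = [0] * stack_words
        self._body = body
        self._thread = threading.Thread(target=self._run)

    def _run(self):
        try:
            self._body(self.stack)  # thread code may use its stack freely
        finally:
            # Modeled kernel step: performed only once the thread can no
            # longer touch its stack, never by the thread's own code.
            self.stack[-1] = STACK_RELEASED

    def start(self):
        self._thread.start()

    def try_free_stack(self):
        """Parent side: free the stack only if the flag word says it is safe."""
        if self.stack is not None and self.stack[-1] == STACK_RELEASED:
            self.stack = None
            return True
        return False
```

The parent needs neither a signal nor a wait call here; it looks at one word and learns whether the stack is still in use.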

Finally, Ingo has posted yet another patch implementing the CLONE_DETACHED flag. If a thread is created with that flag, no signal is sent to the parent process when the thread exits. This solution is faster than having the parent simply ignore SIGCHLD, and also does not require the parent to do without notification for all of its children.

The other half of all this work, of course, is a new pthreads library that actually uses all of these new features. The code is in progress and will be part of a future glibc release. Then, maybe, people will stop complaining about thread support in Linux.

Memory management and patents

Linux VM hackers are engaged in ongoing discussions on both large page support (covered last week) and improving the performance of the new reverse mapping mechanism. That conversation slowed down, however, when Alan Cox pointed out that a number of the techniques being discussed are covered by SGI patents. In fact, a closer look by Daniel Phillips shows that a number of existing Linux technologies, including reverse mapping in general and the buddy allocator, are covered by these patents. This is a problem, he said, that we can't ignore.

That was Linus's cue to jump in with his policy on software patents and kernel code:

I do not look up any patents on _principle_, because (a) it's a horrible waste of time and (b) I don't want to know.

The fact is, technical people are better off not looking at patents. If you don't know what they cover and where they are, you won't be knowingly infringing on them. If somebody sues you, you change the algorithm or you just hire a hit-man to whack the stupid git.

Linus followed up with a note that the above "may not be legally tenable advice." But he sticks by his point that, these days, it's impossible to write an interesting program without running into somebody's patent. Rather than worry about it, it's better to just proceed and deal with any problems as they emerge.

This is probably the only rational approach; otherwise kernel hackers would go nuts trying to find and avoid all of the applicable patents. It's probably only a matter of time, though, until one of these patents bites the kernel in a big way - at least in the U.S. Those are the times we live in, though.

NFSv4 is coming

The integration of an NFS version 4 implementation into the Linux kernel got one step closer this week when Kendrick Smith announced the availability of a set of patches for 2.5.31. These patches are not for casual users quite yet - there are 38 of them, they only implement a small part of the NFSv4 protocol, and a fair amount of work is needed to get it all going. The purpose of this set of patches is to get a conversation started toward the merging of NFSv4 into the kernel. Once the minimal code is in, the rest of the protocol (which works in a 2.4 version of the patch) can be ported forward and merged.

Patches and updates

Jeff Garzik Announce: daily 2.5 BK snapshots

Alan Cox Linux 2.4.20-pre1-ac2

Alan Cox Linux 2.4.20-pre1-ac3

Marc-Christian Petersen WOLK v3.5 FINAL, Codemane 'Fin' alias 'Birthday Release' "Also I am a kind of happy that this is the last release of the 'Working Overloaded Linux Kernel', because I don't have the time that WOLK needs for further good development."

Patricia Gaughen (1/4) discontigmem support for i386 against 2.5.30

Patricia Gaughen (2/4) discontigmem support for i386 against 2.5.30

Patricia Gaughen (3/4) discontigmem support for i386 against 2.5.30

Patricia Gaughen (4/4) discontigmem support for i386 against 2.5.30

Greg Ungerer linux-2.5.31uc0 MMU-less patches

Jeff Dike UML - part 1 of 3

Jeff Dike UML - part 2 of 3

Jeff Dike UML 2.5.31

john stultz tsc-disable_B9 "This patch enables a workaround for multi-node NUMA systems that are experiencing gettimeofday returning "old" time values."

Dipankar Sarma smptimers 2.5.30 minus TIMER_BH

Erich Focht ACPI_NUMA for SRAT/SLIT table parsing "The attached patch implements the parsing of the ACPI SRAT (Static Resource Affinity Table) and SLIT (System Locality Information Table) which are meanwhile the standard for providing NUMA information on IA64 platforms and started to spread on IA32, too."

Ingo Molnar tls-2.5.31-D4

Ingo Molnar tls-2.5.31-D9

Rusty Russell Simplified scalable cpu bitmasks

Stephen Hemminger fast reader/writer lock for gettimeofday 2.5.30

Dominik Brodowski CPUFreq core for 2.5.31

Ingo Molnar clone_startup(), 2.5.31-A0

Ingo Molnar CLONE_SETTLS, CLONE_SETTID, 2.5.31-BK

Ingo Molnar clone-detached-2.5.31-A1

Ingo Molnar exit_free(), 2.5.31-A0

Ingo Molnar user-vm-unlock-2.5.31-A2

Maneesh Soni dcache scalability patch [2.5]

Keith Owens Announce: kdb v2.3 i386 updates for kernels 2.4.18 and 2.4.19

Ravikiran G Thirumalai Scalable statistics counters

Ravikiran G Thirumalai Scalable statistics counters using seq_file interfaces

Mel VM Regress - A VM regression and test tool

Rusty Russell (Re-xmit) kprobes for i386

Jesse Barnes lock assertion macros for 2.5.31

Marcin Dalecki 2.5.30 IDE 115

James Hicks new driver: multimedia card (mmc) framework, patch against 2.4.19

Greg KH USB changes for 2.5.31

Greg KH More USB changes for 2.5.31

Andrew Vasquez QLogic FC Driver for Linux 6.01b4 Released.

Ken Hahn BACKPACK USB (and USB2.0) now working in Linux

Steve Best Journaled File System (JFS) release 1.0.21

Anton Altaparmakov NTFS 2.0.25 - Minor bugfixes and cleanups

Keith Owens Announce: XFS split patches for 2.4.19 - respin

Andreas Gruenbacher acl-2.0.17 and attr-2.0.9

Kendrick M. Smith announcing NFSv4 patches against 2.5.31

Naohiko Shimizu [RFC]Super Page for Alpha,Sparc64,i386

Christoph Hellwig vmap/vunmap aka vmalloc rewrite (1st resend of 2nd implementation)

Andrew Morton reduce the number of tlb invalidations

Dipankar Sarma lockfree route lookup using RCU

Jari Ruusu Announce loop-AES-v1.6f file/swap crypto package

Airong Zhang ANNOUNCE: August Linux Test Project Announcement

H. Peter Anvin klibc development release "klibc is a tiny C library subset intended to be integrated into the kernel source tree and being used for initramfs stuff."

Denis Vlasenko lk maintainers

Thomas Molina 2.5 Problem Report Status

Andrew Morton printk from userspace
