
Kernel development

Brief items

Kernel release status

The current stable 2.6 kernel is 2.6.11.10, released on May 16 in response to yet another serious security hole.

The current 2.6 prepatch remains 2.6.12-rc4. Linus has returned from his vacation and has merged about 150 patches into his git repository; these patches consist almost exclusively of security fixes, architecture updates, and various other important fixes.

The current -mm tree is 2.6.12-rc4-mm2. Recent additions to -mm include the IPSec tree, some KProbes work, the fork connector patch (for process accounting), a DVB update, an ALSA update, a NUMA-aware slab allocator, and more fixes. Note that there is now a mailing list for people who would like to be notified when patches are added to -mm; see the 2.6.12-rc4-mm2 introduction for subscription information.

The current 2.4 prepatch is 2.4.31-pre2, which was released by Marcelo on May 12. It contains a fix for the ELF core dump vulnerability and a small number of other patches.

Comments (none posted)

Kernel development news

Is hyperthreading dangerous?

Hyperthreading (or simultaneous multi-threading) is a hardware technique used to squeeze more performance out of modern processors. A hyperthreaded processor appears, in many ways, to be a set of two independent processors. These two processors share the same hardware, however, with only the processor registers and other state-dependent information being kept separate. Only one of the two CPUs can actually be executing at one time. Hyperthreading helps performance because processors often stall, waiting for memory accesses. When one processor in a hyperthreaded set must wait, the other can be executing. Hyperthreading thus enables greater utilization of the processor hardware; the resulting performance gains are said to be anywhere from 5% to 30%, depending on the workload.

One of the resources shared by hyperthreaded processor sets is the memory cache. This sharing has its advantages: if processes running on the two processors are sharing memory, that memory need only be fetched into the cache once. That kind of sharing happens often; shared libraries are one obvious example. The shared cache also makes moving processes between hyperthreaded processors an inexpensive operation, so keeping loads balanced across the system is easier.

The sharing of caches between hyperthreaded processors is also, however, the cause of a vulnerability identified in a widely publicized report by Colin Percival. The core of the problem is that, by measuring the latency of specific memory accesses, a process can tell whether a given memory location was represented in the processor cache or not. A hostile process can load the cache with its own memory, wait a bit, then run tests to see which locations have been evicted from the cache. From that information, it can make inferences about which memory locations were accessed by the sibling processor in the hyperthreaded set.
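
The measurement itself requires nothing exotic. Here is a minimal user-space sketch of the probe step, assuming an x86 processor; probe() and rdtsc() are illustrative names, and the paper's real exploit is considerably more careful about noise and out-of-order execution:

    #include <stdint.h>

    /* Read the x86 timestamp counter. */
    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32) | lo;
    }

    /* Time one load of a cache line the attacker previously loaded;
     * a large result suggests the line was evicted, i.e. the sibling
     * processor touched memory mapping to the same cache set. */
    static uint64_t probe(volatile const char *line)
    {
        uint64_t t0 = rdtsc();
        (void)*line;                 /* the timed load */
        return rdtsc() - t0;
    }

Repeating this over an array covering every cache set yields a map of which sets the sibling has been using.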

Two cooperating processes, running at different privilege levels, could make use of the cache to set up a covert channel for communication. In a highly secured system, these two processes might not be able to talk to each other at all normally. With a covert channel in place, information can be leaked from a privileged level to one less privileged, leading to all kinds of dreadful consequences - for somebody. Most systems, however, are not overly concerned about this sort of covert channel; there are easier ways to deliberately leak information.

Mr. Percival, however, also shows how the vulnerability can be exploited to obtain information from processes which are not cooperating. In particular, he claims that it can be used to steal keys from cryptographic applications. A number of crypto algorithms have data-dependent memory access patterns; an attacker who can watch memory accesses can, for some algorithms, derive the key which was being used. The exploit discussed in the report attacks OpenSSL's RSA private key operations in this way.
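
To see why data-dependent access patterns matter, consider this hypothetical cipher step (purely illustrative; no real algorithm is being quoted here):

    #include <stdint.h>

    /* Which cache line of sbox[] gets pulled in depends on secret
     * key material, so a sibling process watching cache evictions
     * learns something about the key on every call. */
    static const uint8_t sbox[256] = { 0x63, 0x7c, 0x77, /* ... */ };

    uint8_t leaky_step(uint8_t input, uint8_t key_byte)
    {
        return sbox[input ^ key_byte];   /* key-dependent access */
    }

Enough observations of which lines were touched can narrow down key_byte considerably.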

The paper makes a number of recommendations on steps which can be taken to mitigate this problem. The simplest is to disable hyperthreading outright; on Linux systems, that is just a matter of configuring out hyperthreading support or booting with the noht option. Alternatively, the kernel could take care not to schedule potentially unfriendly processes on the same hyperthreaded set. Removing access to a high-resolution clock would make the necessary timing information unavailable, thus defeating such attacks. Cryptographic algorithms could be rewritten to avoid data-dependent memory access patterns. Processors could be redesigned to not share caches between hyperthreaded siblings, or to use a cache eviction algorithm which makes it harder to determine which cache lines have been removed.

The Linux scheduler could certainly be changed to defeat attempted cache-based attacks on hyperthreaded processors, but the chances of that happening are small. There are numerous obstacles to any sort of real-world exploit of this vulnerability. The attacker must be able to run a CPU-intensive program on the target system - without being noticed - and ensure that it remains on the same hyperthreaded processor as the cryptographic process. The data channel is noisy at best, and it will be made much more so by any other processes running on the system. Timing the attack (knowing when the target process is performing cryptographic calculations, rather than doing something else) is tricky. Getting past all these roadblocks is likely to keep a would-be key thief busy for some time.

In other words, there are almost certainly more effective ways of attacking cryptographic applications. Closing this particular hole is unlikely to be worth the trouble, extra complexity in the kernel, and performance impact it would require. So this vulnerability, despite all the press it has obtained, will probably not lead to any changes to the kernel in the near future. Anybody who is truly worried about this problem will be best off simply turning off hyperthreading for now. In the longer term, authors of cryptographic code may find that they need to add avoidance of data-dependent memory access patterns to their arsenal of techniques.

Comments (12 posted)

A new kernel timer API

John Stultz's new core time subsystem was covered on this page back in January. This patch set, which will soon be submitted for inclusion into -mm, replaces a mess of architecture-specific time implementations with a cleaner, central time subsystem which can take full advantage of hardware time sources. Nishanth Aravamudan would now like to take advantage of the new low-level time code by replacing the kernel timer implementation. This work, if accepted, will lead to the incorporation of a new timer API to be used by kernel code when a function must be called at some point in the future.

In current Linux kernels, internal time (for most purposes) is measured in "jiffies" - really just a counter which is incremented each time a timer interrupt happens. The new time code supersedes jiffies with an absolute, monotonically increasing count of nanoseconds. References to jiffies thus become a call to:

    nsec_t do_monotonic_clock(void);

Using nanoseconds allows kernel code to work with high-resolution time in real-world units. That, in turn, lets kernel developers forget about the (error-prone) conversions between jiffies and real-world time which are currently necessary.
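
As a quick illustration, here is how a "100ms from now" expiry might be computed under each scheme (a sketch; msecs_to_jiffies() is the existing helper, while nsec_t and do_monotonic_clock() come from the new time code):

    /* Current style: convert the delay into jiffies by hand. */
    unsigned long expires_jiffies = jiffies + msecs_to_jiffies(100);

    /* New style: everything is simply nanoseconds. */
    nsec_t expires_nsecs = do_monotonic_clock() + 100 * 1000000ULL;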

Nishanth's add-on patch changes the timer subsystem to use nanoseconds as well. The current add_timer() and mod_timer() interfaces remain supported, but are deprecated. The new interfaces for setting (or modifying) a timer are:

    int set_timer_nsecs(struct timer_list *timer, nsec_t expires);
    void set_timer_on_nsecs(struct timer_list *timer, nsec_t expires, 
                            int cpu);

These functions cause the given timer to be set to go off at expires, which is an absolute nanoseconds count; set_timer_on_nsecs() additionally specifies the CPU on which the timer should run. Usually, expires will be calculated by adding the desired delay (in nanoseconds) to whatever do_monotonic_clock() returns.
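
A minimal usage sketch, assuming the patch keeps the current timer_list setup conventions (my_timer_fn() and arm_my_timer() are illustrative names):

    static struct timer_list my_timer;

    static void my_timer_fn(unsigned long data)
    {
        /* runs when the timer expires */
    }

    void arm_my_timer(void)
    {
        init_timer(&my_timer);
        my_timer.function = my_timer_fn;
        my_timer.data = 0;
        /* go off roughly 100ms from now */
        set_timer_nsecs(&my_timer,
                        do_monotonic_clock() + 100 * 1000000ULL);
    }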

It's worth noting that this patch changes the meaning of the expires field in the timer_list structure. This field is now represented in an internal "timer intervals" unit, rather than in jiffies. If the old add_timer() and mod_timer() interfaces are used, the expires field will be silently converted to the internal format. Code which performs calculations on expires (by increasing the delay and calling mod_timer(), for example) could be in for a surprise.
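
To make the hazard concrete, here is a hedged sketch of the sort of pattern the paragraph above warns about (delay_by_one_second() is illustrative):

    /* Before the patch, expires was in jiffies, so this pushed the
     * timer back by one second.  Afterward, expires is in internal
     * "timer intervals," and adding HZ silently mixes units. */
    static void delay_by_one_second(struct timer_list *timer)
    {
        timer->expires += HZ;                /* no longer jiffies! */
        mod_timer(timer, timer->expires);
    }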

This patch also deprecates schedule_timeout(), in favor of these functions:

    nsec_t schedule_timeout_nsecs(nsec_t timeout);
    unsigned long schedule_timeout_usecs(unsigned long usecs);
    unsigned int schedule_timeout_msecs(unsigned int msecs);

All three of these functions will set a timer for the given delay (which is a relative value, not absolute), then call schedule().
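
For example, an existing 50ms sleep might be converted along these lines (a sketch; as with schedule_timeout() today, the task state must be set before the call):

    /* Deprecated: compute the timeout in jiffies. */
    set_current_state(TASK_INTERRUPTIBLE);
    schedule_timeout(msecs_to_jiffies(50));

    /* New style: say what you mean, in milliseconds. */
    set_current_state(TASK_INTERRUPTIBLE);
    schedule_timeout_msecs(50);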

Comments (14 posted)

Clusters and distributed lock management

The creation of tightly-connected clusters requires a great deal of supporting infrastructure. One of the necessary pieces is a lock manager - a system which can arbitrate access to resources which are shared across the cluster. The lock manager provides functions similar to those found in the locking calls on a standalone system - it can give a process read-only or write access to parts of files, for example. The lock management task is complicated by the cluster environment, though; a lock manager must operate correctly regardless of network latencies, cope with the addition and removal of nodes, recover from the failure of nodes which hold locks, and so on. It is a non-trivial problem, and Linux does not currently have a working, distributed lock manager in the mainline kernel.

David Teigland (of Red Hat) recently posted a set of distributed lock manager patches (called "dlm"), with a request for inclusion into the mainline. This code, which was originally developed at Sistina, is said to be influenced primarily by the venerable VMS lock manager. An initial look at the code confirms this statement: callbacks are called "ASTs" (asynchronous system traps, in VMS-speak), and the core locking call is an eleven-parameter monster:

    int dlm_lock(dlm_lockspace_t *lockspace,
                 int mode,
                 struct dlm_lksb *lksb,
                 uint32_t flags,
                 void *name,
                 unsigned int namelen,
                 uint32_t parent_lkid,
                 void (*lockast) (void *astarg),
                 void *astarg,
                 void (*bast) (void *astarg, int mode),
                 struct dlm_range *range);
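
As a rough illustration, a caller might request an exclusive lock on a named resource along these lines (a hedged sketch based on the posted interface; take_exclusive_lock(), my_ast(), and the resource name are illustrative, while DLM_LOCK_EX is one of the VMS-style mode constants used by the code):

    static struct dlm_lksb my_lksb;

    /* Completion AST: called when the lock request finishes;
     * the result lands in my_lksb.sb_status. */
    static void my_ast(void *astarg)
    {
        /* lock granted (or the request failed) */
    }

    int take_exclusive_lock(dlm_lockspace_t *ls)
    {
        return dlm_lock(ls, DLM_LOCK_EX, &my_lksb, 0,
                        "myres", 5,     /* resource name and length */
                        0,              /* no parent lock */
                        my_ast,         /* completion AST */
                        &my_lksb,       /* argument passed to the ASTs */
                        NULL,           /* no blocking AST */
                        NULL);          /* no range restriction */
    }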

Most of the discussion has not been concerned with the technical issues, however. There are some disagreements over issues like how nodes should be identified, but most of the developers who are interested in this area seem to think that this implementation is at least a reasonable starting point. The harder issue is figuring out just how a general infrastructure for cluster support can be created for the Linux kernel. At least two other projects have their own distributed lock managers and are likely to want to be a part of this discussion; an Oracle developer recently described the posting of dlm as "a preemptive strike." Lock management is a function needed by most tightly-coupled clustering and clustered filesystem projects; wouldn't it be nice if they could all use the same implementation?

The fact is that the clustering community still needs to work these issues out; Andrew Morton doesn't want to have to make these decisions for them:

Not only do I not know whether this stuff should be merged: I don't even know how to find that out. Unless I'm prepared to become a full-on cluster/dlm person, which isn't looking likely.

The usual fallback is to identify all the stakeholders and get them to say "yes Andrew, this code is cool and we can use it", but I don't think the clustering teams have sufficient act-togetherness to be able to do that.

Clustering will be discussed at the kernel summit in July. A month prior to that, there will also be a clustering workshop held in Germany. In the hopes that these two events will help bring some clarity to this issue, Andrew has said that he will hold off on any decisions for now.

Comments (none posted)

Patches and updates

Kernel trees

Andrew Morton 2.6.12-rc4-mm1
Andrew Morton 2.6.12-rc4-mm2
Domen Puncer 2.6.12-rc4-kj
Greg KH Linux 2.6.11.10
Greg KH Linux 2.6.11.9
Con Kolivas 2.6.11-ck8
Marcelo Tosatti Linux 2.4.31-pre2
Solar Designer Linux 2.4.30-ow3

Architecture-specific

Core kernel code

Development tools

Chris Mason packed delta git

Device drivers

Documentation

Filesystems and block I/O

David Teigland dlm: overview
David Teigland dlm: core locking
David Teigland dlm: communication
David Teigland dlm: recovery
David Teigland dlm: configuration
David Teigland dlm: device interface
David Teigland dlm: debug fs
David Teigland dlm: build
Robert Love latest inotify

Janitorial

Memory management

Christoph Lameter NUMA aware slab allocator V3
Coywolf Qi Hunt LCA OOM-Killer v2.3

Networking

Arthur Kepner "strict" ipv4 reassembly
David S. Miller Super TSO

Security-related

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds