Kernel development [LWN.net]

Kernel release status

The current 2.6 prepatch remains 2.6.19-rc5; no prepatches have been released in the last week. Enough patches have found their way into the mainline git repository that a 2.6.19-rc6 release will probably happen before this kernel cycle runs its course.

The current -mm tree is 2.6.19-rc5-mm2. Recent changes to -mm include the fault injection capability (see below), file-based capabilities, and a backport of the ext3 reservation code to ext2.

For 2.6.16 users, Adrian Bunk has released 2.6.16.32 with a number of fixes.

Quote of the week

70% hit a bug
1/7th think it's deteriorating
1/4th think lkml response is inadequate
3/5ths think bugzilla response is inadequate
2/5ths think we have features-vs-stability wrong
2/3rds hit a bug. Of those, 1/3rd remain unfixed
1/5th of users are presently impacted by a kernel bug

Happy with that?

-- Andrew Morton

Counting on the time stamp counter

The time stamp counter (TSC) is a hardware feature found on a number of contemporary processors. The TSC is a special register which is simply incremented every clock cycle. Since the clock is the fundamental unit of time as seen by the processor, the TSC provides the highest-resolution timing information available for that processor. It can thus be used for a number of applications, such as measuring the exact time cost of specific instructions or operations.

The TSC can also be read quickly (it is just a CPU register, after all), making it of interest for system timekeeping. There are a lot of applications which check the current time frequently, to the point that gettimeofday() is one of the most performance-critical system calls in Linux. By using the TSC to interpolate within the resolution of a coarser clock, the system can give accurate, high-resolution time without taking a lot of time in the process.
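
The interpolation idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the kernel's code: the function name, its parameters, and the premise that the TSC frequency is known and constant are all hypothetical.

```python
NSEC_PER_SEC = 1_000_000_000

def interpolate_time_ns(base_ns, base_tsc, tsc_now, tsc_hz):
    """Estimate the current time from a coarse clock reading plus TSC cycles.

    base_ns  -- time (in ns) recorded at the last coarse clock update
    base_tsc -- TSC value sampled at that same update
    tsc_now  -- TSC value read just now
    tsc_hz   -- assumed constant TSC frequency, in cycles per second
    """
    elapsed_cycles = tsc_now - base_tsc
    # Convert the cycles elapsed since the coarse update into nanoseconds.
    return base_ns + elapsed_cycles * NSEC_PER_SEC // tsc_hz
```

Everything here depends on tsc_hz staying fixed and on base_tsc and tsc_now coming from the same counter.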

That is the idea, anyway. In practice, the TSC turns out to be hard to use in this way. If the CPU frequency changes (as it will on CPUs which can vary their power consumption), the TSC rate will change as well. If the processor is halted (as can happen when it goes idle), the TSC may stop altogether. On multiprocessor systems, the TSCs on different processors may drift away from each other over time - leading to a situation where a process could read a time on one CPU, move to a second processor, and encounter a time earlier than the one it read on the first processor.
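
The multiprocessor case can be illustrated with a toy model. The sketch below assumes two CPUs whose TSCs tick at the same rate but differ by a fixed offset; the offsets, the frequency, and the helper name are invented for the example.

```python
NSEC_PER_SEC = 1_000_000_000
TSC_HZ = 1_000_000_000  # assumed: one TSC cycle per nanosecond

# Hypothetical per-CPU TSC offsets, in cycles; CPU 1 lags CPU 0.
tsc_offset = {0: 0, 1: -5_000}

def read_time_ns(true_ns, cpu):
    """Return the time a naive TSC-based clock would report on this CPU."""
    tsc = true_ns * TSC_HZ // NSEC_PER_SEC + tsc_offset[cpu]
    return tsc * NSEC_PER_SEC // TSC_HZ

# A process reads the clock on CPU 0, migrates, and reads again
# one microsecond of real time later on CPU 1.
first = read_time_ns(1_000_000, cpu=0)
second = read_time_ns(1_001_000, cpu=1)
went_backwards = second < first
```

Real time moved forward by a microsecond, but the second reading is smaller than the first because the gap between the two counters exceeds the elapsed time.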

These challenges notwithstanding, the Linux kernel tries to make the best use of the TSC possible. The code which deals with the TSC contains a number of checks to try to detect situations where TSC-based time might not be reliable. One of those checks, in particular, compares TSC time against the jiffies count, which is incremented by way of the timer tick. If, after ten seconds' worth of ticks, the number of TSC cycles seen differs from what would have been expected, the kernel concludes that the TSC is not stable and stops using it for time information.
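
The jiffies comparison amounts to a simple consistency check. The following is a rough sketch of that kind of test rather than the kernel's actual code; the function name and the 1% tolerance are assumptions chosen for illustration.

```python
def tsc_looks_stable(ticks, hz, tsc_delta, tsc_hz, tolerance=0.01):
    """Compare observed TSC cycles against what the tick count predicts.

    ticks     -- timer ticks counted over the measurement window
    hz        -- timer tick frequency (ticks per second)
    tsc_delta -- TSC cycles actually observed over the same window
    tsc_hz    -- nominal TSC frequency in cycles per second
    tolerance -- allowed relative deviation (hypothetical value)
    """
    expected_cycles = ticks * tsc_hz // hz
    return abs(tsc_delta - expected_cycles) <= expected_cycles * tolerance
```

With hz set to 250, ten seconds is 2500 ticks. A 2GHz TSC should then have advanced by about 20 billion cycles, and a CPU which spent the window at half speed would fail the check.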

Interesting things happen when the dynamic tick patch is thrown into the mix. With dynamic ticks, the periodic timer interrupt is turned off whenever there's nothing to be done in the near future, allowing the processor to remain idle for longer and consume less power. Once something happens, however, the jiffies count must be updated to reflect the timer ticks which were missed - something which is generally done by obtaining the time from another source. At best, this series of events defeats the test which ensures that the TSC is operating in a stable manner; at worst, it can lead to corrupted system time. Not a good state of affairs.
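
The catch-up step after an idle period can be sketched as follows. This is a simplified model, not the dynamic tick patch itself; the function name and the nanosecond bookkeeping are assumptions.

```python
NSEC_PER_SEC = 1_000_000_000

def catch_up_jiffies(jiffies, last_update_ns, now_ns, hz):
    """Advance the jiffies count by the ticks missed while idle.

    now_ns must come from a time source other than the tick itself.
    Returns the new jiffies value and the time it corresponds to.
    """
    tick_ns = NSEC_PER_SEC // hz
    missed = (now_ns - last_update_ns) // tick_ns
    return jiffies + missed, last_update_ns + missed * tick_ns
```

Since jiffies is rebuilt from the other time source, a later comparison of TSC cycles against jiffies no longer measures the TSC against independently counted ticks.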

For this reason, the recently-updated high-resolution timers and dynamic tick patch set includes a change which disables use of the TSC. It seems that the high-resolution timers and dynamic tick features are incompatible with the TSC - and that people configuring kernels must choose between those features and TSC-based timekeeping. Since the TSC does have real performance benefits, disabling it has predictably made some people unhappy, to the point that some would prefer to see the timer patches remain out of the kernel for now.

In response to the objections, Ingo Molnar has explained things this way:

We just observed that in the past 10 years no generally working TSC-based gettimeofday was written (and i wrote the first version of it for the Pentium, so the blame is on me too), and that we might be better off without it. If someone can pull off a working TSC-based gettimeofday() implementation then there's no objection from us.

Ingo has also posted a test program which demonstrates that time inconsistencies on TSC-based systems are common - at least, when multiple processors are in use.

Arjan van de Ven has suggested a "duct tape" solution which might work well enough "to keep the illusion alive." It involves setting up offsets and multipliers for each processor's TSC. Between the offsets (which could compensate for TSC drift between processors) and the multipliers (which adjust for frequency changes), some semblance of synchronized and accurate TSC-based time could be maintained - as long as the kernel is able to detect TSC-related events and adjust those values accordingly. No code which implements this idea has yet been posted, however.
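
Since no code has been posted, the following is purely a guess at what per-processor offsets and multipliers might look like. The class, the field names, and the choice to scale first and then add the offset are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class TscCorrection:
    """Hypothetical per-CPU correction values."""
    offset: int    # cycles to add, compensating for drift between CPUs
    mult_num: int  # multiplier numerator, compensating for frequency changes
    mult_den: int  # multiplier denominator

def normalized_cycles(raw_tsc, corr):
    """Map a raw per-CPU TSC reading onto a common, full-speed cycle count."""
    return raw_tsc * corr.mult_num // corr.mult_den + corr.offset

# CPU 0 runs at full speed with no drift; CPU 1 runs at half speed
# and lags by 5000 normalized cycles.
cpu0 = TscCorrection(offset=0, mult_num=1, mult_den=1)
cpu1 = TscCorrection(offset=5_000, mult_num=2, mult_den=1)
```

The arithmetic is the easy part. The kernel would still have to notice every frequency change and every halt and update these values at the right moment.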

The conversation faded out with no real conclusion, though, near the end, Thomas Gleixner did note that the complete disabling of the TSC was "overkill." The preferred solution, which he is working on, is to keep the system from going into the dynamic tick mode if there is no other reliable timer available. Once that code has been posted, it may be possible to have the full set: high-resolution timers, dynamic ticks, and fast clocks using the TSC.

Injecting faults into the kernel

Some kernel developers, doubtless, feel that their systems fail too often as it is; they certainly would not go out looking for ways to make more trouble. Others, however, are most interested in how their code behaves when things go wrong. As your editor recently discovered to his chagrin, error paths tend to be debugged rather less well than the "normal" code. One can try to anticipate possible failures and try to code the right response, but it can be hard to actually test that code. So error-handling paths can be incorrect (or missing) but the code will appear to work - until something blows up.

In an attempt to help test kernel error handling, Akinobu Mita has been working for some time on a framework for injecting faults into a running kernel. By causing things to go wrong occasionally, the fault injection code should help to ensure that error situations are handled - and handled correctly. This mechanism has found its way into 2.6.19-rc5-mm2 where, hopefully, it will be employed by developers to make sure that their code is bulletproof. Hopefully.

The framework can cause memory allocation failures at two levels: in the slab allocator (where it affects kmalloc() and most other small-object allocations) and at the page allocator level (where it affects everything, eventually). There are also hooks to cause occasional disk I/O operations to fail, which should be useful for filesystem developers. For both memory allocation and disk I/O failures, there is a flexible runtime configuration infrastructure, based on debugfs, which will let developers focus fault injection on a specific part of the kernel.

Your editor built a version of 2.6.19-rc5-mm2 with the fault injection capability turned on. For whatever reason, the configuration system insisted that the locking validator be enabled too; perhaps somebody injected a fault into the config scripts. In any case, the resulting kernel exports a directory (in debugfs) for each of the available fault injection capabilities.

So, for example, the slab allocation capability has a directory failslab. At system boot, failure injection is turned off; slab failures can be enabled by writing an integer value to the failslab/probability file. The value written there will be interpreted as the percent probability that any given allocation will fail; so writing "5" will cause a 5% failure rate. For situations where a failure rate of less than 1% (but greater than zero) is needed, there is a separate interval value which further filters the result. So a 0.1% failure rate could be had by setting interval to 1000 and probability to 100 - preferably in that order. There is also a times variable which puts an upper limit on the number of failures which will be simulated.
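
The way probability, interval, and times combine can be modeled in a few lines. This is a user-space sketch of the behavior described above, not the patch's code; the class, the use of -1 to mean "no limit," and the exact order of the checks are assumptions.

```python
import random

class FaultAttr:
    """Toy model of the probability/interval/times knobs."""

    def __init__(self, probability=0, interval=1, times=-1, seed=None):
        self.probability = probability  # percent chance of failure
        self.interval = interval        # only every Nth call is a candidate
        self.times = times              # failures remaining; -1 = unlimited (assumed)
        self._calls = 0
        self._rng = random.Random(seed)

    def should_fail(self):
        self._calls += 1
        if self.times == 0:
            return False                # failure budget exhausted
        if self._calls % self.interval != 0:
            return False                # filtered out by interval
        if self._rng.randrange(100) >= self.probability:
            return False                # lost the percentage roll
        if self.times > 0:
            self.times -= 1
        return True

def expected_failure_rate(probability, interval=1):
    """Long-run fraction of calls which fail, ignoring the times limit."""
    return probability / 100 / interval
```

Setting interval to 1000 and probability to 100 makes exactly every thousandth call fail, which gives the 0.1% rate mentioned above.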

As it happens, randomly injecting failures into the kernel as a whole does not necessarily lead to a lot of useful information for a developer, who is probably interested in the behavior of a specific subsystem. There is only so long that one can put up with basic shell commands failing while trying to make something happen in one particular driver. So there are a number of options which can be used to focus the faults on a particular part of the kernel. These include:

task-filter: if this variable is set to a positive value, faults will only be injected when specially-marked processes are running. To enable this marking, each process has a new flag (make-it-fail) in its /proc directory; setting that value to one will cause faults to be injected into that process.
address-start and address-stop: if these values are set, fault injection will be concentrated on the code found within the address range specified. As long as any entry within the call chain is inside that address range, the fault injection code will consider causing a failure.
ignore-gfp-wait: if this value is set to one, only non-waiting (GFP_ATOMIC) allocations will potentially fail. There is also an ignore-gfp-highmem option which will cause failures not to be injected into high-memory allocations.
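
The filters listed above can be read as a series of early exits before any failure is considered. The sketch below models that logic only; the function, its parameter names, and the treatment of the address range as half-open are assumptions, not details taken from the patch.

```python
def may_inject(*, task_filter=False, task_marked=False,
               address_start=0, address_stop=0, call_chain=(),
               ignore_gfp_wait=False, can_wait=False,
               ignore_gfp_highmem=False, is_highmem=False):
    """Return True if a fault may be considered for this allocation."""
    # task-filter: only specially-marked processes are eligible.
    if task_filter and not task_marked:
        return False
    # address-start/address-stop: some caller must lie inside the range.
    if address_stop > address_start:
        if not any(address_start <= addr < address_stop for addr in call_chain):
            return False
    # ignore-gfp-wait: allocations which may wait are left alone.
    if ignore_gfp_wait and can_wait:
        return False
    # ignore-gfp-highmem: high-memory allocations are left alone.
    if ignore_gfp_highmem and is_highmem:
        return False
    return True
```

An allocation which passes all of these filters is still subject to the probability, interval, and times settings.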

Various other options exist; there is also a set of boot options for turning on injection which might be useful for debugging early system initialization. The documentation file has the details. Also found in the documentation directory are a couple of scripts for concentrating faults on a specific command or module.

The end result of all this is a useful tool. One need not simply hope that the error recovery paths in a piece of kernel code will work properly; it is now possible to actually run them and see what happens. This should lead to a better tested, more robust kernel in the near future, and that can only be a good thing.

Toward a free Atheros driver

The Atheros family of wireless chipsets finds its way into a number of network adapters and laptop systems. It is a flexible and capable device, with one little limitation: there is no free Linux driver available. Linux support can be had via the freely-downloadable MadWifi driver, but, at the core of that driver, there is a binary-only "hardware access layer" (HAL) module which does much of the real work. This module has all of the problems associated with proprietary drivers: it cannot be audited or fixed, it cannot be improved, it is only available for the kernel versions and architectures supported by the manufacturer, etc. But, for Linux users, the choices are MadWifi or nothing.

A free Atheros HAL module called "ar5k," written by Reyk Floeter, has been in circulation for a couple of years; OpenBSD uses it. But this code has long been followed by allegations that it was improperly developed and potentially subject to copyright claims by Atheros. In the current climate, nobody wants to risk bringing possibly tainted code into the kernel; the potential consequences are just too severe. So, while the desire to support Atheros devices in Linux remains strong, the existing free HAL has not been considered for inclusion, and little work has been done toward free Atheros support.

Except that, as it turns out, work has been quietly happening in an unexpected place. The Software Freedom Law Center was asked by the ar5k developers to look at the development history of the code and come up with a pronouncement on whether it was legitimate (from a copyright law perspective) or not. On November 14, the SFLC produced its answer:

SFLC has made independent inquiries with the OpenBSD team regarding the development history of ar5k source. The responses received provide a reasonable basis for SFLC to believe that the OpenBSD developers who worked on ar5k did not misappropriate code, and that the ar5k implementation is OpenBSD's original copyrighted work.

This finding should clear the way for the entry of the free Atheros HAL into the Linux kernel - eventually. But there are a couple of problems which need to be overcome first.

One of those is the general level of upheaval in the Linux wireless subsystem. The developers still intend to move over to the Devicescape stack and to get that code into the mainline, but work remains to be done in that area. A new wireless driver which does not work with Devicescape will have a harder path into the kernel. There is an effort to move MadWifi over to Devicescape (it's called "DadWifi"), so that might be the quickest path for Atheros support to get into the kernel.

The other problem, however, is that code based on the HAL concept tends to be unpopular at best. A HAL is typically seen as an unnecessary abstraction layer between the driver and the hardware which serves to obscure what's really going on while adding no real value of its own. So developers who propose HAL-based drivers are usually told to go away and come back once the HAL is gone. There is no real reason to expect things to happen differently this time around.

But, even if it can't be used directly, the ar5k code is now fair game for reference and eventual adaptation into a Linux driver. There are enough developers out there with an interest in making Atheros adapters work that the chances of this work getting done in the (relatively) near future are good. The list of devices which are not supported by Linux is about to get shorter.

Patches and updates

Andrew Morton 2.6.19-rc5-mm2

Adrian Bunk Linux 2.6.16.32

Adrian Bunk Linux 2.6.16.32-rc1

Vivek Goyal [RFC] [PATCH 0/16] x86_64: Relocatable bzImage Support (V2)

Alan Cox HZ: 300Hz support

Eric W. Biederman sysctl: Undeprecate sys_sysctl (take 2)

Matthew Wilcox Introduce mutex_lock_timeout

Tejun Heo direct-io: unify asyn/sync completion paths and fix completion bugs

Sebastien Dugue - AIO completion signal notification

Evgeniy Polyakov kevent: Generic event handling mechanism.

Thomas Gleixner [patch 00/21] Highres / dynticks drop in replacement for 2.6.19-rc5-mm1

Suleiman Souhlal Make the TSC safe to be used by gettimeofday().

Suleiman Souhlal Introduce a vmonotonic_clock() vsyscall.

Gautham R Shenoy Cpu-Hotplug: Use per subsystem hot-cpu mutexes.

Christoph Lameter sched_domain balancing via softirq V4

Junio C Hamano GIT 1.4.3.5

Junio C Hamano GIT 1.4.4

Marco Costalba qgit-1.5.3

Josef Sipek Git Queues 0.10

David Brownell arch-neutral GPIO calls

Stelian Pop Apple Motion Sensor driver

Ivo van Doorn RFkill - Add support for input key to control wireless radio

Matthew Wilcox Asynchronous scanning for FC/SAS version 3

Paul Moore Generic Netlink HOW-TO based on Jamal's original doc

David Howells Permit filesystem local caching and NFS superblock sharing

Adrian Bunk the scheduled removal of the frame diverter

Wu Fengguang Adaptive readahead V16

Thomas Graf packet mark & fib rules work

Gerrit Renker [NET]: Supporting UDP-Lite (RFC 3828) in Linux

YOSHIFUJI Hideaki IPv6 Updates

Johannes Berg wireless notes / pre d80211 merge

Zack Weinberg Syslog permissions, revised

James Morris SELinux updates for 2.6.20

James Morris SELinux: Add support for DCCP

Serge E. Hallyn security: introduce file caps

Avi Kivity ANNOUNCE: new kvm userspace release

Kirill Korotaev BC: resource beancounters (v6) (with userpages reclamation + configfs)

Balbir Singh RSS controller for containers

Etienne Lorrain Gujin PC graphic bootloader version 1.6
