Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.22-rc7, released by Linus on July 1. "It's hopefully (almost certainly) the last -rc before the final 2.6.22 release, and we should be in pretty good shape. The flow of patches has really slowed down and the regression list has shrunk a lot." This is the last chance to test 2.6.22 and find bugs before they slip into the final release.

As of this writing, about 60 patches have been merged into the mainline git repository after -rc7. They are mostly fixes, but there is also the removal of a large set of private ioctl() functions from the libertas (OLPC) wireless driver.

The current -mm tree is 2.6.22-rc6-mm1. Anybody wanting to build and test this tree should certainly read Andrew's notes at the top of the announcement. Recent changes to -mm include kgdb support for several architectures, tickless support for the x86_64 architecture, the ability to force-enable the HPET timer even when the BIOS leaves it disabled, an updated file POSIX capabilities patch, and Intel IOMMU support.

Comments (none posted)

Kernel development news

Quote of the week

SELinux orders a beer object
AppArmor order a /beer
Hilary says "You are both under 21 you can't"
SELinux orders a shandy object
AppArmor orders a /shandy

SELinux is refused because the shandy mixer opened a beer object and shandy inherited beer typing
AppArmor gets drunk because /shandy and /beer are clearly different

-- Alan Cox

Comments (6 posted)

OLS: Three talks on power management

Len Brown can only be a glutton for punishment; he is, after all, the maintainer of the Linux ACPI subsystem. That is a difficult position to be in: ACPI involves getting into the BIOS layer, an area of system software which is not always known for careful, high-quality work. Supporting ACPI is a complex task which, among other things, requires the embedding of a specialized interpreter within the kernel, a hard sell at best. Even with

that background in mind, one must wonder just how much masochism is required to lead one to deliver three separate talks at the 2007 Ottawa Linux Symposium. That is just what Len did, however; the end result was a good view into several aspects of the power management problem.

Getting more from tickless

The first talk (on the tickless kernel) was supposed to be given by Suresh Siddha, who was unable to attend the event. The dynamic tick patches have been covered here before. Suresh/Len's talk was not really about how these patches work, but, instead, about the work which remains to be done to take full advantage of the tickless design. It seems that the work which has been done so far is just the beginning.

The problem is that, on a system used by Suresh and company, the average processor sleep time was still less than 1ms even after the dynamic tick code was enabled. Given that one of the driving reasons for dynamic tick was to let the processor sleep for long periods of time - thus saving power - this is a disappointing result. It turns out that there is a lot which can be done to improve the situation, though.

Step number one is to address a kernel-space problem: there are a lot of active kernel timers which tend to spread out over time. As a result, the kernel wakes up much more often than it would if the timers were sufficiently well coordinated to expire at the same time whenever possible. As it happens, many kernel timers do not need great precision; a timer which fires some number of milliseconds later than scheduled is not a problem. So, if the kernel could defer some timers to fire at the same time as others, it can reduce the number of wakeups. The deferrable timers patch does exactly that; the round_jiffies() function added in 2.6.19 can also help the kernel line up events. Adding this code brought the average sleep time up to 20ms, with the system handling 90 interrupts per second.

Next is the problem of hardware timers. On the i386 architecture, the preferred timer is the local APIC (LAPIC) timer, which is built into the processor and very fast to program. Unfortunately, putting the processor into a deep sleep also puts the LAPIC timer to sleep, a situation Len compared to unplugging one's alarm clock before going to bed. In either case, oversleeping can be the unwanted result. The programmable interval timer (PIT) remains awake and is easily used, but it has a maximum event time of 27ms. If one wants the processor to sleep for longer than that, another solution must be found. That solution is the high-precision event timer (HPET), which has a maximum interval of at least three seconds. Getting access to the HPET can be hard, though; good BIOS support is spotty at best and the HPET is often disabled. If it can be forced on, however, the system can go to an average sleep period of about 56ms, handling 32 interrupts per second.

Even better is to get the HPET out of the "legacy mode" currently used by Linux. This mode is simple to use, but it requires the rebroadcasting of timer interrupts on multiprocessor systems. But the HPET can work with per-CPU channels, eliminating this problem. The result: average sleep time grows to 74ms.

At this point, the problem moves to user space. Since the release of powertop, there has been a lot of progress in this area; user-space applications which cause frequent wakeups unnecessarily stand out immediately and can be fixed. But, as Len noted, "user space still sucks."

ACPI myths

One gets the sense that Len is a little tired of people complaining about ACPI in Linux. His response was a talk on "ten ACPI myths" - though the list had grown to twelve by then.

#1: There is no benefit to enabling ACPI. Len's answer to this had two parts, the first of which being that, increasingly, there is no alternative. The older APM interface is deprecated, and, in particular, Microsoft's Vista has removed APM support altogether. So, soon, there will be no hardware support for APM at all; it is a dead standard. The MPS standard (used for discovering processors) is also old and dying. Like it or not, ACPI is needed to be able to make use of one's hardware.

On the positive side, using ACPI gives better access to hardware features like software-enabled power, sleep, and lid buttons. Smart battery status information becomes available, as well as the potential for reduced power consumption and better battery life. True hotplug and (especially) docking support also become possible with ACPI.

#2: Suspend-to-disk problems are ACPI's fault. In fact, ACPI is a very small part of the suspend-to-disk process - everything else is in other parts of the kernel code. If you have suspend-to-disk problems, suggests Len, "complain to Pavel [Machek], not me."

#3: If the extra buttons don't work, it's ACPI's fault. The issue here is that support for "hotkeys" is not actually a part of the ACPI specification. All of those extra buttons found on laptops are vendor-specific added features. The reverse-engineered drivers currently found in the kernel are a "heroic effort," but they should not be necessary. Vendors should be supplying drivers for their hardware.

#4: Boot problems with ACPI enabled are ACPI's fault. Len allows that this one might just be true some of the time. But disabling ACPI at boot-time also disables other hardware features - the IO-APIC in particular. So any problems associated with those other parts of the system will be masked by turning off ACPI. It looks like ACPI was the actual problem, but the truth is more complicated.

#5: ACPI issues are due to sub-standard platform BIOS. It turns out that there are three general sources of ACPI incompatibilities. Just one of them is the BIOS violating the ACPI specification; incompatibilities which don't break Windows will often slip through the testing process. The firmware developer kit produced by Intel can help in this regard. Another source of problems is differing interpretations of the specification, which is a long and complex document. The Linux ACPI developers have been working to help clarify the specification when this sort of problem arises. Finally, there can also simply be bugs in the Linux-specific code.

#6: The Linux community cannot help to improve the ACPI specification. In fact, the ACPI team has been submitting improvements, mostly in the form of "specification clarifications." Many of those have been incorporated and shipped with specification updates.

#7: The ACPI code changes a lot but is not getting better. Intel has put together a test suite with over 2000 tests; ACPI changes must now pass that suite before being merged. The number of new bug reports has been dropping - though, perhaps, more slowly than one might like.

#8: ACPI is slow and bad for high-performance CPU governors. The ACPI interpreter is not used in performance-critical paths, and, thus, cannot be slowing things down. ACPI's role is in the setup and configuration process.

#9: The speedstep-centrino governor is faster than acpi-cpufreq. The acpi-cpufreq governor has seen considerable improvements, and is now able to access MSRs in a fast and (more importantly) supportable way. So its performance is where it should be, and the speedstep-centrino governor is scheduled for removal.

#10: More CPU idle power states is better. This may be true for any given processor, but you cannot compare processors on the basis of how many idle states they provide. All that really matters is how much power you save when you use those states.

#11: Processor throttling will save energy. The problem here is a confusion of "power" and "energy." A throttled processor may draw less power, but it has to run longer to accomplish the same work. So throttling the processor (while maintaining the same voltage) may have the effect of increasing energy use rather than reducing it. The better approach is almost always to run at the fastest clock frequency afforded by the current voltage level and get the work done quickly; Len characterized this as the "race to idle."

There are second-order effects to consider; in particular, batteries will last longer if they are discharged over longer periods of time. A throttled processor may also run cooler, allowing fans to be turned off. Throttling may be necessary for temperature regulation. But, from an energy-savings perspective, these are truly second-order effects.

#12: I can't contribute to improving ACPI in Linux. Like any other project, Linux ACPI would love to have more developers. And, failing that, one can always test kernels and report bugs. There is, in reality, plenty of opportunity for improving the ACPI code.

Cool-hand Linux

Len's final talk moved away from power consumption toward its effects - the generation of heat, in particular. The creation of excess heat is not a welcome behavior in any device, but it becomes especially undesirable in handheld devices. Devices which make the user's hand sweat are less fun to use; those which get too hot to hold comfortably can be entirely unusable. So temperature management is important. But the nature of these devices can make thermal regulation tricky: there's no room for fans in a Linux-powered cellular phone, and the dissipation of heat can be hard in general.

The ACPI 3.0 specification includes a complicated thermal model. The device is divided up into zones, and each component has its thermal contribution to each zone characterized. Implementing this specification is a complex and difficult task - enough so that the Linux ACPI developers have no intention of doing it. They will, instead, focus on something simpler.

That something is the ACPI 2.0 thermal model. It includes thermal zones, each of which comes with temperature sensors and a set of trip points. The "critical shutdown" trip point is set somewhere just short of where the device begins to melt; should things get that warm, the device just needs to turn itself off as quickly as possible. Various other trip points will be encountered first; they should bring about increasingly strong measures for controlling temperature. These can include turning on fans (if they exist), throttling devices, or suspending the system to disk. ACPI 2.0 includes an embedded controller which monitors the system's temperature sensors and sends events to the CPU when something interesting happens.

The in-progress thermal regulation code just uses the existing critical shutdown mechanism built into ACPI. There is also support for some of the passive trip points which bring about CPU throttling. For the non-processor thermal zones, though, the best thing to do is to let user space figure out how to respond, so that's what the ACPI code will do. There will be a netlink interface through which temperature events can be sent, and a set of sysfs directories for reading sensor values. The sysfs tree will also include control files which can be used by a user-space daemon to throttle specific devices in response to temperature events.

In the end, the kernel is really just a conduit, conducting events and control settings between the components of the device and user space. There were some questions on whether there will be a standardized set of sysfs knobs for every device; the answer appears to be "no." Each device is different, with its own control parameters; it is hard to create any sort of standard which can describe them all. Beyond that, the target environment is embedded devices, each of which is unique. It is expected that each device will have its own special-purpose management daemon designed especially for it, so there is no real benefit in trying to make things generic.

The impression one gets from all these talks is that quite a bit is happening in the power management area - a part of Linux which, for some time, has been seen as falling short of what it really needs to be. The increasing use of Linux in embedded systems can only help in this regard; there are a number of vendors who have a strong interest in improved support for intelligent use of power. Given time and continued work, power management may soon be one of those past problems which is no longer an issue.

Comments (9 posted)

CFS group scheduling

Ingo Molnar's completely fair scheduler (CFS) patch continues to develop; the current version, as of this writing, is v18. One aspect of CFS behavior is seen as a serious shortcoming by many potential users, however: it only implements fairness between individual processes. If 50 processes are trying to run at any given time, CFS will carefully ensure that each gets 2% of the CPU. It could be, however, that one of those processes is the X server belonging to Alice, while the other 49 are part of a massive kernel build launched by Karl the opportunistic kernel hacker, who logged in over the net to take advantage of some free CPU time. Assuming that allowing Karl on the system is considered fair at all, it is reasonable to say that his 49 compiler processes should, as a group, share the processor with Alice's X server. In other words, X should get 50% of the CPU (if it needs it) while all of Karl's processes share the other 50%.

This type of scheduling is called "group scheduling"; Linux has never really supported it with any scheduler. It would be nice if a "completely fair scheduler" to be merged in the future had the potential to be completely fair in this regard too. Thanks to work by Srivatsa Vaddagiri and others, things may well happen in just that way.

The first part of Srivatsa's work was merged into v17 of the CFS patch. It creates the concept of a "scheduling entity" - something to be scheduled, which might not be a process. This work takes the per-process scheduling information and packages it up within a sched_entity structure. In this form, it is essentially a cleanup - it encapsulates the relevant information (a useful thing to do in its own right) without actually changing how the CFS scheduler works.

Group scheduling is implemented in a separate set of patches which are not yet part of the CFS code. These patches turn a scheduling entity into a hierarchical structure. There can now be scheduling entities which are not directly associated with processes; instead, they represent a specific group of processes. Each scheduling entity of this type has its own run queue within it. All scheduling entities also now have a parent pointer and a pointer to the run queue into which they should be scheduled.

By default, processes are at the top of the hierarchy, and each is scheduled independently. A process can be moved underneath another scheduling entity, though, essentially removing it from the primary run queue. When that process becomes runnable, it is put on the run queue associated with its parent scheduling entity.

When the scheduler goes to pick the next task to run, it looks at all of the top-level scheduling entities and takes the one which is considered most deserving of the CPU. If that entity is not a process (it's a higher-level scheduling entity), then the scheduler looks at the run queue contained within that entity and starts over again. Things continue down the hierarchy until an actual process is found, at which point it is run. As the process runs, its runtime statistics are collected as usual, but they are also propagated up the hierarchy so that its CPU usage is properly reflected at each level.

So now the system administrator can create one scheduling entity for Alice, and another for Karl. All of Alice's processes are placed under her representative scheduling entity; a similar thing happens to all of the processes in Karl's big kernel build. The CFS scheduler will enforce fairness between Alice and Karl; once it decides who deserves the CPU, it will drop down a level and perform fair scheduling of that user's processes.

The creation of the process hierarchy need not be done on a per-user basis; processes can be organized in any way that the administrator sees fit. The grouping could be coarser; for example, on a university machine, all students could be placed in one group and faculty in another. Or the hierarchy could be based on the type of process: there could be scheduling entities representing system daemons, interactive tools, monster cranker CPU hogs, etc. There is nothing in the patch which limits the ways in which processes can be grouped.

One remaining question might be: how does the system administrator actually cause this grouping to happen? The answer is in the second part of the group scheduling patch, which integrates scheduling entities with the process container mechanism. The administrator mounts a container filesystem with the cpuctl option; scheduling groups can then be created as directories within that filesystem. Processes can be moved into (and out of) groups using the usual container interface. So any particular policy can be implemented through the creation of a simple, user-space daemon which responds to process creation events by placing newly-created processes in the right group.

In its current form, the container code only supports a single level of group hierarchy, so a two-level scheme (divide users into administrators, employees, and guests, then enforce fairness between users in each group, for example) cannot be implemented. This appears to be a "didn't get around to it yet" sort of limitation, though, rather than something which is inherent in the code.

With this feature in place, CFS will become more interesting to a number of potential users. Those users may have to wait a little longer, though. The 2.6.23 merge window will be opening soon, but it seems unlikely that this work will be considered ready for inclusion at that time. Maybe 2.6.24 will be a good release for people wanting a shiny, new, group-aware scheduler.

Comments (5 posted)

The ongoing fallocate() story

The proposed fallocate() system call, which exists to allow an application to preallocate blocks for a file, was covered here back in March. Since then there has been quite a bit of discussion, but there is still no fallocate() system call in the mainline - and it's not clear that there will be in 2.6.23 either. There is a new version of the fallocate() patch in circulation, so it seems like a good time to catch up with what is going on.

Back in March, the proposed interface was:

    long fallocate(int fd, int mode, loff_t offset, loff_t len);

It turns out that this specific arrangement of parameters is hard to support on some architectures - the S/390 architecture in particular. Various alternatives were proposed, but getting something that everybody liked proved difficult. In the end, the above prototype is still being used. The S/390 architecture code will have to do some extra bit shuffling to be able to implement this call, but that apparently is the best way to go.

That does not mean that the interface discussions are done, though. The current version of the patch now has four possibilities for mode:

FA_ALLOCATE will allocate the requested space at the given offset. If this call makes the file longer, the reported size of the file will be increased accordingly, making the allocated blocks part of the file immediately.
FA_RESV_SPACE preallocates blocks, but does not change the size of the file. So the newly allocated blocks, if past the end of the file, will not appear to be present until the application writes to them (or increases the size of the file in some other way).
FA_DEALLOCATE returns previously-allocated blocks to the system. The size of the file will be changed if the deallocated blocks are at the end.
FA_UNRESV_SPACE returns the blocks to the system, but does not change the size of the file.

As an example of how the last two operations differ, consider what happens if an application uses fallocate() to remove the last block from a file. If that block was removed with FA_DEALLOCATE, a subsequent attempt to read that block will return no data - the offset where that block was is now past the end of the file. If, instead, the block is removed with FA_UNRESV_SPACE, an attempt to read it will return a block full of zeros.

It turns out that there are some differing opinions on how this interface should work. A trivial change which has been requested is that the FA_ prefix be changed to FALLOC_ - this change is likely to be made. But it seems there's a number of other flags that people would like to see:

FALLOC_ZERO_SPACE would write zeros to the requested range - even if that range is already allocated to the file. This feature would be useful because some filesystems can quickly mark the affected range as being uninitialized rather than actually writing zeros to all of those blocks.
FALLOC_MKSWAP would allocate the space, mark it initialized, but not actually zero out the blocks. The newly-allocated blocks would thus still contain whatever data the previous user left there. This operation, which would clearly have to be privileged, is intended to make it possible to create a swap file in a very quick way. It would require very little in the way of in-kernel memory allocations to implement, making it a useful way to add an emergency swap file to a system which has gone into an out-of-memory condition.
FALLOC_FL_ERR_FREE would be an additional flag which would affect error handling; in particular, it would control behavior when the filesystem runs out of space part way through an allocation request. If this flag is set, the blocks which were successfully preallocated would be freed; otherwise they would be left in place. There is some opposition to this flag; it may be left out in favor of an official "all or nothing" policy for preallocations.
FALLOC_FL_NO_MTIME and FALLOC_FL_NO_CTIME would prevent the filesystem from updating the modification times associated with the file.

All told, it's a significant number of new features - enough that some people are starting to wonder if fallocate() is the right approach after all. Christoph Hellwig, in particular, has started to complain; he suggests adding something small which would be able to implement posix_fallocate() and no more. Block deletion, he says, is a different function and should be done with a different system call, and the other features need more thought (and aggressive weeding). So it's unclear where this patch set will go and whether it will be considered ready for 2.6.23.

Comments (2 posted)

Patches and updates

Kernel trees

Linus Torvalds Linux 2.6.22-rc7 ?

Andrew Morton 2.6.22-rc6-mm1 ?

Architecture-specific

Geert Uytterhoeven PS3 Storage Drivers for 2.6.23, take 4 ?

Core kernel code

Davide Libenzi sys_indirect RFC - sys_indirect introduction ?

Steven Rostedt Convert all tasklets to workqueues V3 ?

Development tools

Mark Seger announcing collectl - a new performance monitoring tool ?

Subrata Modak [ANNOUNCE] The Linux Test Project has been Released for June 2007, ?

Mathieu Desnoyers Linux Kernel Markers ?

Satyam Sharma netconsole: Multiple targets and dynamic reconfigurability ?

Device drivers

Rodolfo Giometti LinuxPPS (with new syscalls API) - new version ?

Tom Zanussi Driver Tracing Interface (DTI) code ?

Tom Zanussi Driver Tracing Interface (DTI) Documentation ?

Michal Januszewski uvesafb: a general description ?

Jeff Garzik What's in netdev-2.6.git? ?

Jeff Garzik What's in libata-dev.git? ?

Stephen Hemminger fujtisu application panel driver ?

Holger Schurig add support for Marvell 8385 CF cards ?

Ben Dooks AX88796 network driver ?

Documentation

Alan Cox Libata PATA status ?

Filesystems and block I/O

Mingming Cao Ext4 patches for 2.6.22-rc6 ?

J. Bruce Fields vfs lease api ?

Dan Williams An evolutionary change to the raid456 queuing model ?

Dan Williams [RFC PATCH 0/2] raid5: 65% sequential-write performance improvement, stripe-queue take2 ?

Memory management

Christoph Lameter [PATCH] Containment measures for slab objects on scatter gather lists ?

Davide Libenzi MAP_NOZERO v2 - VM_NOZERO/MAP_NOZERO early summer madness ?

Networking

Miklos Szeredi rewrite AF_UNIX garbage collector ?

PJ Waskiewicz NET: Multiple queue hardware support ?

Patrick McHardy : Pending netfilter patches ?

Virtualization and containers

Vaidyanathan Srinivasan [RFC][PATCH 0/3] Containers: Integrated RSS and pagecache control v5 ?

Martin Schwidefsky resend: guest page hinting version 5. ?

Rusty Russell Virtio draft IV ?

Miscellaneous

Nigel Cunningham Suspend2 is getting a new name. ?

Pablo Neira Ayuso Release conntrack-tools 0.9.4 ?

Karel Zak util-linux-ng 2.13-rc1 ?

Page editor: Jonathan Corbet
Next page: Distributions>>