Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.23-rc2, released by Linus on August 3. "So I tried to hold people to the merge window, and said no to a few pull requests, but this whole '-rc2 is the new -rc1' thing is a disease, and not only is -rc2 late, it's bigger than it should be. Oh, well." Along with a whole lot of fixes, -rc2 adds extensive documentation to the Lguest code, a mechanism where kernel-space code can request notification when it is about to be preempted from the CPU, new configuration options for software suspend and hibernation, the removal of support for SuperH sh73180 and 7300 CPUs, AMD Geode LX framebuffer support, the removal of the arm26 port, and a TCP congestion control API change. See the short-form changelog for details or the full changelog for lots of details.

Roughly 50 changesets have been merged into the mainline repository since -rc2.

The current stable 2.6 kernel remains The update is in review as of this writing, and may be released as soon as August 9. It contains 84 fixes for problems all over the tree.

For older kernels: was released on August 4 with a fair number of important fixes.

Kernel development news

Quotes of the week

I don't doubt at all that virtualization is useful in some areas. What I doubt rather strongly is that it will ever have the kind of impact that the people involved in virtualization want it to have. It would appear that virtualization is the "message-passing microkernel" of this decade, and that people have a really hard time accepting that the reason operating systems still basically look 100% the same today as they did almost forty years ago, is that that is simply a very practical arrangement!
-- Linus Torvalds

In Linux we never ever assume a driver is working simply because the hardware vendor tested it. A decade of real world experience PROVES precisely the opposite -- getting code out into the world early and often repeatedly turned up problems not seen in hardware vendor's testing.
-- Jeff Garzik

Tripping over trip points

By Jonathan Corbet
August 7, 2007
Contemporary processors have an interesting problem: if they operate at their full rated capacity for extended periods of time, they run a real risk of heating to the point that they let the blue smoke out and never run again. To avoid this kind of problem, processors (and other components) are instrumented with temperature sensors. The BIOS programs the sensors with specific "trip points" - temperatures where things will happen to keep the system from overheating. At a given trip point, the system might turn on the fan, throttle the processor, or, if disaster is imminent, shut the system down hard.

The Linux ACPI subsystem provides the ability to query these trip points; the relevant virtual files can be found under /proc/acpi/thermal_zone. Your editor's laptop, for example, reveals that it is set to throttle the processor at 86°C and to pull the plug at 91°. Traditionally, the ACPI code has also allowed a suitably privileged user to change those trip points by writing new values to the /proc files. That capability no longer exists, though; it was removed in the 2.6.22 kernel.

Users are now starting to complain about that change. They feel that the BIOS-set trip points on some systems are positioned incorrectly, resulting in systems that run more slowly than they think they should, fans which come on at the wrong time, and so on. Naturally, they feel that the removal of the trip-point override feature has reduced the functionality of their systems.

ACPI maintainer Len Brown responds that the override feature is a bad idea for a number of reasons. At the top of the list is the fact that the system cannot actually change the hardware trip points. All it can do is disable them. Then the processor must take over by polling the temperature sensors itself and responding when its software trip points are reached. Should that polling and response fail to happen for any reason, there is a real possibility that the hardware could be damaged. Meltdowns could also easily occur if the trip points are set incorrectly, leading to "Linux destroyed my laptop" postings echoed across the net.
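
As a rough illustration of what that software polling involves, the user-space C sketch below compares one sensor reading against a pair of trip points. The type and function names are invented for this example and are not ACPI or kernel interfaces; the thresholds used in practice would be whatever the system reports, such as the 86°C and 91°C values mentioned above.

```c
#include <assert.h>

/* Hypothetical action codes; not part of any kernel interface. */
enum thermal_action { THERMAL_NONE, THERMAL_THROTTLE, THERMAL_SHUTDOWN };

/* Software trip points, in degrees Celsius. */
struct trip_points {
	int passive;	/* throttle the processor at or above this */
	int critical;	/* shut the system down at or above this */
};

/* Decide what a single polling pass should do for a given reading. */
enum thermal_action check_trip_points(const struct trip_points *tp, int temp_c)
{
	/* A critical point below the throttling point makes no sense. */
	assert(tp->passive <= tp->critical);

	if (temp_c >= tp->critical)
		return THERMAL_SHUTDOWN;
	if (temp_c >= tp->passive)
		return THERMAL_THROTTLE;
	return THERMAL_NONE;
}
```

A check like this only protects the hardware for as long as something keeps calling it; if the polling stalls, nothing reacts to a rising temperature, which is the failure mode Len is worried about.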

On top of that, the BIOS can change the trip points at any time for reasons of its own. Many of the use cases for trip-point overrides (controlling when fans go on and off, for example) are better done by having a user-space daemon control fan operation directly. And the truth of the matter is that overriding trip points is usually (Len would say always) an inappropriate response to problems which are better solved somewhere else. When the issue was discussed in May, he summarized it this way:

The fact that the trip-points are writable has obscured, rather than clarified, the actual causes of the failures. No less than 4 people in that bug report declared that cleaning the dust out of their fan fixed the root cause. A bunch more said that the issues went away when they stopped using ubuntu's user-space power save daemon.

There are a couple more with broken active fan control -- which also gets obscured rather than clarified by over-riding trip points.

The remaining problems, says Len, are most likely not present when Windows is running on the affected hardware. And, he says, Windows is highly unlikely to be overriding the trip points. The conclusion is that Linux is doing something wrong in its thermal management on those systems. He would much rather find and fix the real problem than hide it through use of trip-point overrides.

In the end, according to Len, there has never yet been a bug report which suggests that Linux should be messing with trip points in this way. This is a clear challenge for anybody who misses the trip-point override feature: send in a suitably documented report showing the problem that this feature solved. If the override feature truly turns out to be necessary, it may just come back - but it may just happen that a fix for the actual problem goes in instead.

Smack for simplified access control

By Jake Edge
August 8, 2007

SELinux provides a comprehensive security solution for Linux, but it is large and complex. A much simpler approach is taken by the Simplified Mandatory Access Control Kernel (Smack), a patch posted to linux-kernel by Casey Schaufler. Like SELinux, Smack implements Mandatory Access Control (MAC), but it purposely leaves out the role based access control and type enforcement that are major parts of SELinux. Smack is geared towards solving smaller security problems than SELinux, requiring much less configuration and very little application support.

Smack allows an administrator to define labels, 1-7 characters in length, for kernel objects. Labels on objects are compared with the labels of a task that tries to access them. By default, access is only allowed if the labels match. There is a set of Smack-reserved labels that follow a different set of rules, which allows most system objects and processes to be unaffected by Smack restrictions. By default, Smack does not get in the way of the OS, allowing the administrator to concentrate on just the users and processes they want to secure.

Smack uses filesystem extended attributes to store labels on files; administrators set the labels using the attr command. The security.SMACK64 attribute is used to store the Smack label on each file, so setting /dev/null to have the Smack-reserved "star" label would look like:

    attr -S -s SMACK64 -V '*' /dev/null

For networks, NetLabel is used to set CIPSO labels and domains of interpretation for sockets, allowing Smack systems to interoperate in those strictly controlled networking environments.

An administrator can add rules, but there is no support for wildcards or regular expressions; each rule must specify a subject label, object label and the access allowed explicitly. The access types are much like the traditional UNIX rwx bits, with the addition of an a bit for append. For configuration, Smack uses the SELinux technique of defining a filesystem that can be mounted, smackfs. Typically, it will be mounted as /smack, providing various files that can be read or written, to govern Smack operation. For example, Smack access rules are written to /smack/load; to change rules, one just writes a new set of access permissions for the subject-object pair.

An example, one of several provided in the patch announcement, uses the standard security levels for government documents. Smack labels are defined for each level: Unclass for unclassified, C for classified, S for secret, and TS for top secret. Then, with a handful of rules:

        C        Unclass       rx
        S        C             rx
        S        Unclass       rx
        TS       S             rx
        TS       C             rx
        TS       Unclass       rx

the traditional hierarchy of access is defined. Because of the Smack defaults, Unclass will only be able to access data with that same label, whereas because of the rules above, TS can access S, C and Unclass data.

Note that there is no transitivity in Smack rules: just because S can access C and TS can access S, that does not mean that TS can access C. That rule must be explicitly given. Also, because no write permissions have been given, tasks at each level can only write data with their own label. So secret tasks write secret data and so on. Files will inherit the label of the task that creates them, with Smack ensuring that the filesystem attribute is set. They will retain that label unless it is explicitly reset by an administrator using the attr command.
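
The lookup described here can be sketched in a few lines of user-space C. This is not code from the Smack patch: the names are invented for illustration, the Smack-reserved labels are left out, and the sketch assumes that matching labels grant every kind of access. It loads the six rules from the example and shows that only an exact subject-object pair counts.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical access bits: the rwx bits plus "a" for append. */
#define ACC_READ	1
#define ACC_WRITE	2
#define ACC_EXEC	4
#define ACC_APPEND	8

struct rule_sketch {
	const char *subject;
	const char *object;
	int access;
};

/* The six rules from the classification example. */
static const struct rule_sketch rules[] = {
	{ "C",  "Unclass", ACC_READ | ACC_EXEC },
	{ "S",  "C",       ACC_READ | ACC_EXEC },
	{ "S",  "Unclass", ACC_READ | ACC_EXEC },
	{ "TS", "S",       ACC_READ | ACC_EXEC },
	{ "TS", "C",       ACC_READ | ACC_EXEC },
	{ "TS", "Unclass", ACC_READ | ACC_EXEC },
};

/* Return 1 if a task labeled subject may do all of request to object. */
int access_allowed(const char *subject, const char *object, int request)
{
	size_t i;

	assert(subject != NULL && object != NULL);

	/* Default: matching labels are allowed (assumed: for any access). */
	if (strcmp(subject, object) == 0)
		return 1;

	/* Otherwise only a rule for this exact pair counts; no chaining. */
	for (i = 0; i < sizeof(rules) / sizeof(rules[0]); i++) {
		if (strcmp(rules[i].subject, subject) == 0 &&
		    strcmp(rules[i].object, object) == 0)
			return (rules[i].access & request) == request;
	}
	return 0;
}
```

With these rules, access_allowed("TS", "C", ACC_READ) succeeds only because the TS-C rule is listed; deleting that entry would deny the access even though TS can read S and S can read C.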

A patched version of sshd is available from Schaufler's homepage which allows an administrator to assign labels to users. Those labels get set on the user's shell and terminal device as they log into the system, forcing the user to follow the rules established for their label. A patched version of ls is also available so that it can display the labels associated with files.

Smack is useful for limiting user and specific process access to various resources, but it is not meant to be as general purpose as SELinux. Constructing a set of Smack labels and rules governing system processes, network services and the like, to restrict their access as SELinux does, would be difficult. For administrators needing to secure those services, SELinux is probably a better tool, but for simple compartmentalization, Smack may well suffice.

Newer, newer NAPI

By Jonathan Corbet
August 7, 2007
Last December, LWN looked at a proposal to rework the NAPI interface used for packet reception in high-bandwidth network drivers. Since then, the interface has gone through some changes, but now appears to be in something close to its final form. Anybody who maintains a NAPI-capable network driver will need to adapt to the new API; in many cases the changes will be simple, but New-NAPI offers some added features which may be of value to drivers of complicated hardware.

The core idea behind the NAPI interface is that, on a busy network, the kernel does not need to be interrupted every time a network packet arrives. Instead, the kernel can simply poll occasionally in the sure knowledge that packets will be there waiting. Your editor likes to compare packet receive interrupts with the beeps we all had, once upon a time, to let us know when email had arrived. Few of us use those beeps anymore; we have no doubt that there will be email waiting whenever we see fit to look for it. Like us, the kernel can do without unneeded distractions; that is especially true when those distractions can take the form of thousands of interrupts every second.

There are other advantages to the NAPI approach. If the networking subsystem is overwhelmed and must drop packets, NAPI makes it possible for them to be dropped before they are ever fed into the stack. For various reasons, packet reordering tends to be less of a problem with NAPI as well.

The new napi_struct patch set (currently at version 5), like its predecessor, introduces a new structure for controlling packet reception:

    struct napi_struct {
        struct list_head    poll_list;
        unsigned long       state;
        int                 weight;
        int                 quota;
        int                 (*poll)(struct napi_struct *, int);
        /* Netpoll-related fields omitted */
    };

This structure is no longer part of the net_device structure, though; instead, drivers are expected to allocate it separately. Usually it will be part of whatever larger structure the driver uses to represent the device internally. One of the main advantages of this approach is that device drivers can, if need be, create more than one napi_struct structure for a given device. Contemporary hardware can support multiple receive queues with nifty features like CPU affinity and flow separation; multiple NAPI structures make it easier to use those queues efficiently.

Drivers need not fill in the fields of the napi_struct structure, though zeroing the whole structure at allocation time can only be a good idea. Instead, each NAPI instance must be registered with the system with:

    void netif_napi_add(struct net_device *dev,
                        struct napi_struct *napi,
                        int (*poll)(struct napi_struct *, int),
                        int weight);

Here, dev is the net_device structure associated with the interface, napi is the NAPI structure, poll() is the polling method to be used with this instance, and weight is the relative weight to be given to this interface. Note that poll() and weight are no longer part of the net_device structure. As always, the setting of weight is somewhat arbitrary, with most values varying between 16 (for basic Ethernet) and 64 - though InfiniBand uses 100. There is talk of reworking weights in a future patch, but that is a separate issue.
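
To make the embedding concrete, here is a user-space mock-up rather than kernel code. The net_device and napi_struct definitions are cut-down stand-ins, netif_napi_add() is a dummy with the prototype shown above, and everything with a mydrv_ prefix (along with the rxq_from_napi() macro) is hypothetical. It registers one NAPI context per receive queue and shows how a poll() method can get from the napi pointer back to its queue.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Cut-down user-space stand-ins for the kernel structures. */
struct net_device { const char *name; };

struct napi_struct {
	int weight;
	int (*poll)(struct napi_struct *, int);
};

/* Dummy with the prototype shown above; the real one does more. */
void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
		    int (*poll)(struct napi_struct *, int), int weight)
{
	assert(dev != NULL && napi != NULL && poll != NULL);
	napi->poll = poll;
	napi->weight = weight;
}

/* Hypothetical driver state: one NAPI context per receive queue. */
#define MYDRV_RX_QUEUES 2

struct mydrv_rx_queue {
	int index;
	struct napi_struct napi;
};

struct mydrv_priv {
	struct net_device *dev;
	struct mydrv_rx_queue rxq[MYDRV_RX_QUEUES];
};

/* Recover the enclosing queue from the napi pointer given to poll(). */
#define rxq_from_napi(ptr) \
	((struct mydrv_rx_queue *)((char *)(ptr) - \
				   offsetof(struct mydrv_rx_queue, napi)))

static int mydrv_poll(struct napi_struct *napi, int budget)
{
	struct mydrv_rx_queue *q = rxq_from_napi(napi);

	(void)q;	/* a real driver would service this queue's ring */
	(void)budget;
	return 0;	/* no packets processed in this mock-up */
}

void mydrv_setup(struct mydrv_priv *priv, struct net_device *dev)
{
	int i;

	memset(priv, 0, sizeof(*priv));	/* zero everything, as suggested */
	priv->dev = dev;
	for (i = 0; i < MYDRV_RX_QUEUES; i++) {
		priv->rxq[i].index = i;
		netif_napi_add(dev, &priv->rxq[i].napi, mydrv_poll, 64);
	}
}
```

In a real driver the container_of() macro would do the job of rxq_from_napi(); the pointer arithmetic is spelled out here only so that the sketch builds outside the kernel.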

There is no netif_napi_remove(), as there is currently no need for it.

The prototype of the poll() method has changed somewhat:

    int (*poll)(struct napi_struct *napi, int budget);

The NAPI structure comes in as napi, of course. The budget parameter specifies how many packets the driver is allowed to pass into the network stack on this call. There is no need to manage separate quota fields anymore; drivers should simply respect budget and return the number of packets which were actually processed.
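
A poll() method written to this contract might look something like the following user-space sketch. The structures are stand-ins, netif_rx_complete() is a dummy which merely clears a flag, and the mydrv_ names and the rx_pending counter are invented for the example. The sketch also assumes the usual convention that a driver turns polling off once it has run out of packets before using up its budget.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-ins; the real structures hold much more. */
struct net_device { int unused; };
struct napi_struct { int scheduled; };	/* nonzero while polling is on */

/* Dummy for the polling-off call; here it just clears the flag. */
void netif_rx_complete(struct net_device *dev, struct napi_struct *napi)
{
	assert(dev != NULL);
	napi->scheduled = 0;
}

/* Hypothetical driver state; napi comes first so a cast recovers it. */
struct mydrv_priv {
	struct napi_struct napi;
	struct net_device *dev;
	int rx_pending;		/* packets waiting in the receive ring */
};

int mydrv_poll(struct napi_struct *napi, int budget)
{
	struct mydrv_priv *priv = (struct mydrv_priv *)napi;
	int done = 0;

	/* Pass at most budget packets up the stack. */
	while (done < budget && priv->rx_pending > 0) {
		priv->rx_pending--;	/* stands in for delivering a packet */
		done++;
	}

	/* Ring drained early: stop polling (and, in a real driver,
	 * turn receive interrupts back on). */
	if (done < budget)
		netif_rx_complete(priv->dev, napi);

	return done;
}
```

With 20 packets waiting and a budget of 16, the first call returns 16 and leaves polling enabled; the second returns 4 and turns it off.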

Most of the other NAPI-related functions have had the obvious changes made to their prototypes. The two ways of turning on polling are:

    void netif_rx_schedule(struct net_device *dev,
                           struct napi_struct *napi);
    /* ...or... */
    int netif_rx_schedule_prep(struct net_device *dev,
                               struct napi_struct *napi);
    void __netif_rx_schedule(struct net_device *dev,
                             struct napi_struct *napi);

Polling is turned off with:

    void netif_rx_complete(struct net_device *dev,
                           struct napi_struct *napi);

Since there can be more than one napi_struct structure in existence, each can have polling enabled independently. Drivers are responsible for disabling polling on all outstanding NAPI structures when the interface is shut down (or when its stop() method is called).

The netif_poll_enable() and netif_poll_disable() functions no longer exist, since polling is no longer tied to the net_device structure. Instead, these functions should be used:

    void napi_enable(struct napi_struct *napi);
    void napi_disable(struct napi_struct *napi);

Networking maintainer David Miller, who has taken on the development of this patch, says:

I don't anticipate making any more changes, just fixing bugs. So please help me with that so we can finalize this patch. I intend to cut a net-2.6.24 tree and stuff this patch into it by the end of the week.

So anybody charged with maintaining out-of-tree network drivers should be prepared for a significant API change in the 2.6.24 kernel.

Once upon atime

By Jonathan Corbet
August 8, 2007
Among the metadata maintained by most filesystems is the last file access time, or "atime." This time can be a useful value to have - it lets an administrator (or a program) know when a file was last used. There is a strong downside to this feature, though: it forces a write to the disk every time a file is accessed. So read-only operations, which might have been satisfied entirely from cache, turn into filesystem writes to keep the atime value up to date.

A recent discussion on write throttling turned to atime after Ingo Molnar pointed out that atime was probably a bigger performance problem than just about everything else. He went on to say:

Atime updates are by far the biggest IO performance deficiency that Linux has today. Getting rid of atime updates would give us more everyday Linux performance than all the pagecache speedups of the past 10 years, _combined_.

He also claimed that it was "perhaps the most stupid Unix design idea of all times."

Such discussion leads quickly to the question of what should be done about this old situation. One step that any Linux user can take now is to mount filesystems with the noatime option, which turns off the tracking of access times. For filesystem-intensive tasks, the performance reward can be immediately apparent. Unfortunately, turning off atime unconditionally will occasionally break software. Some mail tools will compare modification and access times to determine whether there is unread mail or not. The tmpwatch utility and some backup tools also use atime and can misbehave if atime is not correct. For this reason, distributors tend not to make noatime the default on installed systems.

Another approach was added in 2.6.20: the relatime mount option. If this flag is set, access times are only updated if they are (before the update) earlier than the modification time. This change allows utilities to see if the current version of a file has been read, but still cuts down significantly on atime updates. This option is not heavily used, perhaps because few people have heard of it and many distributions lack a version of mount which is new enough to know about it. Using relatime can still confuse tools which want to ask questions like "has this file been accessed in the last week?"

To fix that problem, Linus suggested a tweak to how relatime works: update atime if the current value is more than a certain time in the past - one day, for example. Ingo responded with a patch implementing that behavior and adding a couple of new boot options: relatime_interval, which specifies the update interval in seconds, and default_relatime, which turns on the relatime option in all filesystems by default.
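
The resulting decision is small enough to write down as a user-space C function. This is a sketch of the logic as described here, not Ingo's patch: the function name is invented, times are plain seconds, and treating an atime equal to the modification time as needing an update is an assumption.

```c
#include <assert.h>
#include <time.h>

/*
 * Return 1 if an access at time now should write a new atime.
 * interval plays the role of relatime_interval, in seconds.
 */
int relatime_needs_update(time_t atime, time_t mtime, time_t now,
			  long interval)
{
	assert(interval >= 0);

	/* Original relatime rule: the file changed since it was last read. */
	if (atime <= mtime)
		return 1;

	/* Proposed tweak: the recorded atime is more than interval old. */
	if (now - atime > interval)
		return 1;

	return 0;
}
```

With an interval of 86400 seconds, a file which is read many times a day costs at most one atime write per day once its current contents have been read, while a question like "has this file been accessed in the last week?" still gets a usable answer.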

Something resembling this version of the patch might go into 2.6.24. It was suggested that, whenever a file's inode is to be written to disk anyway, the kernel might as well update atime as well. Alan Cox objected that this change might make the overall behavior less predictable, which might not be desirable. No new version of the patch with this feature has been posted, so chances are it will not be in the version which gets merged - if and when that happens.

Page editor: Jonathan Corbet

Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds