Brief items
The current 2.6 prepatch is 2.6.23-rc2, released by Linus on August 3:
"So I tried to hold people to the merge window, and
said no to a few pull requests, but this whole '-rc2 is the new -rc1' thing
is a disease, and not only is -rc2 late, it's bigger than it should be. Oh,
well." Along with a whole lot of fixes, -rc2 adds extensive
documentation to the Lguest code, a mechanism where kernel-space code can
request notification when it is about to be preempted from the CPU, new
configuration options for software suspend and hibernation, the removal of
support for SuperH sh73180 and 7300 CPUs, AMD Geode LX framebuffer support,
the removal of the arm26 port, and a TCP congestion control API change.
See the short-form changelog for details or the full changelog for lots of
details.
Roughly 50 changesets have been merged into the mainline repository since
-rc2.
The current stable 2.6 kernel remains 2.6.22.1. The 2.6.22.2 update is in review as of this
writing, and may be released as soon as August 9. It contains 84
fixes for problems all over the tree.
For older kernels: 2.6.21.7 was released on
August 4 with a fair number of important fixes.
Kernel development news
I don't doubt at all that virtualization is useful in some
areas. What I doubt rather strongly is that it will ever have the
kind of impact that the people involved in virtualization want it
to have. It would appear that virtualization is the
"message-passing microkernel" of this decade, and that people have
a really hard time accepting that the reason operating systems
still basically look 100% the same today as they did almost forty
years ago, is that that is simply a very practical arrangement!
-- Linus Torvalds
In Linux we never ever assume a driver is working simply because
the hardware vendor tested it. A decade of real world experience
PROVES precisely the opposite -- getting code out into the world
early and often repeatedly turned up problems not seen in hardware
vendor's testing.
-- Jeff Garzik
By Jonathan Corbet
August 7, 2007
Contemporary processors have an interesting problem: if they operate at
their full rated capacity for extended periods of time, they run a real
risk of heating to the point that they let the blue smoke out and never run
again. To avoid this kind of problem, processors (and other components)
are instrumented with temperature sensors. The BIOS programs the sensors
with specific "trip points" - temperatures where things will happen to keep
the system from overheating. At a given trip point, the system might turn
on the fan, throttle the processor, or, if disaster is imminent, shut the
system down hard.
The Linux ACPI subsystem provides the ability to query these trip points;
the relevant virtual files can be found under
/proc/acpi/thermal_zone. Your editor's laptop, for example,
reveals that it is set to throttle the processor at 86°C and to pull
the plug at 91°C. Traditionally, the ACPI code has also allowed a
suitably privileged user to change those trip points by writing new values
to the /proc files. That capability no longer exists, though; it
was removed in the 2.6.22 kernel.
Users are now starting to complain about
that change. They feel that the BIOS-set trip points on some systems are
positioned incorrectly, resulting in systems that run more slowly than they
think they should, fans which come on at the wrong time, and so on.
Naturally, they feel that the removal of the trip-point override feature
has reduced the functionality of their systems.
ACPI maintainer Len Brown responds that the
override feature is a bad idea for a number of reasons. At the top of the
list is the fact that the system cannot actually change the hardware trip
points. All it can do is disable them. Then the processor must take over
by polling the temperature sensors itself and responding when its software
trip points are reached. Should that polling and response fail to happen
for any reason, there is a real possibility that the hardware could be
damaged. Meltdowns could also easily occur if the trip points are set
incorrectly, leading to "Linux destroyed my laptop" postings echoed across
the net.
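The software side of that arrangement can be sketched in a few lines of C. This is only an illustration of the decision a polling loop would have to make on every temperature sample once the hardware trip points are disabled; the names (thermal_decide(), struct trip_points, the THERMAL_* actions) are hypothetical and are not taken from the ACPI code, and the 86/91 degree values are simply the ones reported by your editor's laptop.

```c
#include <assert.h>

/* Hypothetical actions a software thermal policy might take. */
enum thermal_action {
    THERMAL_NONE,      /* below every trip point: nothing to do */
    THERMAL_THROTTLE,  /* throttling trip point reached: slow the CPU */
    THERMAL_SHUTDOWN   /* critical trip point reached: power off now */
};

/* Software trip points in degrees Celsius (illustrative, not an ACPI type). */
struct trip_points {
    int throttle_c;
    int critical_c;
};

/*
 * Decide what to do for a single temperature sample.  Nothing here runs
 * on its own: if the caller stops polling, no action is ever taken,
 * which is exactly the failure mode described above.
 */
enum thermal_action thermal_decide(const struct trip_points *tp, int temp_c)
{
    /* A critical point below the throttle point makes no sense. */
    assert(tp->throttle_c <= tp->critical_c);

    if (temp_c >= tp->critical_c)
        return THERMAL_SHUTDOWN;
    if (temp_c >= tp->throttle_c)
        return THERMAL_THROTTLE;
    return THERMAL_NONE;
}
```

A caller would sample the sensor periodically and act on the return value; a badly chosen trip_points value, or a loop that stalls, leaves the hardware unprotected.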
On top of that, the BIOS can change the trip points at any time for reasons
of its own. Many of the use cases for trip-point overrides (controlling
when fans go on and off, for example) are better done by having a
user-space daemon control fan operation directly. And the truth of the
matter is that overriding trip points is usually (Len would say always) an
inappropriate response to problems which are better solved somewhere else.
When the issue was discussed in May, he summarized it this way:
The fact that the trip-points are writable has obscured, rather
than clarified, the actual causes of the failures. No less than 4
people in that bug report declared that cleaning the dust out of
their fan fixed the root cause. A bunch more said that the issues
went away when they stopped using ubuntu's user-space power save
daemon.
There are a couple more with broken active fan control -- which
also gets obscured rather than clarified by over-riding trip
points.
The remaining problems, says Len, are most likely not present when Windows
is running on the affected hardware. And, he says, Windows is highly
unlikely to be overriding the trip points. The conclusion is that Linux is
doing something wrong in its thermal management on those systems. He would
much rather find and fix the real problem than hide it through use of
trip-point overrides.
In the end, according to Len, there has never yet been a bug report which
suggests that Linux should be messing with trip points in this way. This
is a clear challenge for anybody who misses the trip-point override
feature: send in a suitably documented report showing the problem that this
feature solved. If the override feature truly turns out to be necessary,
it may just come back - but it may just happen that a fix for the actual
problem goes in instead.
By Jake Edge
August 8, 2007
SELinux provides a comprehensive security solution for Linux, but it is
large and complex. A much simpler approach is taken by the Simplified
Mandatory Access Control Kernel (Smack), a patch posted to linux-kernel by
Casey Schaufler. Like SELinux, Smack implements Mandatory Access Control
(MAC), but it purposely leaves out the role-based access control and type
enforcement that are major parts of SELinux. Smack is geared towards
solving smaller security problems than SELinux, requiring much less
configuration and very little application support.
Smack allows an administrator to define labels, 1-7 characters in length,
for kernel objects. Labels on objects are compared with the labels of a
task that tries to access them. By default, access is only allowed if the
labels match. There is a set of Smack-reserved labels that follow a
different set of rules, which allows most system objects and processes to be
unaffected by Smack restrictions. By default, Smack does not get in the
way of the OS, allowing the administrator to concentrate on just the users
and processes they want to secure.
Smack uses filesystem extended
attributes to store labels on files; administrators set the labels
using the attr command. The security.SMACK64 attribute
is used to store the Smack label on each file, so setting
/dev/null to have the Smack-reserved "star" label would
look like:
attr -S -s SMACK64 -V '*' /dev/null
For networks, NetLabel is used to set CIPSO labels and domains of
interpretation for sockets, allowing Smack systems to interoperate in those
strictly controlled networking environments.
An administrator can add rules, but there is no support
for wildcards or regular expressions; each rule must specify a subject
label, object label and the access allowed explicitly. The access types
are much like the traditional UNIX rwx bits, with the addition of
an a bit for append. For configuration, Smack uses the SELinux technique of
defining a filesystem that can be mounted, smackfs. Typically, it will be
mounted as /smack, providing various files that can be read or written to
govern Smack operation. For example, Smack access rules are written to
/smack/load; to change rules, one just writes a new set of access
permissions for the subject-object pair.
An example, one of several provided in the patch announcement, uses the
standard security levels for government documents. Smack labels are
defined for each level: Unclass for unclassified, C for
classified, S for secret, and TS for top secret. Then,
with a handful of rules:
C Unclass rx
S C rx
S Unclass rx
TS S rx
TS C rx
TS Unclass rx
the traditional hierarchy of access is defined. Because of the Smack
defaults, Unclass will only be able to access data with that same label,
whereas because of the rules above, TS can access S, C and Unclass data.
Note that there is no transitivity in Smack rules: just because S
can access C and TS can access S, that does not mean
that TS can access C. That rule must be explicitly
given. Also, because no write permissions have been given, tasks at each
level can only
write data with their own label. So secret tasks write secret data and so
on. Files will inherit the label of the task that creates them, with Smack
ensuring that the filesystem attribute is set. They will retain that label
unless it is explicitly reset by an administrator using the attr
command.
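The lookup described above can be modelled in user space with a small C sketch. It is not the Smack implementation: the function name smack_allowed(), the ACC_* bits and the rule table layout are invented for illustration, and the Smack-reserved labels (such as the "star" label) are deliberately left out. It does show the two properties that matter here: matching labels are always allowed, and everything else needs an explicit rule, with no transitivity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Access bits modelled on rwx plus 'a' for append (illustrative values). */
#define ACC_R 0x1
#define ACC_W 0x2
#define ACC_X 0x4
#define ACC_A 0x8

/* One explicit rule: subject label, object label, access allowed. */
struct demo_rule {
    const char *subject;
    const char *object;
    int access;
};

/* The six rules from the government-document example. */
static const struct demo_rule demo_rules[] = {
    { "C",  "Unclass", ACC_R | ACC_X },
    { "S",  "C",       ACC_R | ACC_X },
    { "S",  "Unclass", ACC_R | ACC_X },
    { "TS", "S",       ACC_R | ACC_X },
    { "TS", "C",       ACC_R | ACC_X },
    { "TS", "Unclass", ACC_R | ACC_X },
};

/* Return 1 if a task labelled `subject` may perform `request` on `object`. */
int smack_allowed(const char *subject, const char *object, int request)
{
    size_t i;

    assert(subject != NULL && object != NULL);

    /* Default policy: identical labels always match. */
    if (strcmp(subject, object) == 0)
        return 1;

    /* Otherwise only an explicit subject/object rule can grant access. */
    for (i = 0; i < sizeof(demo_rules) / sizeof(demo_rules[0]); i++) {
        if (strcmp(demo_rules[i].subject, subject) == 0 &&
            strcmp(demo_rules[i].object, object) == 0)
            return (demo_rules[i].access & request) == request;
    }

    /* No rule found: nothing is inferred from other rules. */
    return 0;
}
```

With this table, TS can read C only because the "TS C rx" rule is present; deleting that line would deny the access even though "TS S rx" and "S C rx" remain.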
A patched version of sshd is available from Schaufler's homepage which allows
an administrator to assign labels to users. Those labels get set on the user's
shell and terminal device as they log into the system, forcing the user to
follow the rules established for their label. A patched version of ls is also
available so that it can display the labels associated with files.
Smack is useful for limiting user and specific process access to
various resources, but it is not meant to be as general purpose as SELinux.
Constructing a set of Smack labels and rules governing system processes,
network services and the like, to restrict their access as SELinux does,
would be impossible. For administrators needing to secure those services,
SELinux is probably a better tool, but for simple compartmentalization,
Smack may well suffice.
By Jonathan Corbet
August 7, 2007
Last December, LWN looked at a proposal to rework the NAPI interface used for
packet reception in
high-bandwidth network drivers. Since then, the interface has gone through
some changes, but now appears to be in something close to its final form.
Anybody who maintains a NAPI-capable network driver will need to adapt to
the new API; in many cases the changes will be simple, but New-NAPI offers
some added features which may be of value to drivers of complicated hardware.
The core idea behind the NAPI interface is that, on a busy network, the
kernel does not need to be interrupted every time a network packet
arrives. Instead, the kernel can simply poll occasionally in the sure
knowledge that packets will be there waiting. Your editor likes to compare
packet receive interrupts with the beeps we all had, once upon a time, to
let us know when email had arrived. Few of us use those beeps anymore; we
have no doubt that there will be email waiting whenever we see fit to look
for it. Like us, the kernel can do without unneeded distractions; that is
especially true when those distractions can take the form of thousands of
interrupts every second.
There are other advantages to the NAPI approach. If the networking
subsystem is overwhelmed and must drop packets, NAPI makes it possible for
them to be dropped before they are ever fed into the stack. For various
reasons, packet reordering tends to be less of a problem with NAPI as
well.
The new napi_struct patch set (currently at version 5), like its
predecessor, introduces a new structure for controlling packet reception:
struct napi_struct {
    struct list_head poll_list;
    unsigned long state;
    int weight;
    int quota;
    int (*poll)(struct napi_struct *, int);
    /* Netpoll-related fields omitted */
};
This structure is no longer part of the net_device structure,
though; instead, drivers are expected to allocate it separately. Usually
it will be part of whatever larger structure the driver uses to represent
the device internally. One of the main advantages of this approach is that
device drivers can, if need be, create more than one napi_struct
structure for a given device. Contemporary hardware can support multiple
receive queues with nifty features like CPU affinity and flow separation;
multiple NAPI structures make it easier to use those queues efficiently.
Drivers need not fill in the fields of the napi_struct structure,
though zeroing the whole structure at allocation time can only be a good
idea. Instead, each NAPI instance must be registered with the system with:
void netif_napi_add(struct net_device *dev,
                    struct napi_struct *napi,
                    int (*poll)(struct napi_struct *, int),
                    int weight);
Here, dev is the net_device structure associated with the
interface, napi is the NAPI structure, poll() is the
polling method to be used with this instance, and weight is the
relative weight to be given to this interface. Note that poll()
and weight are no longer part of the net_device
structure. As always, the setting of weight is somewhat
arbitrary, with most values varying between 16 (for basic Ethernet) and 64
- though InfiniBand uses 100. There is talk of reworking weights in a
future patch, but that is a separate issue.
There is no netif_napi_remove(), as there is currently no need for
it.
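As a rough illustration of how a driver might arrange things, here is a user-space mock in C. Nothing in it is kernel code: struct napi_struct is cut down to two fields, demo_napi_add() stands in for netif_napi_add() (and omits the net_device argument entirely), container_of() is reimplemented locally, and the demo_* structures are hypothetical. The point is only to show one NAPI instance embedded in each receive queue, and how a poll() method can get from the napi pointer it receives back to the queue that owns it.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-in for the kernel structure; only what this sketch uses. */
struct napi_struct {
    int weight;
    int (*poll)(struct napi_struct *, int);
};

/* Local reimplementation of the kernel's container_of() idea. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

#define DEMO_NUM_QUEUES 2

/* Hypothetical per-queue state, each with its own NAPI instance. */
struct demo_rx_queue {
    int index;
    struct napi_struct napi;
};

/* Hypothetical driver-private structure for one device. */
struct demo_priv {
    struct demo_rx_queue rxq[DEMO_NUM_QUEUES];
};

/* Recover the owning queue from the napi pointer handed to poll(). */
struct demo_rx_queue *demo_queue_from_napi(struct napi_struct *napi)
{
    return container_of(napi, struct demo_rx_queue, napi);
}

/* Placeholder poll method: finds its queue, processes nothing. */
int demo_poll(struct napi_struct *napi, int budget)
{
    struct demo_rx_queue *q = demo_queue_from_napi(napi);

    (void)q;
    (void)budget;
    return 0;
}

/* Mock of the registration step: records poll() and weight only. */
void demo_napi_add(struct napi_struct *napi,
                   int (*poll)(struct napi_struct *, int), int weight)
{
    assert(napi != NULL && poll != NULL);
    napi->poll = poll;
    napi->weight = weight;
}

/* Register one NAPI instance per receive queue. */
void demo_setup(struct demo_priv *priv)
{
    int i;

    for (i = 0; i < DEMO_NUM_QUEUES; i++) {
        priv->rxq[i].index = i;
        demo_napi_add(&priv->rxq[i].napi, demo_poll, 64);
    }
}
```

Embedding the structure, rather than pointing to it, is what makes the container_of() step possible without any extra back-pointer in the NAPI structure itself.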
The prototype of the poll() method has changed somewhat:
int (*poll)(struct napi_struct *napi, int budget);
The NAPI structure comes in as napi, of course. The
budget parameter specifies how many packets the driver is allowed
to pass into the network stack on this call. There is no need to manage
separate quota fields anymore; drivers should simply respect
budget and return the number of packets which were actually
processed.
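A poll() method written to this contract might look something like the following user-space sketch. The receive ring is simulated with a simple counter, struct napi_struct is again a cut-down stand-in, and struct demo_nic is hypothetical. One convention here is an assumption rather than something stated above: the sketch treats "fewer packets than budget" as the signal that the ring is empty, at which point a real driver would presumably call netif_rx_complete() and re-enable receive interrupts.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-in for the kernel structure. */
struct napi_struct {
    int weight;
    int (*poll)(struct napi_struct *, int);
};

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical device state with a simulated receive ring. */
struct demo_nic {
    int rx_pending;   /* packets waiting in the simulated ring */
    int irq_enabled;  /* 1 when receive interrupts are switched on */
    struct napi_struct napi;
};

/* Process at most `budget` packets and report how many were handled. */
int demo_poll(struct napi_struct *napi, int budget)
{
    struct demo_nic *nic = container_of(napi, struct demo_nic, napi);
    int work_done = 0;

    assert(budget > 0);

    while (work_done < budget && nic->rx_pending > 0) {
        nic->rx_pending--;  /* real driver: pass one packet to the stack */
        work_done++;
    }

    if (work_done < budget) {
        /*
         * Ring drained (assumed convention): a real driver would stop
         * polling here and turn receive interrupts back on.
         */
        nic->irq_enabled = 1;
    }

    return work_done;
}
```

Starting with 100 pending packets and a budget of 64, the first call handles 64 packets and leaves interrupts off; the second handles the remaining 36 and switches them back on.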
Most of the other NAPI-related functions have had the obvious changes made
to their prototypes. The two ways of turning on polling are:
void netif_rx_schedule(struct net_device *dev,
                       struct napi_struct *napi);
/* ...or... */
int netif_rx_schedule_prep(struct net_device *dev,
                           struct napi_struct *napi);
void __netif_rx_schedule(struct net_device *dev,
                         struct napi_struct *napi);
Polling is turned off with:
void netif_rx_complete(struct net_device *dev,
                       struct napi_struct *napi);
Since there can be more than one napi_struct structure in
existence, each can have polling enabled independently. Drivers are
responsible for disabling polling on all outstanding NAPI structures when
the interface is shut down (or when its stop() method is called).
The netif_poll_enable() and netif_poll_disable()
functions no longer exist, since polling is no longer tied to the
net_device structure. Instead, these functions should be used:
void napi_enable(struct napi_struct *napi);
void napi_disable(struct napi_struct *napi);
Networking maintainer David Miller, who has taken on the development of
this patch, says:
I don't anticipate making any more changes, just fixing bugs.
So please help me with that so we can finalize this patch. I
intend to cut a net-2.6.24 tree and stuff this patch into it by
the end of the week.
So anybody charged with maintaining out-of-tree network drivers should be
prepared for a significant API change in the 2.6.24 kernel.
By Jonathan Corbet
August 8, 2007
Among the metadata maintained by most filesystems is the last file access
time, or "atime." This time can be a useful value to have - it lets an
administrator (or a program) know when a file was last used. There is a
strong downside to this feature, though: it forces a write to the disk
every time a file is accessed. So read-only operations, which might have
been satisfied entirely from cache, turn into filesystem writes to keep the
atime value up to date.
A recent discussion on write throttling turned to atime after Ingo Molnar
pointed out that atime was probably a bigger performance problem than just
about everything else. He went on to say:
Atime updates are by far the biggest IO performance deficiency that
Linux has today. Getting rid of atime updates would give us more
everyday Linux performance than all the pagecache speedups of the
past 10 years, _combined_.
He also claimed that it was "perhaps the most stupid Unix design idea
of all times."
Such discussion leads quickly to the question of what should be done about
this old situation. One step that any Linux user can take now is to mount
filesystems with the noatime option, which turns off the tracking
of access times. For filesystem-intensive tasks, the performance reward
can be immediately apparent. Unfortunately, turning off atime
unconditionally will occasionally break software. Some mail tools will
compare modification and access times to determine whether there is unread
mail or not. The tmpwatch utility and some backup tools also use
atime and can misbehave if atime is not correct. For this reason,
distributors tend not to make noatime the default on installed
systems.
Another approach was added in 2.6.20: the relatime mount option. If
this flag is set, access times are only updated if they are (before the
update) earlier than the modification time. This change allows utilities
to see if the current version of a file has been read, but still cuts down
significantly on atime updates. This option is not heavily used, perhaps
because few people have heard of it and many distributions lack a version of
mount which is new enough to know about it. Using
relatime can still confuse tools which want to ask questions like
"has this file been accessed in the last week?"
To fix that problem, Linus suggested a tweak to how relatime works: update
atime if its current value is more than a certain time in the past - one day,
for example. Ingo responded with a patch implementing that behavior and
adding a couple of new boot options:
relatime_interval, which specifies the update interval in seconds,
and default_relatime, which turns on the relatime option
in all filesystems by default.
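The combined rule is simple enough to write down as a small C function. This is a sketch of the logic as described here, not the code from Ingo's patch: the function name, its parameters and the DEMO_RELATIME_INTERVAL constant are all made up for the example, and the comparison uses "earlier than" exactly as the description above does.

```c
#include <assert.h>
#include <time.h>

/* Illustrative default matching the "one day" example, in seconds. */
#define DEMO_RELATIME_INTERVAL (24L * 60L * 60L)

/*
 * Return 1 if an access at time `now` should cause atime to be written.
 * `interval` is the maximum allowed staleness of atime, in seconds.
 */
int demo_atime_needs_update(time_t atime, time_t mtime, time_t now,
                            long interval)
{
    assert(interval >= 0);

    /* Original relatime rule: the file was modified after the last read. */
    if (atime < mtime)
        return 1;

    /* Proposed tweak: the recorded atime is older than the interval. */
    if ((long)(now - atime) > interval)
        return 1;

    /* Otherwise the access is satisfied without a disk write. */
    return 0;
}
```

With a one-day interval, a file read many times in an afternoon generates at most one atime write, yet a tool asking whether the file was touched in the last week still gets a useful answer.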
Something resembling this version of the patch might go into 2.6.24. It
was suggested that, whenever a file's inode is to be written to disk
anyway, the kernel might as well update atime as well. Alan Cox objected
that this change might make the overall behavior less predictable, which
might not be desirable. No new version of the patch with this feature has
been posted, so chances are it will not be in the version which gets merged
- if and when that happens.
Page editor: Jonathan Corbet