Release status
Kernel release status
The current 2.6 development kernel is 2.6.27-rc4,
released on August 20.
Along with lots of fixes, it includes support for the multitouch trackpad
on new Apple laptops, more reshuffling of architecture-specific include
files, a number of XFS improvements, interrupt stacks for the SPARC64
architecture, the removal of the obsolete Auerswald USB sound driver, and
new drivers for TI TUSB 6010 USB controllers, Inventra HDRC USB
controllers, and National Semiconductor adcxx8sxxx analog-to-digital
converters. The short-form changelog is in the announcement, or see the
long-form changelog for all the details.
Patches continue to accumulate in the mainline repository, and 2.6.27-rc5
may be released by the time you read this. Along with the fixes, -rc5 will
include Japanese translations of more kernel documents and generic interrupt
handling for some UIO devices.
Kernel development news
Quotes of the week
NOTE: When tempted to use multibyte bitfields on fixed-layout data,
you need to look in the mirror, ask yourself "what will they do to
me during code review for that?", shudder and decide that some
temptations are just not worth the pain.
--
Al Viro
The longer you queue stuff up, the more painful having to change
stuff at the beginning of that queue becomes. It can invalidate
everything else you worked on.
The only sane way to operate is to post your work early and often,
or else you'll live in a world of hurt, and it will be nobody's
fault but your own.
--
David Miller
linux-next is not a tree that you can track. It's a tree that you
can fetch _once_ and then throw away.
So what you can do is to "fetch" linux-next, and test it. But you
MUST NEVER EVER use it for anything else. You can't do development
on it, you cannot rebase onto it, you can't do _anything_ with it.
--
Linus Torvalds
We get better regression tests from getting stuff upstream. However
upstreaming stuff to Linus is not how to find regressions, it helps
but its suboptimal in that he will eventually ignore us if the
regression rate gets too high.
--
Dave Airlie
AXFS: a compressed, execute-in-place filesystem
By Jonathan Corbet
August 26, 2008
Filesystems are clearly an area of high development interest at the moment;
hardly a week goes by without a new filesystem release for Linux popping up
on some list or other. All of this development is motivated by a number of factors,
including the increasing size of storage devices and the increasing
capability of solid-state storage. Beyond that, though, there is the simple
fact that there is no single filesystem which is optimal for all
applications. The recently-announced
AXFS filesystem is a clear
example of what can be done if one targets a specific use case and
optimizes for that case only.
At a first impression, AXFS seems like a simple and limited filesystem. It
is, for example, read-only; the AXFS developers have made no provision for
changing the filesystem after it is created. Some filesystems have a great
deal of code dedicated to the creation of the optimal layout of file blocks
on disk; AXFS has none of that. Instead, it has a simple format which
divides the media into "regions" and, almost certainly, spreads accesses
across the device. There is no journaling, no logging, no snapshots, and
no multi-device volume management.
What AXFS does provide is compressed storage using zlib. It is, clearly,
aimed at embedded systems using flash-based storage. For such devices, a
compressed filesystem can be built using the provided tools, then loaded
into a minimal amount of flash on each device. It thus joins a number of
other compressed filesystems - cramfs and squashfs, for example - provided
for this sort of application. One interesting aspect of compressed,
flash-oriented filesystems is their apparent ability to stay out of the mainline
kernel. By posting AXFS for review on linux-kernel, developer Jared
Hulbert may be trying to avoid a similar fate.
The feature which makes AXFS different from squashfs and cramfs is its
support for execute-in-place (XIP) files. Some types of flash can be
mapped directly into the processor's address space. When running programs
stored on that flash, copying pages of executable code from flash into main
memory seems like a bit of a waste: since that code is already addressable
by the processor, why not run it from the flash? Executing code directly
from flash saves RAM; it also makes things faster by eliminating the need
to copy those pages into RAM at page fault time. As a result, systems
using XIP tend to boot more quickly, a feature which designers (and users)
of embedded systems appreciate.
Linux has had an execute in place
mechanism for a few years now, but relatively few filesystems make use
of it. AXFS has been designed from the beginning to facilitate XIP
operation - that's its reason for existence (and the origin of the "X" in
its name).
There is an additional twist, though. One would ordinarily consider
compressed storage and XIP to be mutually exclusive - there is little value
in mapping compressed executable code into a process's address space. To
be executed in place, a page of code must be stored uncompressed.
What makes AXFS unique is its ability to mix compressed and uncompressed
pages in the same executable file. So pages which will be frequently
accessed can be stored uncompressed and executed in place. Pages with
infrequently-needed code or which contain initialized data can be stored
compressed to save space and uncompressed at fault time.
This is a slick feature, but it is not of great use if one does not know
which pages of an executable file are heavily enough used to justify
storing them without compression. Trying to determine this information and
manually pick the representation of each page seems like an error-prone
exercise - not to mention one which would tend to create high employee
turnover. So another method is needed.
To that end, AXFS provides a built-in profiling mechanism. Each AXFS
filesystem is represented by a virtual file under /proc/axfs;
writing "on" to that file will cause AXFS to make a note of every
page fault within the filesystem. Reading that file then yields
spreadsheet-like output showing, for each file, how many times each page
was faulted into the page cache. Using this data, it is possible to
generate an AXFS filesystem image with an optimal number of compressed
pages for the target system.
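As a rough illustration of how such profile data might be used, the sketch below marks pages for uncompressed (XIP) storage once their fault count reaches a threshold. The structure, the function name, and the threshold policy are all hypothetical; the actual format of the /proc/axfs output and the behavior of the AXFS image tools are not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record: profiling data for one page of one file. */
struct page_profile {
	unsigned long page_index;   /* page offset within the file */
	unsigned long fault_count;  /* times the page was faulted in */
	int store_uncompressed;     /* output: 1 = XIP, 0 = compressed */
};

/*
 * Mark every page whose fault count reaches the threshold for
 * uncompressed storage. Returns the number of pages so marked.
 */
size_t select_xip_pages(struct page_profile *pages, size_t n,
			unsigned long threshold)
{
	size_t i, chosen = 0;

	assert(pages != NULL || n == 0);
	for (i = 0; i < n; i++) {
		pages[i].store_uncompressed = pages[i].fault_count >= threshold;
		if (pages[i].store_uncompressed)
			chosen++;
	}
	return chosen;
}
```

A higher threshold saves flash at the cost of more decompression at fault time; a lower one spends flash to save RAM and fault latency.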
Filesystems normally need a few rounds of review before they can make it
into the mainline; some filesystems need rather more than that. AXFS is
sufficiently simple, though, that it may find a quicker path into the
kernel. So far, the comments have mostly been positive, with the biggest
complaint being, perhaps, that its name is
too close to that of the existing XFS filesystem. So a 2.6.28 merge for
AXFS, while far from guaranteed, would appear to be not entirely out of the
question.
Sysfs and namespaces
By Jonathan Corbet
August 26, 2008
Support for network namespaces - allowing different groups of processes to
have a different view of the system's network interfaces, routes, firewall
rules, etc. - is nearing completion in recent kernels. A look at
net/Kconfig turns up something interesting, though: network
namespaces can only be enabled in kernels which do not support sysfs - the
two are mutually exclusive. Since most system configurations need sysfs to
work properly, this limitation has made it harder than it would otherwise
be to use, or even test, the network namespace feature.
The problem is simple: the network subsystem creates a sysfs directory for
each network interface on the system. For example, eth0 is
represented by /sys/class/net/eth0; therein one can find out just
about anything about how eth0 is configured, query its packet
statistics, and more. But, when network namespaces are in use, one group
of processes may have a different eth0 than another, so they
cannot share a globally-accessible sysfs tree. One solution might be to
add the network namespace as an explicit level in the sysfs tree, but that
would break user-space tools and fails to properly isolate namespaces from
each other. The real solution is to build namespace awareness more deeply
into sysfs.
Eric Biederman has been working on a set of sysfs namespace patches for the
last year or so; those patches now appear to be getting close to ready for
inclusion into the mainline. Network namespaces will be the first user of
this feature, but it has been written in a way that makes it possible for
any system namespace to provide differing views of parts of the sysfs
hierarchy.
The core concept is that of a "tagged" directory in sysfs. Any sysfs
directory can be associated with (at most) one type of tag, where that type
identifies which type of namespace controls how that directory is viewed.
Thus, for example, /sys/class/net would have a tag type
identifying the network namespace subsystem as the one which is in control
there. The /sys/kernel/uids directory, instead, will be managed
by the user namespace subsystem.
Once a directory is given a tag type, all subdirectories and
attribute files inherit the same type.
Namespace code makes use of tagged sysfs directories by adding an entry to
enum sysfs_tag_type, defined in <linux/sysfs.h>, to
identify its specific tag type.
The namespace must also create an operations structure:
    struct sysfs_tag_type_operations {
	const void *(*mount_tag)(void);
    };
The purpose of the mount_tag() method is to return a specific tag
(represented by a void * pointer) for the current
process. This tag, normally, will just be a pointer to the structure
describing the relevant namespace; for example, network namespaces
implement this method as follows:
    static const void *net_sysfs_mount_tag(void)
    {
	return current->nsproxy->net_ns;
    }
The tag operations must be registered with sysfs using:
    int sysfs_register_tag_type(enum sysfs_tag_type type,
                                struct sysfs_tag_type_operations *ops);
Thereafter, there are two ways of associating tags with a sysfs hierarchy.
One of those is to make a tagged directory directly with:
    int sysfs_make_tagged_dir(struct kobject *kobj,
                              enum sysfs_tag_type type);
The directory associated with kobj will have differing contents
depending on the value of the tag of the given type. The actual
tag associated with the contents of this directory will be determined (at
creation time) by calling a new function added to the kobj_type
structure:
    const void *(*sysfs_tag)(struct kobject *kobj);
The sysfs_tag() function is usually a short series of
container_of() calls which, eventually, locates the appropriate
namespace for the given kobj.
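To make the container_of() chain concrete, here is a user-space sketch. The structures are simplified, hypothetical stand-ins loosely modeled on their kernel namesakes, the macro is a plain-C approximation of the kernel's container_of(), and the function name net_sysfs_tag() is an assumption for illustration rather than code from the patch set.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified, hypothetical stand-ins for the kernel structures. */
struct kobject { const char *name; };
struct device { struct kobject kobj; };
struct net { int id; };
struct net_device {
	struct net *nd_net;   /* namespace owning this interface */
	struct device dev;
};

/* Walk from the embedded kobject out to the owning namespace. */
const void *net_sysfs_tag(struct kobject *kobj)
{
	struct device *dev;
	struct net_device *ndev;

	assert(kobj != NULL);
	dev = container_of(kobj, struct device, kobj);
	ndev = container_of(dev, struct net_device, dev);
	return ndev->nd_net;
}
```

Each step recovers the enclosing structure from a pointer to one of its members, so no extra back-pointer needs to be stored in the kobject.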
An alternative way to attach tags to a directory tree is to associate it
directly with the class structure. To that end, struct
class has two new members:
    enum sysfs_tag_type tag_type;
    const void *(*sysfs_tag)(struct device *dev);
When the class is instantiated, it will have tags of the given
tag_type; the specific tag for a given device in the class will be
found by calling the sysfs_tag() function.
Finally, if a specific tag ceases to be valid (because the associated
namespace is destroyed, normally), a call should be made to:
    void sysfs_exit_tag(enum sysfs_tag_type type, const void *tag);
This call will cause all sysfs directories with the given tag to
become invisible, and to be deleted when it is safe to do so.
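The visibility rules can be modeled in user space as follows. This is only a sketch of the concept, not the sysfs implementation: the tagged_entry structure and both functions are hypothetical, and retiring a tag is modeled simply by clearing it so that no lookup can match.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one entry in a tagged directory. */
struct tagged_entry {
	const char *name;
	const void *tag;   /* namespace owning the entry; NULL once retired */
};

/* Return the entry called "name" that is visible under "tag", if any. */
const struct tagged_entry *tagged_lookup(const struct tagged_entry *dir,
					 size_t n, const char *name,
					 const void *tag)
{
	size_t i;

	assert(name != NULL && tag != NULL);
	for (i = 0; i < n; i++)
		if (dir[i].tag == tag && strcmp(dir[i].name, name) == 0)
			return &dir[i];
	return NULL;
}

/* Model of the sysfs_exit_tag() effect: hide every entry carrying "tag". */
void tagged_exit(struct tagged_entry *dir, size_t n, const void *tag)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (dir[i].tag == tag)
			dir[i].tag = NULL;
}
```

Two entries named eth0 can coexist in the same directory as long as their tags differ; each process sees only the one matching the tag returned for it by mount_tag().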
Adding tagged directory support requires some significant changes to the
sysfs code. But the interface has been designed to make it very easy for
other subsystems to make use of tagged directories; it's a simple matter of
providing functions to return the specific tag values which should be
used. At this point, the biggest challenge might be making sense of sysfs
when its contents may be different for each observer. But that is a
challenge associated with namespaces in general.
TALPA strides forward
By Jake Edge
August 27, 2008
When last we left TALPA, it was still floundering around without a
solid threat model, but over the last several weeks that part has
changed. Eric Paris proposed
a fairly straightforward—though still somewhat
controversial—model for the threats that TALPA is supposed to
handle. With that in place, there is at least a framework for kernel
hackers to evaluate different
ways to solve the problem, while also keeping in mind other potential uses.
It seems almost tautological, but anti-virus scanners are supposed
to, well, scan files. In particular, they scan for known
malware and block access to files that
are found to be infected. For better or worse, scanning files is seen
as an
essential security mechanism by many, so TALPA is trying to provide a means
to that end. Paris describes it this way:
This is a file scanner. There may be all
sorts of marketing material or general beliefs that they provide
security against all sorts of wide and varied threats (and they do),
but in all reality the only threats they provide any help against are
those that can be found by scanning files. Simple as that. Some may
argue this isn't "good" security and I'm not going to make a strong
argument to the contrary, I can stand here for days and show cases where
this is highly useful but no one can provide a threat model more than to
say, "we attempt to locate files which may be harmful somewhere in the
digital ecosystem and try to deny access to that data."
All of the various scenarios where active processes can infect files with
malware or actively try to avoid scanning can be ignored under this model.
While this looks like "security theater" to some, it avoids the endless
what-ifs that were bogging down earlier discussions. It may not be a
threat model
that appeals to many of the kernel hackers, but it is one that they can
work with.
To many kernel developers (used to efficiency at nearly any
cost), time-consuming
filesystem scans seem ludicrous, especially since they only
"solve" a subset of the malware problem. But the fact remains that Linux
users, particularly in "enterprise" environments, believe they need this
kind of scanning and are willing to pay for products that provide it. The
current methods used by anti-virus vendors to do the scanning are
problematic at best, causing users to run kernels tainted with binary
modules. With a threat model—however limited—in place, work
can proceed
to find the right way to add this functionality into the kernel.
Paris is narrowing in on a design that
calls out to user space, both synchronously and asynchronously depending on
the operation. File access might go something like this:
- open() - causes interested user-space programs to be notified
asynchronously; anti-virus scanners might kick off a scan if needed
- read()/mmap() - causes a synchronous user-space
notification, which
allows anti-virus scanners to block access until scanning is complete; if
malware is found, cause the read/mmap to return an error
- write() - whenever the modification time (mtime) of a file is
updated, asynchronously notify user space; this would allow anti-virus
scanners to re-scan the data as desired
- close() - asynchronous user-space notification; another place
where anti-virus scanners could re-scan if the file has been dirtied
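The split between blocking and non-blocking notifications in the list above might be summarized as in the following sketch. The enum and function names are hypothetical; the proposal under discussion does not define these identifiers, and no working code for it exists yet.

```c
#include <assert.h>

/* Hypothetical names; the proposal does not define these identifiers. */
enum file_op { OP_OPEN, OP_READ, OP_MMAP, OP_WRITE, OP_CLOSE };
enum notify_mode { NOTIFY_ASYNC, NOTIFY_SYNC };

/*
 * Only read() and mmap() block the caller until user space answers;
 * the other operations merely inform interested listeners.
 */
enum notify_mode notify_mode_for(enum file_op op)
{
	assert((unsigned int)op <= (unsigned int)OP_CLOSE);
	switch (op) {
	case OP_READ:
	case OP_MMAP:
		return NOTIFY_SYNC;
	default:
		return NOTIFY_ASYNC;
	}
}
```

Keeping open(), write(), and close() asynchronous means that only the operations which actually expose file contents pay the cost of waiting for a scanner.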
Where and how to store the current scanning status of a file is still an
open question. Various proposals have been discussed, starting with a
non-persistent flag in the in-memory inode of a file. While simple, that
would cause a lot of unnecessary additional scanning as inodes drop out
of the cache. Persistent storage of the scanned status of a file
alleviates that problem, but runs into another: how do you handle multiple
anti-virus products (or, more generally, scanners of various sorts); whose
status gets stored with the inode?
For this reason, user-space scanners will need to keep their own database
of information about which inodes have been scanned. Anti-virus
scanners will also want to keep information about which version of
the virus database was used. Depending on the application, that
could be stored in extended attributes (xattrs) of the file or in some
other application-specific database. In any case, it is not a problem for
the kernel, as Ted Ts'o points out:
I'm just arguing that there should be absolutely *no* support in the
kernel for solving this particular problem, since the question of
whether a file has been scanned with a particular version of the virus
DB is purely a userspace problem.
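A user-space scanner's bookkeeping could look something like the sketch below. The record layout and the needs_rescan() function are hypothetical; where the record lives (an xattr, a private database, or elsewhere) is left to the application, and database version numbers are assumed to start at 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical per-file record kept by a user-space scanner. */
struct scan_record {
	unsigned int db_version;  /* virus database used for the last scan */
	time_t scanned_mtime;     /* file mtime observed at that scan */
};

/* A file needs scanning if never scanned, changed since, or the DB is newer. */
bool needs_rescan(const struct scan_record *rec, time_t file_mtime,
		  unsigned int current_db_version)
{
	/* Assumption of this sketch: version 0 means "no database loaded". */
	assert(current_db_version != 0);

	if (rec == NULL)
		return true;            /* never scanned */
	if (rec->scanned_mtime != file_mtime)
		return true;            /* contents may have changed */
	return rec->db_version < current_db_version;
}
```

Nothing in this decision requires kernel knowledge of virus database versions, which is the point being made in the quote above.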
It is important to keep the scanned status out of kernel purview in order
to ensure that policy decisions are not handled by the kernel itself. This
is in keeping with the longstanding kernel development principle that user
space should make all policy decisions. This allows new applications to
come along, ones that were perhaps never envisioned when the feature was
being designed. For example, Alan Cox describes another reason that the state of the
file with respect to scanning should be kept in user space:
This is another application layer matter. At the end of the day why does
the kernel care where this data is kept. For all we know someone might
want to centralise it or even distribute it between nodes on a clustered
file system.
The latest TALPA design includes an in-memory clean/dirty flag that can
short-circuit the blocking read notification (when clean). That flag gets
set to dirty whenever there is an mtime modification. This optimizes the
common case of reading a file that hasn't changed. Further optimizations
are possible down the line as Paris mentions:
If some general xattr namespace is agreed upon for such a thing someday
a patch may be acceptable to clear that namespace on mtime update, but I
don't plan to do that at this time since comparing the timestamp in the
xattr vs mtime should be good enough.
Various other uses for the kinds of hooks proposed for TALPA have also come up
in the discussion. Hierarchical storage management, where data is
transparently moved between different kinds of media, might be able to use
the blocking read mechanism. File indexing applications and intrusion
detection systems could use the mtime change notification as well. This is
a perfect example of kernel development in action; after a rough start,
the TALPA folks have done a much better job working with the
community.
Some might argue that the kernel development process is somehow suboptimal,
but it is the only way to get things into Linux. As the earlier adventures
of TALPA would indicate, flouting kernel tradition is likely to go
nowhere. While it is still a long way from being included—pesky
things like working code are still needed—it is clearly on a path to
get there some day, in one form or another.
Patches and updates
Development tools
- Roland McGrath: utrace.
(August 27, 2008)
Page editor: Jonathan Corbet