Brief items
The current development kernel is 2.5.44, which was
released by Linus on October 18.
This one contains more
read-copy-update work, lots of filesystem and block driver patches from
Alexander Viro, a PowerPC64 update, the x86 BIOS enhanced disk drive patch,
some SCSI work, a lot of device model patches, and many other fixes and
updates. See
the long-format changelog for all the details.
After releasing 2.5.44, Linus headed off to cruise in the Caribbean; thus,
there are no new changes in his BitKeeper tree, and there will be no more
official development kernel releases until after the 27th.
The current development kernel prepatch from Alan Cox is 2.5.44-ac2. This one makes some more
wide-ranging changes, including the incorporation of interrupt stack
support, an updated LVM2, and numerous other fixes and updates.
The current 2.5 Kernel Status Summary from
Guillaume Boissiere is dated October 23.
The current stable kernel is 2.4.19; no 2.4.20 prepatches have been
released in the last week.
Kernel development news
Linus is spending the week cruising through the Caribbean; when he returns,
just a few days will remain before the Halloween feature freeze date.
There has been a lively discussion of the patches which will be waiting for
him when he gets back. Rob Landley has compiled
a list of 2.5 merge candidates based on those
discussions. The list is a good summary of what's still waiting in the
wings, but it assumes the reader understands what the various
patches are. So here's an annotated version:
- The new kernel configuration system. The new configuration
code has been generally well received; even the Qt-based graphical
configuration tool hasn't drawn a lot of complaints. Merging seems
likely, perhaps without the graphical tools, which, Linus thinks,
might be better off outside the kernel. (Covered here October 10).
- Extended attributes and ACLs for ext2 and ext3. Ted Ts'o has
taken on the work of fixing up and submitting this patch set. There
has been some concern, based on Red Hat's having evidently pulled the
ACL patches from their 8.0 kernel during the beta stage. But it would
be surprising if this patch did not get in in some form; access
control lists are a fairly basic requirement for a lot of users. Some
Linux filesystems (JFS, XFS) already support ACLs.
- Linux Trace Toolkit. LTT is a general tool for the tracing of
events in a Linux system, in both user and kernel space. It's not
clear whether this one will get in; not everybody is convinced that
this patch needs to be in the mainline kernel. (Covered briefly on April 18).
- LVM2 and/or EVMS. The 2.5 kernel currently lacks a working
logical volume manager, but there is little consensus on which of
these two developments should fill that gap. It is possible (though
unlikely) that the next stable kernel could ship without any volume
manager at all. (Covered here last
week).
- Shared page tables. This patch, originally by Daniel Phillips,
has since been picked up by Dave McCracken and fixed up for
inclusion. Shared page tables have a couple of benefits: they reduce
the time taken by the fork() system call (since page tables
and rmap structures need not be copied), and they reduce page table
overhead for systems (e.g. databases) using very large shared
segments. The patch has been slow to stabilize, and may appear too
risky for inclusion at this late date, but one never knows. (Covered
back in January).
- Large page support is another way of reducing overhead for
very large shared segments. A large page patch went into the kernel a
little while back (covered August 10), but it is difficult for
applications to use and does not work with shared segments, which is
where people really want this capability. A number of patches
currently exist in Andrew Morton's "-mm" tree which address these
problems.
- Dynamic probes/KProbes. This patch allows the placement of
debugger-style breakpoints at arbitrary locations within the kernel.
There is some pressure to get this one merged, but Linus has not taken
it so far.
- High-resolution timers. This longstanding patch by George
Anzinger implements the POSIX
timers specification. There are some concerns about how this
patch is implemented, and recently an
alternative version of the patch has surfaced. There is demand
for this capability; with luck some version of the patch will get in.
- Linux Kernel Crash Dumps. This is another patch which has been
around for a long time; it allows the creation of a full dump of the
kernel's state when it crashes. The purpose, of course, is to enable
vendors to debug crashes remotely. (First LWN mention: November 18, 1999).
- Console layer rewrite. This is mostly a massive cleanup
project which is getting a lot of the ancient cruft out of the console
layer while adding some new features. Parts of this work have only recently been finished.
- kexec. This relatively new patch adds a kexec()
system call that allows booting another kernel directly from Linux.
With this patch, one can reboot (possibly into another operating
system) without having to go through the whole BIOS startup routine
again. This patch is quite new and has some open issues; it may be a
better candidate for the next development series.
- USAGI IPv6. The USAGI project has been working on improved
IPv6 support for some time, and has released a comprehensive set of
patches. The word from David Miller,
however, is that the networking developers want to take a different
direction for IPv6 support (and CryptoAPI and IPSec as well).
"We will be incorporating lots of ideas and small code pieces
from USAGI's work, but most of the core engine will be a new
implementation." They intend to have this work complete and
ready for merging by Linus's return. That will be a big pile of new
code, however, that few people have seen.
- uClinux. This is the classic patch for running Linux on
systems with no memory management unit. It has recently been ported
forward to 2.5 and proposed for submission; Alan Cox has merged it
into his tree. New architectures are usually not that hard to get in,
and there has not been much opposition to this one.
- sys_epoll. This is the new incarnation of the
/dev/epoll patch, which seeks to make a faster, more scalable
poll() interface. The patch has been reworked into system
call form now, and might just get past Linus this time.
- New CD burning patch. This brand-new patch from Jens Axboe
(finally) allows the use of DMA operations when burning CDs. It also
turns burning into a zero-copy operation. The result should be
faster, more reliable CD writing. (Patch posted on October 23).
- In-kernel module loader. Rusty Russell's in-kernel module
loader patch is advertised as being safer and more capable than the
old, user-space implementation while simultaneously requiring less
kernel code. (Covered here September 26).
- Boot/module parameter rework. This patch made Rob's list,
but there has been little work in this area recently. Many of the
ideas from this work have been folded into the device model code.
(Covered here in June as part of the
Kernel Summit writeup).
- Hotplug CPU. This is another Rusty Russell patch which has
been around for a while. It seems to work, but has few users - most
of us don't pull processors out of running systems. Its application,
of course, is for high-availability systems and such.
- The unlimited groups patch. This is a recent patch which would
allow the kernel to support very large numbers of groups - the
developers have tested it with 10,000 at a time.
- Initramfs. This patch allows a disk image to be appended
directly to the kernel executable; it would then contain much of the
bootstrap code that is now found in the kernel itself. This patch
reduces the size of the kernel itself while making it far easier for
users to customize the early bootstrap process; it could be especially
useful for embedded systems. Much of this code has been ready for
some time; it has mostly been a matter of getting the user-space side
of things into shape. (Covered here August, 2001 and January, 2002).
- ReiserFS 4. This is a completely new version of the Reiser
filesystem; almost nobody has seen it, but it is supposed to show up
shortly in condition for merging.
- A larger dev_t. Supporting larger numbers of devices was high
on the list of things to do before the 2.5 series even started, but
the enlargement of the dev_t type still has not happened.
This one is on Alexander Viro's plate; he has been pushing through
other changes (in the block layer, mostly) that are prerequisites to
the dev_t change. (Covered here December, 2001).
That, of course, is a rather lengthy list. Much of this stuff is clearly
not going to get into the 2.5 kernel - at least, not if the feature freeze
holds as intended. At this point, it's mostly a matter of waiting until
Linus returns and seeing what he decides to do.
Linus and numerous other kernel developers dislike the
ioctl()
system call, seeing it as an uncontrolled way of adding new system calls to
the kernel. Putting new files into
/proc is also discouraged,
since that area is seen as being a bit of a mess. Developers who populate
their code with
ioctl() implementations or
/proc files
are often encouraged to create a standalone virtual filesystem instead.
Filesystems make the interface explicit and visible in user space; they
also make it easier to write scripts which perform administrative
functions. But the writing of a Linux filesystem can be an intimidating
task. A developer who has spent some time just getting up to speed on the
driver interface can be forgiven for balking at having to learn the VFS API
as well.
The 2.5 kernel, as of the 2.5.7 release, contains a set of routines called
"libfs" which is designed to make the task of writing virtual filesystems
easier. libfs handles many of the mundane tasks of implementing the Linux
filesystem API, allowing non-filesystem developers to concentrate (mostly)
on the specific functionality they want to provide. What it lacks,
however, is documentation. Your author decided to take a little time away
from subscription management code to play a bit with libfs; the following
describes the basics of how to use this facility.
The task I undertook was not particularly ambitious: export a simple
filesystem (of type "lwnfs") full of counter files. Reading one of these
files yields the current value of the counter, which is then incremented.
This leads to the following sort of exciting interaction:
# cat /lwnfs/counter
0
# cat /lwnfs/counter
1
# ...
Your author was able to amuse himself well into the thousands this way;
some users may tire of this game sooner, however. The impatient can get to
higher values more quickly by writing to the counter file:
# echo 1000 > /lwnfs/counter
# cat /lwnfs/counter
1000
#
OK, so it's not going to be at the top of the list
of things for Linus to merge once he returns, tanned, rested, and ready,
from his Caribbean cruise, but it serves
as a way of showing the simplest possible filesystem. Numerous code
samples will be shown below; the full module is also available on this page.
Initialization and superblock setup
So let's get started.
A loadable module which implements a filesystem must, at load time,
register that filesystem with the VFS layer. The lwnfs module
initialization code is simple:
static int __init lfs_init(void)
{
    return register_filesystem(&lfs_type);
}
module_init(lfs_init);
The lfs_type argument is a structure which is set up as follows:
static struct file_system_type lfs_type = {
    .owner   = THIS_MODULE,
    .name    = "lwnfs",
    .get_sb  = lfs_get_super,
    .kill_sb = kill_litter_super,
};
This is the basic data structure which describes a filesystem type to the
kernel; it is declared in <linux/fs.h>. The owner
field is used to manage the module's reference count, preventing unloading
of the module while the filesystem code is in use. The name is
what eventually ends up on a mount command line in user space.
Then there are two functions for managing the filesystem's superblock - the
root of the filesystem data structure. kill_litter_super() is a
generic function provided by the VFS; it simply cleans up all of the
in-core structures when the filesystem is unmounted; authors of simple
virtual filesystems need not worry about this aspect of things. (It
is necessary to unregister the filesystem at unload time, of course;
see the source for the lwnfs exit function).
The creation of the superblock must be done by the filesystem
programmer. The task has gotten simpler, but still involves a bit of
boilerplate code. In this case, lfs_get_super() hands off the task
as follows:
static struct super_block *lfs_get_super(struct file_system_type *fst,
        int flags, const char *devname, void *data)
{
    return get_sb_single(fst, flags, data, lfs_fill_super);
}
Once again, get_sb_single() is generic code which handles much of
the superblock creation task. But it will call lfs_fill_super(),
which performs setup specific to our particular little filesystem. Its
prototype is:
static int lfs_fill_super (struct super_block *sb, void *data, int silent);
The in-construction superblock is passed in, along with a couple of other
arguments that we can ignore. We do have to fill in some of the superblock
fields, though. The code starts out like this:
sb->s_blocksize = PAGE_CACHE_SIZE;
sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
sb->s_magic = LFS_MAGIC;
sb->s_op = &lfs_s_ops;
All virtual filesystem implementations have something that looks like this;
it's just setting up the block size of the filesystem, a "magic number" to
recognize superblocks by, and the superblock operations. These operations
need not be written for a simple virtual filesystem - libfs has the stuff
that is needed. So lfs_s_ops is defined (at the top file level) as:
static struct super_operations lfs_s_ops = {
    .statfs     = simple_statfs,
    .drop_inode = generic_delete_inode,
};
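The two block-size fields set in lfs_fill_super() have to agree with each other: s_blocksize_bits is the base-two logarithm of s_blocksize, just as PAGE_CACHE_SIZE is 1 << PAGE_CACHE_SHIFT. The following user-space sketch shows that relationship; blocksize_bits() is a hypothetical helper written for this illustration, not a kernel function, and the 4096-byte page size in the usage note is an assumption that holds on common x86 configurations.

```c
/*
 * User-space illustration only. blocksize_bits() is a hypothetical
 * helper: it returns the shift such that (1UL << shift) == size,
 * or -1 if size is not a power of two.
 */
int blocksize_bits(unsigned long size)
{
    int bits = 0;

    if (size == 0 || (size & (size - 1)) != 0)
        return -1;              /* zero or not a power of two */
    while ((1UL << bits) != size)
        bits++;
    return bits;
}
```

With a 4096-byte page, blocksize_bits(4096) yields 12, which is the value PAGE_CACHE_SHIFT would have on such a system.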
Creating the root directory
Getting back into
lfs_fill_super(), our big remaining task
is to create and populate the root directory for our new filesystem. The
first step is to create the inode for the directory:
root = lfs_make_inode(sb, S_IFDIR | 0755);
if (! root)
    goto out;
root->i_op = &simple_dir_inode_operations;
root->i_fop = &simple_dir_operations;
lfs_make_inode() is a boilerplate function that we will look at
eventually; for now, just assume that it returns a new, initialized inode
that we can use. It needs the superblock and a mode argument,
which is just like the mode value returned by the stat() system
call. Since we passed S_IFDIR, the
returned inode will describe a directory. The file and directory
operations that we assign to this inode are, again, taken from libfs.
This directory inode must be put into
the directory cache (by way of a "dentry" structure)
so that the VFS can find it; that is done as follows:
root_dentry = d_alloc_root(root);
if (! root_dentry)
    goto out_iput;
sb->s_root = root_dentry;
Creating files
The superblock now has a fully initialized root directory. All of the
actual directory operations will be handled by libfs and the VFS layer, so
life is easy.
What libfs cannot do, however, is actually put anything of interest into
that root directory – that's our job. So the final thing that
lfs_fill_super() does before returning is to call:
lfs_create_files(sb, root_dentry);
In our sample module, lfs_create_files() creates one counter file
in the root directory of the filesystem, and another in a subdirectory.
We'll look mostly at the root-level file.
The counters are implemented as atomic_t
variables; our top-level counter (called, with great imagination,
"counter") is set up as follows:
static atomic_t counter;

static void lfs_create_files (struct super_block *sb, struct dentry *root)
{
    /* ... */
    atomic_set(&counter, 0);
    lfs_create_file(sb, root, "counter", &counter);
    /* ... */
}
lfs_create_file does the real work of making a file in a
directory. It has been made about as simple as possible, but there are
still a few steps to be performed. The function starts out as:
static struct dentry *lfs_create_file (struct super_block *sb,
        struct dentry *dir, const char *name,
        atomic_t *counter)
{
    struct dentry *dentry;
    struct inode *inode;
    struct qstr qname;
Arguments include the usual superblock structure, and dir, the
dentry for the directory that will contain this file. In this case,
dir will be the root directory we created before, but it could be
any directory within the filesystem.
Our first task is to create a directory entry for the new file:
qname.name = name;
qname.len = strlen (name);
qname.hash = full_name_hash(name, qname.len);
dentry = d_alloc(dir, &qname);
The setting up of qname just hashes the filename so that it can be
found quickly in the dentry cache. Once that's done, we create the entry
within our parent dir. The file also needs an inode, which we
create as follows:
inode = lfs_make_inode(sb, S_IFREG | 0644);
if (! inode)
    goto out_dput;
inode->i_fop = &lfs_file_ops;
inode->u.generic_ip = counter;
Once again, we call lfs_make_inode (which we will look at shortly,
honest), but this time we use it to create a regular file. The key to the
creation of special-purpose files in virtual filesystems is to be found in
the other two assignments:
- The i_fop field is set up with our file operations which will
actually implement reads and writes on the counter.
- We use the u.generic_ip pointer in the inode to stash aside a
pointer to the atomic_t counter associated with this file.
In other words, i_fop defines the behavior of this particular
file, and u.generic_ip is the file-specific data. All virtual
filesystems of interest will make use of these two fields to set up the
required behavior.
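The split between shared behavior and per-file data can be modeled outside the kernel. The sketch below is a user-space analogy, not kernel code: struct model_inode, struct model_file_ops, and the functions around them are hypothetical stand-ins, with fop playing the role of i_fop and private_data playing the role of u.generic_ip.

```c
/* User-space model only; none of these types exist in the kernel. */
struct model_inode;

struct model_file_ops {
    int (*read_value)(struct model_inode *inode);
};

struct model_inode {
    const struct model_file_ops *fop;   /* plays the role of i_fop */
    void *private_data;                 /* plays the role of u.generic_ip */
};

/* One implementation, shared by every counter file. */
static int counter_read(struct model_inode *inode)
{
    int *counter = inode->private_data;

    return (*counter)++;                /* return old value, then bump */
}

static const struct model_file_ops counter_ops = {
    .read_value = counter_read,
};

/* Attach the shared behavior and the per-file data to an inode. */
void model_make_counter(struct model_inode *inode, int *counter)
{
    inode->fop = &counter_ops;
    inode->private_data = counter;
}

/* Generic caller: knows only the operations table, not counters. */
int model_read(struct model_inode *inode)
{
    return inode->fop->read_value(inode);
}
```

Two inodes set up with model_make_counter() share counter_ops but carry different counters, which is the same arrangement the two lwnfs counter files end up with.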
The last step in creating a file is to add it to the dentry cache:
d_add(dentry, inode);
return dentry;
Putting the inode into the dentry cache allows the VFS to find the file
without having to consult our filesystem's directory operations. And that,
in turn, means our filesystem does not need to have any directory
operations of interest. The entire structure of our virtual filesystem
lives in the kernel's cache structure, so our module need not remember the
structure of the filesystem it has set up, and it need not implement a
lookup operation. Needless to say, that makes life easier.
Inode creation
Before we get into the actual implementation of the counters, it's time to
look at
lfs_make_inode(). The function is pure boilerplate; it
looks like:
static struct inode *lfs_make_inode(struct super_block *sb, int mode)
{
    struct inode *ret = new_inode(sb);

    if (ret) {
        ret->i_mode = mode;
        ret->i_uid = ret->i_gid = 0;
        ret->i_blksize = PAGE_CACHE_SIZE;
        ret->i_blocks = 0;
        ret->i_atime = ret->i_mtime = ret->i_ctime = CURRENT_TIME;
    }
    return ret;
}
It simply allocates a new inode structure, and fills it in with values that
make sense for a virtual file. The assignment of mode is of
interest; the resulting inode will be a regular file or a directory (or
something else) depending on how mode was passed in.
Implementing file operations
Up to this point, we have seen very little that actually makes the counter
files work; it's all been VFS boilerplate so that we have a little
filesystem to put those counters into. Now the time has come to see how
the real work gets done.
The operations on the counters
themselves are to be found in the file_operations structure that
we associate with the counter file inodes:
static struct file_operations lfs_file_ops = {
    .open  = lfs_open,
    .read  = lfs_read_file,
    .write = lfs_write_file,
};
A pointer to this structure, remember, was stored in the inode by
lfs_create_file().
The simplest operation is open:
static int lfs_open(struct inode *inode, struct file *filp)
{
    filp->private_data = inode->u.generic_ip;
    return 0;
}
The only thing this function need do is copy the pointer to the
atomic_t counter over into the file structure, which
makes it a bit easier to get at.
The interesting work is done by the read function, which must
increment the counter and return its value to the user space program. It
has the usual read operation prototype:
static ssize_t lfs_read_file(struct file *filp, char *buf,
size_t count, loff_t *offset)
It starts by reading and incrementing the counter:
atomic_t *counter = (atomic_t *) filp->private_data;
int v = atomic_read(counter);
atomic_inc(counter);
This code has been simplified a bit; see the module source for a couple of
grungy, irrelevant details. Some readers will also notice a race condition
here: two processes could read the counter before either increments it; the
result would be the same counter value returned twice, with certain dire
results. A serious module would probably serialize access to the counter
with a spinlock. But this is supposed to be a simple demonstration.
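For the curious, here is one way the race could be closed without a lock. Note that this swaps in a different technique from the spinlock mentioned above: a single atomic fetch-and-add, shown in user space with C11 <stdatomic.h> rather than with the kernel's atomic_t API. The names model_counter, next_value_racy(), and next_value() are made up for this sketch.

```c
#include <stdatomic.h>

static atomic_int model_counter;        /* starts at zero */

/* Mirrors the simplified read path: load, then increment (two steps). */
int next_value_racy(void)
{
    int v = atomic_load(&model_counter);    /* another thread may run here */

    atomic_fetch_add(&model_counter, 1);
    return v;
}

/* Fetch-and-add hands back the old value and increments in one step. */
int next_value(void)
{
    return atomic_fetch_add(&model_counter, 1);
}
```

Each individual operation in next_value_racy() is atomic, but the pair is not; next_value() has no window in which a second caller can observe the same value.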
So anyway, once we have the value of the counter, we
have to return it to user space. That means encoding it into character
form, and figuring out where and how it fits into the user-space buffer.
After all, a user-space program can seek around in our virtual file.
len = snprintf(tmp, TMPSIZE, "%d\n", v);
if (*offset > len)
    return 0;
if (count > len - *offset)
    count = len - *offset;
Once we've figured out how much data we can copy back, we just do it,
adjust the file offset, and we're done.
if (copy_to_user(buf, tmp + *offset, count))
    return -EFAULT;
*offset += count;
return count;
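The offset arithmetic is easy to try out in user space. The following is a model of the read path, not the module's code: memcpy() stands in for copy_to_user(), the counter value is passed in directly, the TMPSIZE value of 20 is an assumption, and the negative-offset check is an addition made only to keep the model safe.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

#define TMPSIZE 20      /* assumed buffer size for this model */

/* User-space model of the read logic; v is the counter value to report. */
ssize_t model_read_file(int v, char *buf, size_t count, long *offset)
{
    char tmp[TMPSIZE];
    int len = snprintf(tmp, TMPSIZE, "%d\n", v);

    if (*offset < 0)
        return -1;                      /* guard added for the model only */
    if (*offset > len)
        return 0;                       /* past the end: nothing to read */
    if (count > (size_t)(len - *offset))
        count = (size_t)(len - *offset);    /* clamp to what remains */
    memcpy(buf, tmp + *offset, count);  /* stands in for copy_to_user() */
    *offset += (long)count;
    return (ssize_t)count;
}
```

A first call with an offset of zero returns the whole string; a second call with the updated offset returns zero, which is what tells a program like cat that it has reached the end of the file.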
Then, there is lfs_write_file(), which allows a user to set the
value of one of our counters:
static ssize_t lfs_write_file(struct file *filp, const char *buf,
        size_t count, loff_t *offset)
{
    atomic_t *counter = (atomic_t *) filp->private_data;
    char tmp[TMPSIZE];

    if (*offset != 0)
        return -EINVAL;
    if (count >= TMPSIZE)
        return -EINVAL;
    memset(tmp, 0, TMPSIZE);
    if (copy_from_user(tmp, buf, count))
        return -EFAULT;
    atomic_set(counter, simple_strtol(tmp, NULL, 10));
    return count;
}
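The checks in the write path can be exercised the same way. This is again a user-space model under stated assumptions: memcpy() replaces copy_from_user(), the standard strtol() replaces simple_strtol(), a plain int replaces the atomic_t, and TMPSIZE is assumed to be 20.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

#define TMPSIZE 20      /* assumed buffer size for this model */

/* User-space model of the write logic; *counter receives the new value. */
ssize_t model_write_file(int *counter, const char *buf,
                         size_t count, long *offset)
{
    char tmp[TMPSIZE];

    if (*offset != 0)
        return -EINVAL;                 /* only whole-value writes allowed */
    if (count >= TMPSIZE)
        return -EINVAL;                 /* leave room for the terminating NUL */
    memset(tmp, 0, TMPSIZE);            /* guarantees NUL termination */
    memcpy(tmp, buf, count);            /* stands in for copy_from_user() */
    *counter = (int)strtol(tmp, NULL, 10);
    return (ssize_t)count;
}
```

Writing the five bytes "1000\n" at offset zero sets the counter to 1000 and reports all five bytes as consumed; a write at any other offset, or one of TMPSIZE bytes or more, is refused with -EINVAL and leaves the counter alone.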
That is just about it. The module also defines lfs_create_dir,
which creates a directory in the filesystem; see the full source for how
that works.
Conclusion
The libfs code, as demonstrated here, is sufficient for a wide variety of
driver-specific virtual filesystems. Further examples can be found in the
2.5 kernel source in a few places:
- drivers/hotplug/pci_hotplug_core.c
- drivers/usb/core/inode.c
- drivers/oprofile/oprofilefs.c
- fs/ramfs/inode.c
...and in a few other spots – grep is your friend.
Keep in mind that the 2.5 driver model code makes it easy for drivers to export
information within its own virtual filesystem; for many applications, that
will be the preferred way of making information available to user space.
For cases where only a custom filesystem will do, however, libfs makes the
task (relatively) easy.
The Linux Security Module patches have been having a rough time of it
recently. The latest indignity came along when Christoph Hellwig
noticed the sys_security() system call
and promptly sent out a patch to remove it.
sys_security() is defined as:
int sys_security(unsigned int id, unsigned call,
        unsigned long *args);
Its purpose is to allow security modules to provide specific services
without the need to register their own system calls. In the case of
SELinux, sys_security() replaces what would otherwise be 52
different system calls.
So why remove sys_security()? There are two reasons, both
relating to a dislike of ioctl()-style calls. This style of call
uses an integer parameter (call, in this case) to choose between
several different operations; the arguments passed in are different for
every operation and have no well-defined type or meaning. This type of
system call argument creates problems for certain architectures, especially
those which have a 64-bit kernel space and a 32-bit user space (such as the
Sparc). On such systems, system call parameters must be converted between
the two views of the world, and there is no way to reliably do that
conversion if the types of the arguments are not known.
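To see why the conversion cannot be done reliably, consider a toy multiplexed call. Everything here is hypothetical: the operation numbers and model_mux() are invented for illustration and are not taken from any real security module. The point is that the same unsigned long slot holds an integer for one operation and a disguised pointer for another.

```c
#include <string.h>

/* Hypothetical operation numbers, for illustration only. */
enum { OP_ADD = 1, OP_STRLEN = 2 };

/* Same shape as a multiplexed call: the meaning of args[] depends on call. */
long model_mux(unsigned int call, unsigned long *args)
{
    switch (call) {
    case OP_ADD:        /* args[0] and args[1] are plain integers */
        return (long)(args[0] + args[1]);
    case OP_STRLEN:     /* args[0] is really a pointer in disguise */
        return (long)strlen((const char *)args[0]);
    default:
        return -1;      /* unknown operation */
    }
}
```

A translation layer sitting between a 32-bit caller and a 64-bit kernel sees only an array of unsigned long values; without knowing what each operation number means, it cannot tell that args[0] of OP_STRLEN needs pointer conversion while args[0] of OP_ADD does not.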
But even without that issue sys_security() would be in trouble.
This sort of "multiplexor" system call allows modules to add almost any
sort of functionality imaginable, without any sort of review. That freedom
leads to inconsistent messes, as is the case with many ioctl()
calls, or to the addition of functionality that perhaps should not be
there.
The word from the kernel developers seems to be that each security module
which needs system calls should register them separately. That way each
system call can be judged on its own merits. This approach partially
defeats the purpose of the LSM patch, which was intended to make security
regimes interchangeable. But the kernel developers, many of whom do not
much like the LSM patch to begin with, seem willing to pay that price.
The OSDL
Data Center Linux
Project has set out to:
Develop and evangelize the roadmap for a Linux platform software
that supports commercial software products and corporate IT
requirements, enabling developers to create Linux based solutions
for the data center market segment.
If that does not contain enough marketing-speak for you, there are white
papers available on the project web site. For the rest of us, what may be
of more interest is the first DCL kernel release, 2.5.44-dcl1. In addition to a few fixes, this
patch includes the Linux Kernel Crash Dump patch, EVMS, an enhanced NUMA
scheduler, and a few other tweaks. The project plans to add the shared
page table and high-resolution timers patches before too long. The -dcl
patches will show up in the "kernel trees" part of the patches section at
the bottom of this page.
Patches and updates
Kernel trees
Build system
Core kernel code
- john stultz: linux-2.5.34_vsyscall_A0. "This is a port of Andrea's x86-64 vsyscall(userspace) gettimeofday to
i386. Its fairly untested, but it works here!"
(October 18, 2002)
Development tools
Device drivers
Documentation
Filesystems and block I/O
Janitorial
Memory management
Networking
Architecture-specific
Security-related
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet