Brief items
The current development kernel remains 2.6.0-test9, as it has been
since October 25. The slow, steady accumulation of patches (all
relatively important fixes) in Linus's BitKeeper repository continues,
however. One of those patches disables the IDE tagged command queueing
feature, since it does not look like that code will be sufficiently stable
by the 2.6.0 release.
The current stable kernel is 2.4.22; Marcelo announced the first 2.4.23 release candidate on
November 10. The time has come for testing of this release to help
ensure a solid 2.4.23 kernel in the near future.
Comments (none posted)
Kernel development news
Linux has long had a priority mechanism which controls access to the
processor(s). Other system resources, however, are not so easily managed.
Often, the real performance bottleneck is not the processor, but some other
resource, such as I/O bandwidth to a disk drive. If disk I/O is the real
limiting factor, even a very low-priority process can, by creating many I/O
requests, strongly affect the performance of higher-priority processes on
the system.
Jens Axboe has now taken a stab at the I/O priority issue with a new version of his "completely fair queueing"
(CFQ) I/O scheduler. We first mentioned the
CFQ scheduler back in February; it works by creating a separate request
queue for every process issuing disk I/O and taking an equal number of
requests from each one of them. In this way, it seeks to distribute the
available I/O bandwidth equally across processes in the system and produce
"completely fair" results.
The new version gives each process an I/O priority, which is a number
between zero and 20 (inclusive). At the bottom end, disk I/O is only
allowed when the
disk would otherwise be idle. A priority of 20, instead, is the
"real-time" level; all requests at that level are satisfied before any
other requests are considered. The levels in between are for normal
processes; by default, the I/O priority is set to 10. A pair of system
calls has been added to adjust the I/O priority of a process, though the
form of those calls is likely to change in the future.
Internally, the per-process request queues have now been divided into an
array of 21 lists, one for each priority level. There is also a dispatch
queue, which contains the requests which have been selected for processing
next. A separate dispatch queue is still needed to allow some amount of
request ordering and merging.
When the time comes to fill the dispatch queue, the new scheduler starts
with the real-time queue. If requests are waiting there, they go straight
into the dispatch queue and the process is complete. There is also an
anticipatory scheduling feature for real-time requests: when the last
real-time request is processed, the scheduler will wait a short period
(10ms, currently) to see if any more real-time requests show up before
opening the floodgates for everybody else.
In the absence of real-time requests, the code passes through each priority
level, taking a decreasing number of requests from each one. Each process
gets to contribute one request at a time to the dispatch queue until the
quota for its priority level (expressed in both the number of requests and
the number of sectors to transfer) has been reached. Requests are only
taken from the idle priority queue if no other requests have been
dispatched for a configurable period of time (default 100ms).
With the new CFQ scheduler, an I/O request may not be serviced even after
it makes it into the dispatch queue. If a new request with real-time
priority shows up, all lower-priority requests are yanked back out of the dispatch
queue and have to go through the whole process again. Similarly, any
non-idle requests will cause any pending idle-priority requests to lose
their place in the dispatch queue.
The new scheduler appears to be uncontroversial - though it clearly is not
a critical fix and thus won't go into 2.6.0. The real debate appears to be
over how I/O priorities should be controlled. Some commenters would like
to see the nice() system call apply to I/O priorities as well as
CPU priorities. That, however, would be a fairly fundamental ABI change,
and is unlikely to happen.
Comments (none posted)
As those who have looked at kernel programming at all have noticed, there
are two basic memory allocation modes in Linux. One of those, which comes
down to
get_free_pages() in the end, allocates one or more
physically contiguous pages which are in the kernel's main
virtual address space (except for high memory pages, of course). Most
other memory allocation mechanisms, including the slab
allocator and
kmalloc(), are built on top of
get_free_pages(). In the
other corner is
vmalloc(), which allocates virtually contiguous (but
physically dispersed) pages in a separate virtual address space.
vmalloc() is relatively slow, but it can perform large allocations
that look contiguous to the kernel. It is thus used, for example, to
allocate space for code from loadable modules.
Erik Jacobson recently found the limits of
kmalloc() while querying /proc/interrupts on a very
large system. The code implementing /proc/interrupts attempts to
allocate a buffer for its output; the size of that buffer is dependent on
the number of processors on the system. On big systems, the required
buffer is large and the allocation fails. So Erik submitted a fix which
uses vmalloc() to allocate the memory instead.
Linus didn't like it. He pointed out that
the seq_file interface should
be used instead. Indeed, /proc/interrupts fits naturally into the
sort of output seq_file is intended to create, and doing things that way
can eliminate the need to allocate a large buffer at all. But Linus also
clarified his thoughts on when vmalloc() should be used:
There are basically no valid new uses of it. There's a few valid
legacy users (I think the file descriptor array), and there are
some drivers that use it (which is crap, but drivers are drivers),
and it's _really_ valid only for modules. Nothing else.
That should be sufficiently clear for most readers; perhaps an entry on
vmalloc() needs to be added to the coding style document.
There are a few reasons for this stance. Every call to vmalloc()
requires page table tweaking and translation buffer flushes, so it will be
slow. Space from vmalloc() lies outside of the regular kernel
range, which is (on most architectures) covered by a single, large page
table entry, so extra translation buffer slots are required to access it.
And, on many architectures, the amount of virtual space set aside for
vmalloc() is relatively small. For all of these reasons, use of
vmalloc() is discouraged, and patches containing
vmalloc() calls are increasingly unlikely to make it into the
kernel.
Comments (none posted)
The BK2CVS repository (which contains a CVS copy of Linus's public
BitKeeper repository) has been offline for a bit due to the backdoor
insertion attempt. When it returns, it may come back without the "pserver"
access mode which is normally used for anonymous CVS updates. Pserver is
convenient, but it increases the security exposure of the CVS repository
and it is not supported by the kernel.org mirror system. Given that a very
small number of people have been using that access mode, there seems to be
a consensus that it can just go away.
People do use the CVS repository, however. It just turns out that
many of them have noted that it is faster to use rsync to
update the entire repository from a kernel.org mirror than to update it
through CVS. The rsync approach looks like the way to go in the future,
but it does have one potential difficulty: if the repository is updated in
the middle of an rsync, the person downloading the copy might get an
inconsistent tree. Kernel hackers have to deal with enough race conditions
as it is; they would prefer not to encounter them while trying to update
their copy of the mainline kernel repository.
The solution that is likely to be
implemented involves the creation of a couple of sequence files. One is
fetched before doing the big repository rsync, and the other afterward. If
the sequence numbers in the two files do not match, the rsync operation
raced with an update of the repository and needs to be retried. This is,
of course, an Internet implementation of the seqlock algorithm used within the kernel. Look
for an update script to show up soon.
Comments (none posted)
Driver porting
The LWN
Porting Drivers to 2.6
series is currently going through an extensive review. Since the first
set of articles came out last February, quite a few things have changed and
a number of the articles have become a little stale. Trying to keep up
with the kernel is like that... The updating process is a little over
halfway complete as of this writing; we should be able to finish within a
week or so.
Most of the articles require only small changes at most. The "creating
virtual filesystems with libfs" article, however, has been significantly
expanded, thanks to the addition of simple_fill_super(). For
those who are curious, the newer, bigger version of the article appears
below.
Comments (none posted)
[This article has been reworked to reflect changes in the libfs interface;
those who are interested can still read
the
original version.]
Linus and numerous other kernel developers dislike the ioctl()
system call, seeing it as an uncontrolled way of adding new system calls to
the kernel. Putting new files into /proc is also discouraged,
since that area is seen as being a bit of a mess. Developers who populate
their code with ioctl() implementations or /proc files
are often encouraged to create a standalone virtual filesystem instead.
Filesystems make the interface explicit and visible in user space; they
also make it easier to write scripts which perform administrative
functions. But the writing of a Linux filesystem can be an intimidating
task. A developer who has spent some time just getting up to speed on the
driver interface can be forgiven for balking at having to learn the VFS API
as well.
The 2.6 kernel contains a set of routines called
"libfs" which is designed to make the task of writing virtual filesystems
easier. libfs handles many of the mundane tasks of implementing the Linux
filesystem API, allowing non-filesystem developers to concentrate (mostly)
on the specific functionality they want to provide. What it lacks,
however, is documentation. This article is an attempt to fill in that gap
a little bit.
The task we will undertake is not particularly ambitious: export a simple
filesystem (of type "lwnfs") full of counter files. Reading one of these
files yields the current value of the counter, which is then incremented.
This leads to the following sort of exciting interaction:
# cat /lwnfs/counter
0
# cat /lwnfs/counter
1
# ...
Your author was able to amuse himself well into the thousands this way;
some users may tire of this game sooner, however. The impatient can get to
higher values more quickly by writing to the counter file:
# echo 1000 > /lwnfs/counter
# cat /lwnfs/counter
1000
#
OK, so the Linux distributors will probably not get to excited about
advertising the new "lwnfs" capability. But it works as a way of showing
how to create virtual filesystems. For those who are interested, the full source is available.
Initialization and superblock setup
So let's get started.
A loadable module which implements a filesystem must, at load time,
register that filesystem with the VFS layer. The lwnfs module
initialization code is simple:
static int __init lfs_init(void)
{
return register_filesystem(&lfs_type);
}
module_init(lfs_init);
The lfs_type argument is a structure which is set up as follows:
static struct file_system_type lfs_type = {
.owner = THIS_MODULE,
.name = "lwnfs",
.get_sb = lfs_get_super,
.kill_sb = kill_litter_super,
};
This is the basic data structure which describes a filesystem type to the
kernel; it is declared in <linux/fs.h>. The owner
field is used to manage the module's reference count, preventing unloading
of the module while the filesystem code is in use. The name is
what eventually ends up on a mount command line in user space.
Then there are two functions for managing the filesystem's superblock - the
root of the filesystem data structure. kill_litter_super() is a
generic function provided by the VFS; it simply cleans up all of the
in-core structures when the filesystem is unmounted; authors of simple
virtual filesystems need not worry about this aspect of things. (It
is necessary to unregister the filesystem at unload time, of course;
see the source for the lwnfs exit function).
In many cases, the creation of the superblock must be done by the filesystem
programmer -- but see the "a simpler way" section below. This task
involves a bit of
boilerplate code. In this case, lfs_get_super() hands off the task
as follows:
static struct super_block *lfs_get_super(struct file_system_type *fst,
int flags, const char *devname, void *data)
{
return get_sb_single(fst, flags, data, lfs_fill_super);
}
Once again, get_sb_single() is generic code which handles much of
the superblock creation task. But it will call lfs_fill_super(),
which performs setup specific to our particular little filesystem. It's
prototype is:
static int lfs_fill_super (struct super_block *sb,
void *data, int silent);
The in-construction superblock is passed in, along with a couple of other
arguments that we can ignore. We do have to fill in some of the superblock
fields, though. The code starts out like this:
sb->s_blocksize = PAGE_CACHE_SIZE;
sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
sb->s_magic = LFS_MAGIC;
sb->s_op = &lfs_s_ops;
Most virtual filesystem implementations have something that looks like this;
it's just setting up the block size of the filesystem, a "magic number" to
recognize superblocks by, and the superblock operations. These operations
need not be written for a simple virtual filesystem - libfs has the stuff
that is needed. So lfs_s_ops is defined (at the top file level) as:
static struct super_operations lfs_s_ops = {
.statfs = simple_statfs,
.drop_inode = generic_delete_inode,
};
Creating the root directory
Getting back into
lfs_fill_super(), our big remaining task
is to create and populate the root directory for our new filesystem. The
first step is to create the inode for the directory:
root = lfs_make_inode(sb, S_IFDIR | 0755);
if (! root)
goto out;
root->i_op = &simple_dir_inode_operations;
root->i_fop = &simple_dir_operations;
lfs_make_inode() is a boilerplate function that we will look at
eventually; for now, just assume that it returns a new, initialized inode
that we can use. It needs the superblock and a mode argument,
which is just like the mode value returned by the stat() system
call. Since we passed S_IFDIR, the
returned inode will describe a directory. The file and directory
operations that we assign to this inode are, again, taken from libfs.
This directory inode must be put into
the directory cache (by way of a "dentry" structure)
so that the VFS can find it; that is done as follows:
root_dentry = d_alloc_root(root);
if (! root_dentry)
goto out_iput;
sb->s_root = root_dentry;
Creating files
The superblock now has a fully initialized root directory. All of the
actual directory operations will be handled by libfs and the VFS layer, so
life is easy.
What libfs cannot do, however, is actually put anything of interest into
that root directory – that's our job. So the final thing that
lfs_fill_super() does before returning is to call:
lfs_create_files(sb, root_dentry);
In our sample module, lfs_create_files() creates one counter file
in the root directory of the filesystem, and another in a subdirectory.
We'll look mostly at the root-level file.
The counters are implemented as atomic_t
variables; our top-level counter (called, with great imagination,
"counter") is set up as follows:
static atomic_t counter;
static void lfs_create_files (struct super_block *sb,
struct dentry *root)
{
/* ... */
atomic_set(&counter, 0);
lfs_create_file(sb, root, "counter", &counter);
/* ... */
}
lfs_create_file does the real work of making a file in a
directory. It has been made about as simple as possible, but there are
still a few steps to be performed. The function starts out as:
static struct dentry *lfs_create_file (struct super_block *sb,
struct dentry *dir, const char *name,
atomic_t *counter)
{
struct dentry *dentry;
struct inode *inode;
struct qstr qname;
Arguments include the usual superblock structure, and dir, the
dentry for the directory that will contain this file. In this case,
dir will be the root directory we created before, but it could be
any directory within the filesystem.
Our first task is to create a directory entry for the new file:
qname.name = name;
qname.len = strlen (name);
qname.hash = full_name_hash(name, qname.len);
dentry = d_alloc(dir, &qname);
The setting up of qname just hashes the file name so that it can be
found quickly in the dentry cache. Once that's done, we create the entry
within our parent dir. The file also needs an inode, which we
create as follows:
inode = lfs_make_inode(sb, S_IFREG | 0644);
if (! inode)
goto out_dput;
inode->i_fop = &lfs_file_ops;
inode->u.generic_ip = counter;
Once again, we call lfs_make_inode (which we will look at shortly,
honest), but this time we use it to create a regular file. The key to the
creation of special-purpose files in virtual filesystems is to be found in
the other two assignments:
- The i_fop field is set up with our file operations which will
actually implement reads and writes on the counter.
- We use the u.generic_ip pointer in the inode to stash aside a
pointer to the atomic_t counter associated with this file.
In other words, i_fop defines the behavior of this particular
file, and u.generic_ip is the file-specific data. All virtual
filesystems of interest will make use of these two fields to set up the
required behavior.
The last step in creating a file is to add it to the dentry cache:
d_add(dentry, inode);
return dentry;
Putting the inode into the dentry cache allows the VFS to find the file
without having to consult our filesystem's directory operations. And that,
in turn, means our filesystem does not need to have any directory
operations of interest. The entire structure of our virtual filesystem
lives in the kernel's cache structure, so our module need not remember the
structure of the filesystem it has set up, and it need not implement a
lookup operation. Needless to say, that makes life easier.
Inode creation
Before we get into the actual implementation of the counters, it's time to
look at
lfs_make_inode(). The function is pure boilerplate; it
looks like:
static struct inode *lfs_make_inode(struct super_block *sb, int mode)
{
struct inode *ret = new_inode(sb);
if (ret) {
ret->i_mode = mode;
ret->i_uid = ret->i_gid = 0;
ret->i_blksize = PAGE_CACHE_SIZE;
ret->i_blocks = 0;
ret->i_atime = ret->i_mtime = ret->i_ctime = CURRENT_TIME;
}
return ret;
}
It simply allocates a new inode structure, and fills it in with values that
make sense for a virtual file. The assignment of mode is of
interest; the resulting inode will be a regular file or a directory (or
something else) depending on how mode was passed in.
Implementing file operations
Up to this point, we have seen very little that actually makes the counter
files work; it's all been VFS boilerplate so that we have a little
filesystem to put those counters into. Now the time has come to see how
the real work gets done.
The operations on the counters
themselves are to be found in the file_operations structure that
we associate with the counter file inodes:
static struct file_operations lfs_file_ops = {
.open = lfs_open,
.read = lfs_read_file,
.write = lfs_write_file,
};
A pointer to this structure, remember, was stored in the inode by
lfs_create_file().
The simplest operation is open():
static int lfs_open(struct inode *inode, struct file *filp)
{
filp->private_data = inode->u.generic_ip;
return 0;
}
The only thing this function need do is copy the pointer to the
atomic_t pointer over into the file structure, which
makes it a bit easier to get at.
The interesting work is done by the read() function, which must
increment the counter and return its value to the user space program. It
has the usual read() operation prototype:
static ssize_t lfs_read_file(struct file *filp, char *buf,
size_t count, loff_t *offset)
It starts by reading and incrementing the counter:
atomic_t *counter = (atomic_t *) filp->private_data;
int v = atomic_read(counter);
atomic_inc(counter);
This code has been simplified a bit; see the module source for a couple of
grungy, irrelevant details. Some readers will also notice a race condition
here: two processes could read the counter before either increments it; the
result would be the same counter value returned twice, with certain dire
results. A serious module would probably serialize access to the counter
with a spinlock. But this is supposed to be a simple demonstration.
So anyway, once we have the value of the counter, we
have to return it to user space. That means encoding it into character
form, and figuring out where and how it fits into the user-space buffer.
After all, a user-space program can seek around in our virtual file.
len = snprintf(tmp, TMPSIZE, "%d\n", v);
if (*offset > len)
return 0;
if (count > len - *offset)
count = len - *offset;
Once we've figured out how much data we can copy back, we just do it,
adjust the file offset, and we're done.
if (copy_to_user(buf, tmp + *offset, count))
return -EFAULT;
*offset += count;
return count;
Then, there is lfs_write_file(), which allows a user to set the
value of one of our counters:
static ssize_t lfs_write_file(struct file *filp, const char *buf,
size_t count, loff_t *offset)
{
atomic_t *counter = (atomic_t *) filp->private_data;
char tmp[TMPSIZE];
if (*offset != 0)
return -EINVAL;
if (count >= TMPSIZE)
return -EINVAL;
memset(tmp, 0, TMPSIZE);
if (copy_from_user(tmp, buf, count))
return -EFAULT;
atomic_set(counter, simple_strtol(tmp, NULL, 10));
return count;
}
That is just about it. The module also defines lfs_create_dir,
which creates a directory in the filesystem; see the full source for how
that works.
A simpler way
The above example contains a great deal of scary-looking boilerplate code.
That boilerplate will be necessary for many applications, but there is a
shortcut that will work for many others. If you know at compile time which
files you wish to create, and you do not need to make subdirectories, read
on for the easier way.
In this section, we'll talk about a different version of the lwnfs module -
one which eliminates about 1/3 of the code. It implements a simple array
of four counters, with no subdirectories. Once again, full source is available if you are interested.
Above, we looked at a function called lfs_fill_super(), which
fills in the filesystem superblock, creates the root directory, and
populates it with files. In the simpler version, the entire function
becomes the following:
static int lfs_fill_super(struct super_block *sb, void *data, int silent)
{
return simple_fill_super(sb, LFS_MAGIC, OurFiles);
}
simple_fill_super() is a libfs function which does almost
everything we need. Its actual prototype is:
int simple_fill_super(struct super_block *sb, int magic,
struct tree_descr *files);
The struct super_block argument can be passed directly through,
and magic is the same magic number we saw above. The
files argument describes which files should be created in the
filesystem; the relevant structure is defined as follows:
struct tree_descr {
char *name;
struct file_operations *ops;
int mode;
};
The arguments should be fairly obvious by now; each structure gives the
name of the file to be created, the file operations to associate with the
file, and the protection bits for the file. There are, however, a couple
of quirks about how the array of tree_descr structures should be
built:
- Entries which are filled with NULLs (more strictly, where
name is NULL) are simply ignored. Do not try to end
the list with a NULL-filled structure, unless you like
decoding oops listings.
- The list is terminated, instead, by an entry that sets name
to the empty string.
- The entries correspond directly to the inode numbers which will be
assigned to the resulting files. This knowledge can be used to figure
out, in the file operations code, which file is being opened. But
this feature also implies that the first entry in the list cannot be
used, since the filesystem root directory will take inode zero. So,
when you create your tree_descr list, the first entry should
be NULL.
Having painfully learned all of the above, your author has set up the list
for the four "counter" files as follows:
static struct tree_descr OurFiles[] = {
{ NULL, NULL, 0 }, /* Skipped */
{ .name = "counter0", /* Inode 1 */
.ops = &lfs_file_ops,
.mode = S_IWUSR|S_IRUGO },
{ .name = "counter1", /* Inode 2 */
.ops = &lfs_file_ops,
.mode = S_IWUSR|S_IRUGO },
{ .name = "counter2", /* Inode 3 */
.ops = &lfs_file_ops,
.mode = S_IWUSR|S_IRUGO },
{ .name = "counter3", /* Inode 4 */
.ops = &lfs_file_ops,
.mode = S_IWUSR|S_IRUGO },
{ "", NULL, 0 } /* Terminates the list */
};
Once the call to simple_fill_super() returns, the work is done and
your filesystem is live. The only remaining detail might be in your
open() method; if you have multiple files sharing the same
file_operations structure, you will need to figure out which one
is actually being acted upon. The key here is the inode number, which
can be found in the i_ino field. The modified version of
lfs_open() finds the right counter as follows:
static int lfs_open(struct inode *inode, struct file *filp)
{
if (inode->i_ino > NCOUNTERS)
return -ENODEV; /* Should never happen. */
filp->private_data = counters + inode->i_ino - 1;
return 0;
}
The read() and write() functions use the
private_data field, and thus need not be modified from the
previous version.
Conclusion
The libfs code, as demonstrated here, is sufficient for a wide variety of
driver-specific virtual filesystems. Further examples can be found in the
2.5 kernel source in a few places:
- drivers/hotplug/pci_hotplug_core.c
- drivers/usb/core/inode.c
- drivers/oprofile/oprofilefs.c
- fs/ramfs/inode.c
- fs/nfsd/nfsctl.c (simple_fill_super() example)
...and in a few other spots – grep is your friend.
Keep in mind
that the 2.6 driver model code makes it easy for drivers to export
information within its own virtual filesystem; for many applications, that
will be the preferred way of making information available to user space.
The Driver Porting Series has
several articles on the driver model and sysfs.
For cases where only a custom filesystem will do, however, libfs makes the
task (relatively) easy.
Comments (4 posted)
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Security-related
Miscellaneous
Page editor: Jonathan Corbet
Next page: Distributions>>