Creating Linux virtual filesystems
This article is part of the LWN Porting Drivers to 2.5 series.
The 2.6 kernel, as of the 2.5.7 release, contains a set of routines called "libfs" which is designed to make the task of writing virtual filesystems easier. libfs handles many of the mundane tasks of implementing the Linux filesystem API, allowing non-filesystem developers to concentrate (mostly) on the specific functionality they want to provide. What it lacks, however, is documentation. Your author decided to take a little time away from subscription management code to play a bit with libfs; the following describes the basics of how to use this facility.
The task I undertook was not particularly ambitious: export a simple filesystem (of type "lwnfs") full of counter files. Reading one of these files yields the current value of the counter, which is then incremented. This leads to the following sort of exciting interaction:
    # cat /lwnfs/counter
    0
    # cat /lwnfs/counter
    1
    # ...
Your author was able to amuse himself well into the thousands this way; some users may tire of this game sooner, however. The impatient can get to higher values more quickly by writing to the counter file:
    # echo 1000 > /lwnfs/counter
    # cat /lwnfs/counter
    1000
    #
OK, so it's not going to be at the top of the list of things for Linus to merge once he returns, tanned, rested, and ready, from his Caribbean cruise, but it's OK as a way of showing the simplest possible filesystem. Numerous code samples will be shown below; the full module is also available on this page.
Initialization and superblock setup
So let's get started. A loadable module which implements a filesystem must, at load time, register that filesystem with the VFS layer. The lwnfs module initialization code is simple:
    static int __init lfs_init(void)
    {
        return register_filesystem(&lfs_type);
    }
    module_init(lfs_init);
The lfs_type argument is a structure which is set up as follows:
    static struct file_system_type lfs_type = {
        .owner    = THIS_MODULE,
        .name     = "lwnfs",
        .get_sb   = lfs_get_super,
        .kill_sb  = kill_litter_super,
    };
This is the basic data structure which describes a filesystem type to the kernel; it is declared in <linux/fs.h>. The owner field is used to manage the module's reference count, preventing unloading of the module while the filesystem code is in use. The name is what eventually ends up on a mount command line in user space. Then there are two functions for managing the filesystem's superblock - the root of the filesystem data structure. kill_litter_super() is a generic function provided by the VFS; it simply cleans up all of the in-core structures when the filesystem is unmounted; authors of simple virtual filesystems need not worry about this aspect of things. (It is necessary to unregister the filesystem at unload time, of course; see the source for the lwnfs exit function.)
The creation of the superblock must be done by the filesystem programmer. The task has gotten simpler, but still involves a bit of boilerplate code. In this case, lfs_get_super() hands off the task as follows:
    static struct super_block *lfs_get_super(struct file_system_type *fst,
                    int flags, const char *devname, void *data)
    {
        return get_sb_single(fst, flags, data, lfs_fill_super);
    }
Once again, get_sb_single() is generic code which handles much of the superblock creation task. But it will call lfs_fill_super(), which performs setup specific to our particular little filesystem. Its prototype is:
static int lfs_fill_super (struct super_block *sb, void *data, int silent);
The in-construction superblock is passed in, along with a couple of other arguments that we can ignore. We do have to fill in some of the superblock fields, though. The code starts out like this:
    sb->s_blocksize = PAGE_CACHE_SIZE;
    sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
    sb->s_magic = LFS_MAGIC;
    sb->s_op = &lfs_s_ops;
All virtual filesystem implementations have something that looks like this; it's just setting up the block size of the filesystem, a "magic number" to recognize superblocks by, and the superblock operations. These operations need not be written for a simple virtual filesystem - libfs has the stuff that is needed. So lfs_s_ops is defined (at the top file level) as:
    static struct super_operations lfs_s_ops = {
        .statfs     = simple_statfs,
        .drop_inode = generic_delete_inode,
    };
Creating the root directory
Getting back into lfs_fill_super(), our big remaining task is to create and populate the root directory for our new filesystem. The first step is to create the inode for the directory:
    root = lfs_make_inode(sb, S_IFDIR | 0755);
    if (! root)
        goto out;
    root->i_op = &simple_dir_inode_operations;
    root->i_fop = &simple_dir_operations;
lfs_make_inode() is a boilerplate function that we will look at eventually; for now, just assume that it returns a new, initialized inode that we can use. It needs the superblock and a mode argument, which is just like the mode value returned by the stat() system call. Since we passed S_IFDIR, the returned inode will describe a directory. The file and directory operations that we assign to this inode are, again, taken from libfs.
This directory inode must be put into the directory cache (by way of a "dentry" structure) so that the VFS can find it; that is done as follows:
    root_dentry = d_alloc_root(root);
    if (! root_dentry)
        goto out_iput;
    sb->s_root = root_dentry;
Creating files
The superblock now has a fully initialized root directory. All of the actual directory operations will be handled by libfs and the VFS layer, so life is easy. What libfs cannot do, however, is actually put anything of interest into that root directory – that's our job. So the final thing that lfs_fill_super() does before returning is to call:
lfs_create_files(sb, root_dentry);
In our sample module, lfs_create_files() creates one counter file in the root directory of the filesystem, and another in a subdirectory. We'll look mostly at the root-level file. The counters are implemented as atomic_t variables; our top-level counter (called, with great imagination, "counter") is set up as follows:
    static atomic_t counter;

    static void lfs_create_files (struct super_block *sb, struct dentry *root)
    {
        /* ... */
        atomic_set(&counter, 0);
        lfs_create_file(sb, root, "counter", &counter);
        /* ... */
    }
lfs_create_file does the real work of making a file in a directory. It has been made about as simple as possible, but there are still a few steps to be performed. The function starts out as:
    static struct dentry *lfs_create_file (struct super_block *sb,
                    struct dentry *dir, const char *name,
                    atomic_t *counter)
    {
        struct dentry *dentry;
        struct inode *inode;
        struct qstr qname;
Arguments include the usual superblock structure, and dir, the dentry for the directory that will contain this file. In this case, dir will be the root directory we created before, but it could be any directory within the filesystem.
Our first task is to create a directory entry for the new file:
    qname.name = name;
    qname.len = strlen (name);
    qname.hash = full_name_hash(name, qname.len);
    dentry = d_alloc(dir, &qname);
The setting up of qname just hashes the filename so that it can be found quickly in the dentry cache. Once that's done, we create the entry within our parent dir. The file also needs an inode, which we create as follows:
    inode = lfs_make_inode(sb, S_IFREG | 0644);
    if (! inode)
        goto out_dput;
    inode->i_fop = &lfs_file_ops;
    inode->u.generic_ip = counter;
Once again, we call lfs_make_inode (which we will look at shortly, honest), but this time we use it to create a regular file. The key to the creation of special-purpose files in virtual filesystems is to be found in the other two assignments:
- The i_fop field is set up with our file operations, which will actually implement reads and writes on the counter.
- We use the u.generic_ip pointer in the inode to stash aside a pointer to the atomic_t counter associated with this file.
In other words, i_fop defines the behavior of this particular file, and u.generic_ip is the file-specific data. All virtual filesystems of interest will make use of these two fields to set up the required behavior.
The last step in creating a file is to add it to the dentry cache:
    d_add(dentry, inode);
    return dentry;
Putting the inode into the dentry cache allows the VFS to find the file without having to consult our filesystem's directory operations. And that, in turn, means our filesystem does not need to have any directory operations of interest. The entire structure of our virtual filesystem lives in the kernel's cache structure, so our module need not remember the structure of the filesystem it has set up, and it need not implement a lookup operation. Needless to say, that makes life easier.
Inode creation
Before we get into the actual implementation of the counters, it's time to look at lfs_make_inode(). The function is pure boilerplate; it looks like:
    static struct inode *lfs_make_inode(struct super_block *sb, int mode)
    {
        struct inode *ret = new_inode(sb);

        if (ret) {
            ret->i_mode = mode;
            ret->i_uid = ret->i_gid = 0;
            ret->i_blksize = PAGE_CACHE_SIZE;
            ret->i_blocks = 0;
            ret->i_atime = ret->i_mtime = ret->i_ctime = CURRENT_TIME;
        }
        return ret;
    }
It simply allocates a new inode structure, and fills it in with values that make sense for a virtual file. The assignment of mode is of interest; the resulting inode will be a regular file or a directory (or something else) depending on how mode was passed in.
Implementing file operations
Up to this point, we have seen very little that actually makes the counter files work; it has all been VFS boilerplate so that we have a little filesystem to put those counters into. Now the time has come to see how the real work gets done. The operations on the counters themselves are to be found in the file_operations structure that we associate with the counter file inodes:
    static struct file_operations lfs_file_ops = {
        .open   = lfs_open,
        .read   = lfs_read_file,
        .write  = lfs_write_file,
    };
A pointer to this structure, remember, was stored in the inode by lfs_create_file().
The simplest operation is open:
    static int lfs_open(struct inode *inode, struct file *filp)
    {
        filp->private_data = inode->u.generic_ip;
        return 0;
    }
The only thing this function need do is copy the atomic_t pointer from the inode into the file structure, which makes it a bit easier to get at.
The interesting work is done by the read function, which must increment the counter and return its value to the user space program. It has the usual read operation prototype:
    static ssize_t lfs_read_file(struct file *filp, char *buf,
                    size_t count, loff_t *offset)
It starts by reading and incrementing the counter:
    atomic_t *counter = (atomic_t *) filp->private_data;
    int v = atomic_read(counter);
    atomic_inc(counter);
This code has been simplified a bit; see the module source for a couple of grungy, irrelevant details. Some readers will also notice a race condition here: two processes could read the counter before either increments it; the result would be the same counter value returned twice, with certain dire results. A serious module would probably serialize access to the counter with a spinlock. But this is supposed to be a simple demonstration.
So anyway, once we have the value of the counter, we have to return it to user space. That means encoding it into character form, and figuring out where and how it fits into the user-space buffer. After all, a user-space program can seek around in our virtual file.
    len = snprintf(tmp, TMPSIZE, "%d\n", v);
    if (*offset > len)
        return 0;
    if (count > len - *offset)
        count = len - *offset;
Once we've figured out how much data we can copy back, we just do it, adjust the file offset, and we're done.
    if (copy_to_user(buf, tmp + *offset, count))
        return -EFAULT;
    *offset += count;
    return count;
Then, there is lfs_write_file(), which allows a user to set the value of one of our counters:
    static ssize_t lfs_write_file(struct file *filp, const char *buf,
                    size_t count, loff_t *offset)
    {
        atomic_t *counter = (atomic_t *) filp->private_data;
        char tmp[TMPSIZE];

        if (*offset != 0)
            return -EINVAL;
        if (count >= TMPSIZE)
            return -EINVAL;
        memset(tmp, 0, TMPSIZE);
        if (copy_from_user(tmp, buf, count))
            return -EFAULT;
        atomic_set(counter, simple_strtol(tmp, NULL, 10));
        return count;
    }
That is just about it. The module also defines lfs_create_dir, which creates a directory in the filesystem; see the full source for how that works.
Conclusion
The libfs code, as demonstrated here, is sufficient for a wide variety of driver-specific virtual filesystems. Further examples can be found in the 2.5 kernel source in a few places:
- drivers/hotplug/pci_hotplug_core.c
- drivers/usb/core/inode.c
- drivers/oprofile/oprofilefs.c
- fs/ramfs/inode.c
...and in a few other spots – grep is your friend.
Keep in mind that the 2.5 driver model code makes it easy for drivers to export information within their own virtual filesystem; for many applications, that will be the preferred way of making information available to user space. For cases where only a custom filesystem will do, however, libfs makes the task (relatively) easy.
Comments

Posted Oct 22, 2002 19:23 UTC (Tue) by gregkh (subscriber, #8):

Unfortunately, your module is not able to be unloaded, right (I'm guessing, as I haven't tried the code myself)? Problem is that nothing is removing the files that you have created, forcing the module count to always remain positive.

To work around this, see the gyrations I had to do with get_mount and put_mount in the drivers/hotplug/pci_hotplug.c and drivers/usb/core/inode.c code.

And patches to help port those two files to use libfs more are greatly appreciated :)

Greg

Posted Oct 22, 2002 19:50 UTC (Tue) by corbet (editor, #1):

Actually, it unloads just fine. Trust me, I loaded/unloaded it a lot of times before I was ready to write anything about it... Only had to reboot once, though :). I haven't followed it too far, but my understanding is that kill_litter_super() cleans up all of that junk at unmount time.

Posted Oct 22, 2002 22:13 UTC (Tue) by gregkh (subscriber, #8):

Ah, missed the kill_litter_super() reference, nice.

But does this mean that if you unmount the fs, and then mount it again, the "counter" file is created from scratch, with an initialized value?

If so, this might not be what you want for a fs that is tied to a driver. You might want to keep the files around between mounts, and not be forced to regenerate them all at every mount time. This can be seen in usbfs, where the files are created when a device is added or removed from the system. We don't walk all devices at mount time, although that might not be a bad idea...

Anyway, this is a minor point; very nice article, it matches my upcoming linux.conf.au talk and paper quite well :)

Oh, and for doing something like this for the 2.4 kernel, this article might help out a bit.

Posted Oct 23, 2002 0:41 UTC (Wed) by roelofs (guest, #2599):

So you wrote this article, Jonathan? That was a tad difficult to divine from the actual posting, even though you were the most likely suspect (as Kernel page editor and all)... Perhaps a subtitle or closing block or something with an attribution in the future? (Don't be shy about tooting your own horn, eh? You might consider naming a few names on the About page, too.)

Posted Oct 22, 2002 20:29 UTC (Tue) by smoogen (subscriber, #97):

Really nice article. It helped me look at something different today. So how many more subscribers are needed for your employment needs?

Posted Oct 22, 2002 20:32 UTC (Tue) by corbet (editor, #1):

Last week, when we were just short of 2000, we set a goal of 4000 as a relatively stable place to be somewhere not too far into next year at the latest. We're up to almost 2100 now, so progress is happening... but the rate of new subscriptions is (not surprisingly) dropping off; it's probably going to take some serious work to get that doubling.

Posted Oct 23, 2002 4:38 UTC (Wed) by smoogen (subscriber, #97):

Thanks for the info. I think that having the number on the front page would be a useful thing. This is a new way of doing things, and so people might start understanding that they can't just wait for others to support you. While not on the page, it would be interesting to see how the country breakdown looks, with something like % of subscriptions and % of visitors (derived from IP):

- United States
- European Union
- Japan
- Canada
- Mexico

Now that is something to be nationalistic about :)

Posted Oct 23, 2002 10:04 UTC (Wed) by veelo (guest, #4694):

> European Union

Make that Europe; it will take a couple of decades (or eternity) before all European countries have joined the Union... Maybe we should just stick to continents; I suspect LWN readers are spread over too many countries.

Bastiaan.

Posted Oct 23, 2002 6:05 UTC (Wed) by scottt (guest, #5028):

On a somewhat related topic, I was under the impression that in 2.5 the VFS supports per-process namespaces, so a user without root privileges can mount filesystems at will. Can someone confirm this?

Posted Oct 23, 2002 12:34 UTC (Wed) by corbet (editor, #1):

2.5 has per-process namespaces, allowing the administrator to set up completely different views of the filesystem for different tasks. This capability remains restricted to root, though. If any user could set up any namespace they wanted, there would be a thousand ways to confuse setuid programs and take over the system.

Posted Oct 24, 2002 2:11 UTC (Thu) by brugolsky (guest, #28):

Al Viro also snuck it into 2.4.19. :-) It ought to be possible to allow non-root mounts on mount points where the user has write permission. As Jon noted, letting the user mount over, e.g., /etc/passwd, is incompatible with setuid executables.

I'm beginning to think about this because I want to start using Ron Minnich's implementation of 9P (v9fs.sourceforge.net) for various development and admin tasks. Great article Jon -- once Al Viro adds union-mount, may a thousand mini filesystems bloom. :-P

Posted Oct 24, 2002 20:20 UTC (Thu) by brouhaha (subscriber, #1698):

There's a simple solution to that: if a setuid program gets loaded when there is a per-process namespace active, the kernel can ignore the setuid bit and run it with no privileges. AFAICT, that would allow non-privileged users to play with their namespace all they want, without compromising system integrity.

Posted Oct 23, 2002 14:59 UTC (Wed) by nick.leroy (guest, #109):

Nice, informative article. Great! However, if I'm not mistaken, isn't the following line incorrect, or did I miss something?

    inode->u.generic_ip = counter;

Shouldn't it be (note the pointer to counter):

    inode->u.generic_ip = &counter;

Once again, great article.

-Nick

Posted Oct 23, 2002 15:16 UTC (Wed) by corbet (editor, #1):

No, the code is correct - lfs_create_file already gets a pointer to the counter as an argument, there's no need to indirect it again.

Posted Oct 23, 2002 16:50 UTC (Wed) by airwin (guest, #6920):

Actually, I didn't like this article as much as the other posters here. To my mind the balance between introductory material (which most LWN readers can understand) and technical details (which most LWN readers cannot understand) is off. Here are my two suggestions:

(1) I suggest you should have included the same technical details, but hidden them in the link to your overall source code for the small subset of your readers who are interested. After all, the main point here is that libfs creates a boilerplate way of creating your own special file system; the details of the boilerplate are not that interesting until the day you actually want to create your own filesystem.

(2) Please expand the introductory and concluding material. The introduction gives the impression that we might end up with hundreds of special file systems in the near future, and I believe that would be an even worse mess than the current /proc. Your conclusion then does give a tantalizing indication that there are other ways to export information than a special file system, but this needs expansion.

Posted Oct 23, 2002 18:38 UTC (Wed) by corbet (editor, #1):

Sorry you didn't like it... it was meant to be a far more technical piece than usually appears on the Kernel Page. The technical level of that page isn't changing, but I think it's worth putting in some more hardcore stuff occasionally.

> "The introduction gives the impression that we might end up with hundreds of special file systems in the near future, and I believe that would be an even worse mess than the current /proc."

That is a definite risk; the Linux system of the future could well have management issues that we don't see now.

> "Your conclusion then does give a tantalizing indication that there are other ways to export information than a special file system, but this needs expansion."

The device model was covered in some detail back in August.

Posted Oct 23, 2002 19:40 UTC (Wed) by airwin (guest, #6920):

Thanks for that additional link.

Posted Oct 24, 2002 7:42 UTC (Thu) by ekj (guest, #1524):

Well, a filesystem that lives in its own module doesn't bother anyone who doesn't load it. "lwnfs" now exists. It bothers noone. Those who want their linux-box to be able to count in kernel-mode can insmod it; the rest of us can go on our merry way. :-)

Posted Oct 24, 2002 22:40 UTC (Thu) by esnyder (guest, #6987):

I don't know; I liked the extra technical information, and would not really change the amount of code inline in the article. However, I did find myself wishing for a conclusion that offered some details on the kinds of (non-trivial) things that one might implement as virtual filesystems. While the counter-per-file was great to give a quick overview of how to implement something, I'm not sure I really got the big picture. Count me in as voting for continued 'special interest' articles when time permits.

Posted Oct 29, 2002 19:32 UTC (Tue) by erich (guest, #7127):

Greetings,

I'd like to see some experience reviews of replicating filesystems (preferably based on such virtual filesystems) but also on Coda, OpenAFS, Intermezzo, SFS (which is non-replicating, but a nice secure network file system). Especially on such issues as reliable locking, ease of setup, speed of replication, etc. I'd love to have a mail server on a replicating set of servers (using Maildir of course, not mbox). Preferably capable of disconnected operation - having one mail server in the US, one in Europe would be cool...

Erich

P.S. Thanks to HP, LWN and Bdale for the Debian Group Subscription.

Posted Mar 22, 2004 13:31 UTC (Mon) by mojozoox (guest, #20372):

Ya, alright, I managed to tweak the filesystem to run on linux 2.4. But I'm unable to mount it... The mkfs -t option does not seem to work with the lwnfs. Could somebody help on how to go about associating it with a device and then mounting it at a mountpoint?

Posted Apr 1, 2004 20:12 UTC (Thu) by domenpk (guest, #12382):

You are not supposed to mkfs it, or associate it with a device; just "mount -t lwnfs none /some/where" should work.

Posted Mar 2, 2006 18:48 UTC (Thu) by gjpc (guest, #36243):

Thanks very much for this helpful article. I have a question about mounted file systems within file systems. Let's say within an ext3 root file system I create a file system foo at mount point /fooRoot. I then create a node within the foo file system, fooChild. I then create a sub-node of fooChild, fooGrand. Now I mount an ext2 file system on node fooGrand. When a user tries to fopen( "/fooRoot/fooChild/fooGrand/someFile", "r" ), will that open request somehow be cleared through the foo file system before being handed to the ext2 file system?

Posted Apr 4, 2012 5:24 UTC (Wed) by crxz0193 (guest, #75555):

    /*
     * Stuff to pass in when registering the filesystem.
     */
    static struct file_system_type lfs_type = {
        .owner    = THIS_MODULE,
        .name     = "lwnfs",
        .mount    = lfs_get_super,
        /*.get_sb = lfs_get_super,*/
        .kill_sb  = kill_litter_super,
    };

    static struct super_block *lfs_get_super(struct file_system_type *fst,
                    int flags, char *devname, void *data)
    {
        /* return get_sb_single(fst, flags, data, lfs_fill_super, mnt);*/
        return mount_single(fst, flags, data, lfs_fill_super);
    }

Posted Apr 4, 2012 6:11 UTC (Wed) by viro (subscriber, #7872):

1) return value of mount_single() is struct dentry *; so's that of ->mount(). IOW, the body is correct, but the declaration isn't - it should return struct dentry *, not struct super_block *.

2) use d_make_root() instead of d_alloc_root(); cleanup is easier with that one (and d_alloc_root() will be gone in 3.4 anyway). In this case it becomes simply

    root = lfs_make_inode(sb, S_IFDIR | 0755);
    if (root) {
        root->i_op = &simple_dir_inode_operations;
        root->i_fop = &simple_dir_operations;
    }
    sb->s_root = d_make_root(root);
    if (!sb->s_root)
        return -ENOMEM;
    lfs_create_files(sb, sb->s_root);
    return 0;

3) simple_read_from_buffer() is your friend. Instead of

    len = snprintf(tmp, TMPSIZE, "%d\n", v);
    if (*offset > len)
        return 0;
    if (count > len - *offset)
        count = len - *offset;
    /*
     * Copy it back, increment the offset, and we're done.
     */
    if (copy_to_user(buf, tmp + *offset, count))
        return -EFAULT;
    *offset += count;
    return count;

just do

    len = snprintf(tmp, TMPSIZE, "%d\n", v);
    return simple_read_from_buffer(buf, count, offset, tmp, len);

and be done with that.

4) no need to open-code d_alloc_name(). Or simple_mkdir(), for that matter...

Posted Feb 21, 2019 2:34 UTC (Thu) by lumotwe (guest, #111752):

Nice article.