
Sysfs and namespaces

By Jonathan Corbet
August 26, 2008
Support for network namespaces - allowing different groups of processes to have a different view of the system's network interfaces, routes, firewall rules, etc. - is nearing completion in recent kernels. A look at net/Kconfig turns up something interesting, though: network namespaces can only be enabled in kernels which do not support sysfs - the two are mutually exclusive. Since most system configurations need sysfs to work properly, this limitation has made it harder than it would otherwise be to use, or even test, the network namespace feature.

The problem is simple: the network subsystem creates a sysfs directory for each network interface on the system. For example, eth0 is represented by /sys/class/net/eth0; therein one can find out just about anything about how eth0 is configured, query its packet statistics, and more. But, when network namespaces are in use, one group of processes may have a different eth0 than another, so they cannot share a globally-accessible sysfs tree. One solution might be to add the network namespace as an explicit level in the sysfs tree, but that would break user-space tools and fails to properly isolate namespaces from each other. The real solution is to build namespace awareness more deeply into sysfs.

Eric Biederman has been working on a set of sysfs namespace patches for the last year or so; those patches now appear to be getting close to ready for inclusion into the mainline. Network namespaces will be the first user of this feature, but it has been written in a way that makes it possible for any system namespace to provide differing views of parts of the sysfs hierarchy.

The core concept is that of a "tagged" directory in sysfs. Any sysfs directory can be associated with (at most) one type of tag, where that type identifies which type of namespace controls how that directory is viewed. Thus, for example, /sys/class/net would have a tag type identifying the network namespace subsystem as the one which is in control there. The /sys/kernel/uids directory, instead, will be managed by the user namespace subsystem. Once a directory is given a tag type, all subdirectories and attribute files inherit the same type.
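The effect of a tagged directory can be modeled in a few lines of user-space C. This is only a sketch of the concept, not the sysfs implementation: `struct entry`, `struct tagged_dir`, `tagged_lookup()` and `tagged_visible()` are hypothetical names invented for illustration. Each entry carries a tag, and a lookup succeeds only when the viewer holds the same tag.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a sysfs entry: a name plus the tag that owns it. */
struct entry {
    const char *name;
    const void *tag;    /* e.g. a pointer to the owning namespace */
};

/* Hypothetical model of a tagged directory: a flat array of entries. */
struct tagged_dir {
    const struct entry *entries;
    size_t count;
};

/* Return the entry called `name` as seen by a viewer holding `tag`,
 * or NULL if no such entry is visible to that viewer. */
const struct entry *tagged_lookup(const struct tagged_dir *dir,
                                  const char *name, const void *tag)
{
    assert(dir != NULL && name != NULL);
    for (size_t i = 0; i < dir->count; i++) {
        const struct entry *e = &dir->entries[i];
        if (e->tag == tag && strcmp(e->name, name) == 0)
            return e;
    }
    return NULL;
}

/* Count how many entries a viewer holding `tag` can see. */
size_t tagged_visible(const struct tagged_dir *dir, const void *tag)
{
    assert(dir != NULL);
    size_t n = 0;
    for (size_t i = 0; i < dir->count; i++)
        if (dir->entries[i].tag == tag)
            n++;
    return n;
}
```

In this model two entries named eth0 can coexist in one directory as long as their tags differ; each viewer resolves the name to its own entry and never sees the other.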

Namespace code makes use of tagged sysfs directories by adding an entry to enum sysfs_tag_type, defined in <linux/sysfs.h>, to identify its specific tag type. The namespace must also create an operations structure:

    struct sysfs_tag_type_operations {
        const void *(*mount_tag)(void);
    };

The purpose of the mount_tag() method is to return a specific tag (represented by a void * pointer) for the current process. This tag, normally, will just be a pointer to the structure describing the relevant namespace; for example, network namespaces implement this method as follows:

    static const void *net_sysfs_mount_tag(void)
    {
        return current->nsproxy->net_ns;
    }

The tag operations must be registered with sysfs using:

    int sysfs_register_tag_type(enum sysfs_tag_type type, 
                                struct sysfs_tag_type_operations *ops);
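To make the registration step concrete, here is a user-space mock of what such a registry might look like. Everything prefixed with `model_` is an assumption made for illustration; the error codes in particular are guesses, since the article does not say what sysfs_register_tag_type() returns on failure. The mock keeps one operations pointer per tag type and calls mount_tag() on behalf of the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for enum sysfs_tag_type. */
enum model_tag_type {
    MODEL_TAG_NONE,
    MODEL_TAG_NET,
    MODEL_TAG_USER,
    MODEL_TAG_MAX
};

/* Mirrors the shape of struct sysfs_tag_type_operations. */
struct model_tag_ops {
    const void *(*mount_tag)(void);
};

/* One slot per tag type; NULL means "nobody registered yet". */
static const struct model_tag_ops *model_registered[MODEL_TAG_MAX];

/* Returns 0 on success, -EINVAL for bad arguments, -EBUSY if the type
 * already has an owner. These return values are assumptions. */
int model_register_tag_type(enum model_tag_type type,
                            const struct model_tag_ops *ops)
{
    if (type == MODEL_TAG_NONE || type >= MODEL_TAG_MAX ||
        ops == NULL || ops->mount_tag == NULL)
        return -EINVAL;
    if (model_registered[type] != NULL)
        return -EBUSY;
    model_registered[type] = ops;
    return 0;
}

/* Ask the owning subsystem which tag the caller should see;
 * NULL if no subsystem has registered for this type. */
const void *model_current_tag(enum model_tag_type type)
{
    assert(type > MODEL_TAG_NONE && type < MODEL_TAG_MAX);
    if (model_registered[type] == NULL)
        return NULL;
    return model_registered[type]->mount_tag();
}

/* Example subscriber: a single fake "network namespace". */
int model_net_ns;

static const void *model_net_mount_tag(void)
{
    return &model_net_ns;
}

const struct model_tag_ops model_net_ops = {
    .mount_tag = model_net_mount_tag,
};
```

A caller would register `model_net_ops` once for `MODEL_TAG_NET`, after which `model_current_tag(MODEL_TAG_NET)` yields the pointer that identifies the caller's view.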

Thereafter, there are two ways of associating tags with a sysfs hierarchy. One of those is to make a tagged directory directly with:

    int sysfs_make_tagged_dir(struct kobject *kobj, 
                              enum sysfs_tag_type type);

The directory associated with kobj will have differing contents depending on the value of the tag of the given type. The actual tag associated with the contents of this directory will be determined (at creation time) by calling a new function added to the kobj_type structure:

    const void *(*sysfs_tag)(struct kobject *kobj);

The sysfs_tag() function is usually a short series of container_of() calls which, eventually, locates the appropriate namespace for the given kobj.
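The container_of() chain can be illustrated outside the kernel. The sketch below defines a user-space equivalent of the macro and a set of heavily reduced, hypothetical structures (`model_kobject`, `model_device`, `model_net_device`, `model_net`); the real kernel structures are laid out differently, so this shows only the pointer arithmetic involved.

```c
#include <assert.h>
#include <stddef.h>

/* User-space equivalent of the kernel's container_of() macro: given a
 * pointer to a member, recover a pointer to the enclosing structure. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, heavily reduced stand-ins for the kernel structures. */
struct model_kobject { const char *name; };
struct model_device  { struct model_kobject kobj; };
struct model_net     { int id; };

struct model_net_device {
    struct model_net *ns;       /* namespace this interface lives in */
    struct model_device dev;    /* embedded device, which embeds the kobject */
};

/* Walk outward from the kobject to the network device, then return the
 * namespace pointer, which serves as the tag. */
const void *model_net_sysfs_tag(struct model_kobject *kobj)
{
    assert(kobj != NULL);
    struct model_device *dev =
        container_of(kobj, struct model_device, kobj);
    struct model_net_device *ndev =
        container_of(dev, struct model_net_device, dev);
    return ndev->ns;
}
```

The function is only valid for kobjects that really are embedded in a `model_net_device`; that restriction is why the callback belongs to a specific kobj_type rather than to kobjects in general.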

An alternative way to attach tags to a directory tree is to associate it directly with the class structure. To that end, struct class has two new members:

    enum sysfs_tag_type tag_type;
    const void *(*sysfs_tag)(struct device *dev);

When the class is instantiated, it will have tags of the given tag_type; the specific tag for a given device in that class will be found by calling the sysfs_tag() function.
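As a rough user-space illustration of the class-based approach, the sketch below models a class that carries a tag type and a per-device callback. The `model_` names and the rule that an untagged class yields a NULL tag are assumptions made for this example, not details taken from the patches.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for enum sysfs_tag_type. */
enum model_tag_type { MODEL_TAG_NONE, MODEL_TAG_NET };

/* Reduced device: it only remembers which namespace it belongs to. */
struct model_device { const void *ns; };

/* Reduced class carrying the two new members described above. */
struct model_class {
    const char *name;
    enum model_tag_type tag_type;
    const void *(*sysfs_tag)(struct model_device *dev);
};

/* Tag to use for `dev` when it appears under `cls`. Untagged classes
 * yield NULL, meaning "visible to everybody" in this model. */
const void *model_device_tag(const struct model_class *cls,
                             struct model_device *dev)
{
    assert(cls != NULL && dev != NULL);
    if (cls->tag_type == MODEL_TAG_NONE || cls->sysfs_tag == NULL)
        return NULL;
    return cls->sysfs_tag(dev);
}

/* Callback for the fake "net" class: the tag is the device's namespace. */
static const void *model_net_dev_tag(struct model_device *dev)
{
    return dev->ns;
}

const struct model_class model_net_class = {
    "net", MODEL_TAG_NET, model_net_dev_tag
};
const struct model_class model_plain_class = {
    "misc", MODEL_TAG_NONE, NULL
};
```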

Finally, if a specific tag ceases to be valid (because the associated namespace is destroyed, normally), a call should be made to:

    void sysfs_exit_tag(enum sysfs_tag_type type, const void *tag);

This call will cause all sysfs directories with the given tag to become invisible, and to be deleted when it is safe to do so.
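The "invisible now, deleted later" behavior can be sketched in user space as a two-step scheme: mark matching entries dead immediately, and leave actual reclamation for later. The names `model_entry`, `model_exit_tag()` and `model_visible()` are hypothetical, and the deferred deletion itself is left out of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical entry: name, owning tag, and a flag set once the tag exits. */
struct model_entry {
    const char *name;
    const void *tag;
    bool dead;
};

/* Hide every live entry carrying `tag`; returns how many were hidden.
 * Freeing the entries is deliberately not modeled here. */
size_t model_exit_tag(struct model_entry *entries, size_t count,
                      const void *tag)
{
    assert(entries != NULL || count == 0);
    size_t hidden = 0;
    for (size_t i = 0; i < count; i++) {
        if (!entries[i].dead && entries[i].tag == tag) {
            entries[i].dead = true;
            hidden++;
        }
    }
    return hidden;
}

/* An entry is visible only if it is alive and the viewer holds its tag. */
bool model_visible(const struct model_entry *e, const void *viewer_tag)
{
    assert(e != NULL);
    return !e->dead && e->tag == viewer_tag;
}
```

Calling `model_exit_tag()` a second time with the same tag hides nothing further, so the operation is idempotent in this model.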

Adding tagged directory support requires some significant changes to the sysfs code. But the interface has been designed to make it very easy for other subsystems to make use of tagged directories; it's a simple matter of providing functions to return the specific tag values which should be used. At this point, the biggest challenge might be making sense of sysfs when its contents may be different for each observer. But that is a challenge associated with namespaces in general.



Sysfs and namespaces

Posted Aug 28, 2008 9:26 UTC (Thu) by liljencrantz (subscriber, #28458)

Uninformed question:

Roughly how close are we to having fully working, usable namespaces in mainline kernel?

By «fully working, usable» I mean a setup where you can run multiple fake operating systems under the same actual kernel, each with its own init process and each running a different set of services. Basically everything you do today using Xen, but at a higher speed, with lower memory overhead, and without the option of running different kernel versions on different systems.

Sysfs and namespaces

Posted Aug 28, 2008 12:51 UTC (Thu) by danpb (subscriber, #4831)

We are working on supporting this in libvirt's "LXC" driver (LinuX Containers). This driver uses the clone() syscall along with the new CLONE_NEW{PID,UTS,USER,NS,IPC,NET} flags to create a container that is isolated from the "host" operating system.
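For readers who have not used these flags, here is a minimal sketch of what such a clone() call might look like. It is not libvirt's code: `container_flags()`, `run_in_container()` and the `private_net` knob are hypothetical names, and the child does nothing but exit. On most systems the call needs elevated privileges and a kernel built with the relevant namespace options, so clone() may simply fail with an error such as EPERM or EINVAL.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)

/* Build the namespace flag mask. `private_net` is a hypothetical knob
 * choosing whether the child also gets its own network namespace. */
int container_flags(int private_net)
{
    int flags = CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWUSER |
                CLONE_NEWNS | CLONE_NEWIPC;
    if (private_net)
        flags |= CLONE_NEWNET;
    return flags;
}

/* Entry point of the child; a real container would exec an init here. */
static int child_main(void *arg)
{
    (void)arg;
    return 0;
}

/* Start child_main() in new namespaces and wait for it. Returns the
 * child's exit status, or -1 with errno set if anything failed. */
int run_in_container(int flags)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* The stack grows downward on common architectures, so pass its top. */
    pid_t pid = clone(child_main, stack + STACK_SIZE,
                      flags | SIGCHLD, NULL);
    if (pid < 0) {
        free(stack);
        return -1;
    }

    int status = 0;
    pid_t waited = waitpid(pid, &status, 0);
    free(stack);
    if (waited < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

A caller would invoke `run_in_container(container_flags(1))` and check errno on a -1 return to distinguish a missing privilege from a kernel that lacks one of the namespaces.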

There are roughly two ways of using this capability:

- Workload isolation for applications. The application shares the same root filesystem as the host, perhaps with a few extra mount points and custom networking.

- Security isolation for applications. The application has a totally isolated private root filesystem, custom networking, etc - nothing is shared with the host OS.

As of 2.6.26, only the workload isolation use case is usable. Well, actually that's not quite true: we can do the private root filesystem too, but it is not secure because we're still lacking some kernel capabilities. For workload management we will be integrating with cgroups to control CPU/memory/etc. limits.

For the security isolation use case to be usable in the real world, the sysfs namespace patch is one of the core missing pieces. The second is a device namespace, so that the nodes in /dev/ and /dev/pts inside the container are separated from those of the host OS. The timeframe for this latter capability is not clear. If it appears before 2.6.29 I'd be surprised...

Sysfs and namespaces

Posted Aug 28, 2008 16:20 UTC (Thu) by iabervon (subscriber, #722)

Note that there's a different variation that might be useful (and might be complete either before or after that): being able to have different users see a partially different system. For example, giving each non-root user a different /tmp directory (subdirectories of the real /tmp). It would also be possible to have a single machine with multiple heads, where each of these would appear as the only (or, at least, main) head; if you plug a USB mouse into the USB hub built into your monitor, it controls your pointer and not anybody else's, for example, and you own the auto-mount of the USB memory stick you plug in. And it might be nice to be able to have a developer on a shared system able to run an instance of postgres that seems to that user to be system-wide, but is actually private, without the postgres processes able to tell that they're not system-wide.

Sysfs and namespaces

Posted Aug 28, 2008 18:01 UTC (Thu) by ebiederm (subscriber, #35028)

From a high level it looks something like:
- The last couple of bugs with signal handling and init
fixed in the pid namespace

- sysfs

- The uid namespace

If you are someone who can take less than perfection you can build
a better chroot today.

I'm hoping once the current round of changes settles out we
can get a chroot like tool out to people so non-experts can
start using this code.

The short term goal is not to be a Xen replacement but to correctly
implement the namespaces we have and to do something useful. Which
basically amounts to building a better chroot, and to start reducing
the differences between vserver and openVZ.

Eric

Sysfs and namespaces

Posted Sep 3, 2008 18:37 UTC (Wed) by jlokier (guest, #52227)

I find myself wondering if these containers are nestable.

That is, the whole reason we need any virtualisation is applications (whole working systems) expect something which strongly resembles a single Linux box. Virtualisation provides that illusion, while isolating the application.

In the old days, it was enough to use 'processes' and 'directories' :-)
But applications grew, and did cleverer things like configure their own firewalls and virtual networks, and decided they really depend on a thing which looks strongly like a single Linux box.

Pretty soon, someone is going to decide that these containers are really neat, that you can put Apache in one, DNS in another, SMTP in another, etc., and build whole working systems like that.

Then someone else is going to want to take that working system, and run _that_ in a container... Will it work? Will the containers nest?

Sysfs and namespaces

Posted Sep 4, 2008 18:06 UTC (Thu) by adobriyan (guest, #30858)

It should, in theory, work and nest.

Sysfs and namespaces

Posted Sep 4, 2008 20:18 UTC (Thu) by ebiederm (subscriber, #35028)

Yes. The in-kernel solutions are nestable. The out-of-tree solutions like OpenVZ and Vserver appear to have architecture limits that keep them from nesting today.

Eric

The network isn't very virtual under Linux

Posted Sep 4, 2008 20:01 UTC (Thu) by renox (subscriber, #23785)

At work, for various reasons, I needed to change the VLAN tag used on one board, but the name of the virtual interface in Linux is eth0.<tag>, which means that every part of our software which stored the name of the interface had to be modified: not good.

I asked if it was possible to create virtual interfaces with a 'logical name' independent of the value of the tag, but this isn't possible.

Oh well, at least Linux isn't naming the interfaces by the name of the HW maker as *BSD do I think, otherwise it'd be even worse..

The network isn't very virtual under Linux

Posted Sep 4, 2008 20:21 UTC (Thu) by ebiederm (subscriber, #35028)

ip link set eth0.tag name something_fixed

Hard coding your interface names in general is a bad idea,
but Linux can very much rename network interfaces, allowing
you to give them a logical name.

Eric

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds