In recent times there has been quite a bit of attention paid to hypervisors
and full virtualization (or paravirtualization) solutions. The proponents
of the container approach - where all virtualized systems run in
well-contained sandboxes on the host's kernel - have been relatively quiet.
They have not been idle, however, as can be seen in the large amount of
work going into network namespaces.
For the container approach to work, every global resource in the system
must be wrapped in some sort of namespace. This wrapping has been done for
some relatively simple resources, such as the utsname information or
process IDs; some of the resulting code has already found its way into the
mainline. There is not a whole lot of use, however, for containers which
are completely isolated from the rest of the world; usually some sort of
networking capability is needed. For example, containers can usefully
contain a web browser (keeping it from exposing the rest of the system
should it prove vulnerable) or a web server - but only if networking
works. But containers should not be able to see each others' packet
streams, and, ideally, should be able to bind to the same ports without
interfering with each other.
Making that work requires network namespaces. These namespaces virtualize
all access to network resources - interfaces, port numbers, etc., -
allowing each container the network access it needs (but no more). As with
all other problems in computer science, the network namespace issue can be
addressed with another layer of indirection. There is a small problem with
this approach, however: the networking code is a vast pile of complex,
highly-tuned code overseen by developers who have little tolerance for
changes which introduce performance overhead or potential bugs. Getting
any sort of network namespace implementation merged is going to require
quite a bit of very careful work.
One approach can be seen in the L2 network namespace patch set
posted recently by Dmitry Mishin. These patches concentrate on the lower
levels of the network stack, trying to get proper namespaces established
for network devices and the IPv4 layer. In an attempt to minimize churn in
the networking code, the L2 namespace patch introduces the idea of the
"current network namespace," kept in a per-CPU variable. The current
namespace is implemented as a stack, with push and pop operations; in
theory, it allows all network operations to happen within the proper
namespace. Your editor was unable to convince himself that this scheme
would work properly in the face of any sort of kernel preemption, but that
may just be a matter of not having looked hard enough.
The net_device structure gains a net_ns field, providing
the namespace to which the device belongs. It is set to whatever namespace
is current when the device is created. The device lookup functions have
become namespace-aware; if a device does not belong to the current
namespace, it becomes invisible. A different version of the loopback
device is created for each namespace. Then, the IPv4 routing code has been
extended so that each namespace gets its own set of routing tables. The
code which matches incoming packets to sockets has also been made
namespace-aware; there is still a single hash table, but the namespace has
been made part of the match criteria.
Network interfaces made up of real hardware will normally remain in the
root namespace. Communication with other namespaces is made possible by
way of a "virtual Ethernet" device, included with the patch set. A virtual
device can be thought of as a wire into a restricted namespace; it presents
one device within that namespace and one in the parent (normally root)
namespace. Packets written to one end show up at the other. With the
addition of a few routing rules in the root namespace, packets meeting the
right criteria can be directed into (and out of) specific namespaces.
The L2 namespace patch provides the plumbing for the creation of little
virtualized Internets within a single system, but they do not yet provide
complete isolation. A process within its namespace can reconfigure its
interfaces, perhaps creating problems for the system as a whole.
Tightening things down is left to the L3 namespace patch, posted by
Daniel Lezcano. An L3 namespace is always the child of an L2 namespace; it
is the end of the line, however, being unable to have child namespaces of
its own. There are also no network admin capabilities in an L3 namespace;
once an L3 namespace is created, it is stuck with whatever network
configuration its parent gave it.
The end result is that a contained system can be put within an L3 namespace
and it should be able to perform networking without interfering with (or
even seeing) other systems in other namespaces.
A somewhat different approach can be seen in the network namespace patches
posted by Eric W. Biederman. Eric, aware of the challenges involved in
getting network namespaces merged, is far more concerned with the process
than the specific namespace implementation. So his patches focus mostly on
getting the internal APIs right.
The first step is to figure out how network namespaces are to be
represented. Rather than use a structure, Eric has opted for a mechanism
which marks all network-related global resources in a special way. These
resources get linked into a special section of the kernel which can be
cloned when a new namespace is created. Each global variable becomes an
offset into the per-namespace section; it must be accessed by way of a
special macro. This approach appears cumbersome, but it has a couple of
advantages. If a module with per-namespace variables is loaded, those
variables can be added to each existing namespace on the fly. And, if
namespaces are not in use, the overhead of the whole mechanism drops to
zero. This is an important feature: to have a hope of being merged, a
network namespace implementation will have to have no impact on systems
which are not using it.
The patch set (31 parts strong) then works through various parts of the
networking API, adding a namespace parameter to functions which need it.
There is no global "current namespace" concept in Eric's patches; it is,
instead, an explicit parameter everywhere. Thus, for example, every
function which creates a socket (they exist in every protocol
implementation) gets a namespace parameter. The sk_buff structure
(which represents a packet) has a namespace field assigned from either the
process creating it (for outbound packets) or the device it was received
from; the various protocol-specific functions are expected to take that
namespace into account. Functions dealing with netlink sockets get
namespace parameters, as do those which implement network device lookup, event
generation, and Unix-domain sockets. Like the L2 patches, Eric's
implementation includes a virtual network device (called "etun") which can
be use to route packets between namespaces.
Unlike the L2/L3 patches, Eric's work deals with the virtualization of the
networking-related /proc, sysctl, and sysfs interfaces. Doing so
requires adding shadow directory
support to sysfs. Shadow directories loosen the connection between
sysfs and the internal kobject hierarchy, allowing different namespaces to
see different contents in the same locations.
A key aspect of Eric's patch is that it implements little namespace
mechanism. Instead, much of the networking stack is made to test the
namespace it is given and fail if the root namespace is not in use. The
idea is to get the interfaces right first, then to start to fill in the
mechanism in relatively small pieces. The tests ensure that the network
stack will not surprise users by doing the wrong thing if it is not yet
fully prepared to handle non-root namespaces.
Despite the posting of all these patches, the amount of discussion has been
quite low. One gets the sense that the network developers have not yet
started to take these patches seriously. This issue seems unlikely to go
away, however; there remains a great deal of interest in getting container
features into the mainline kernel. Sooner or later, this discussion is
likely to take off.
to post comments)