Brief items
The 2.6.29 kernel is out, released by Linus on March 23. For those just
tuning in, some of the most significant features of 2.6.29 include the
Btrfs filesystem (still very much in an experimental mode), the squashfs
filesystem, kernel mode setting for Intel graphics adapters, task
credentials, WiMAX support, the filesystem freeze feature, and much more;
see the KernelNewbies 2.6.29 page for all the details.
As of this writing, merging of changes for 2.6.30 has not yet begun.
The 2.6.27.21 and 2.6.28.9 stable kernel updates were released
on March 23. Both contain a long list of fixes for bugs in the USB
subsystem, i915 graphics driver, device mapper, and sound subsystems (and
beyond).
Kernel development news
Well, i consider kernel development to be just another form of
software development, so i don't subscribe to the view that it is
intrinsically different. (Yes, the kernel has many unique aspects -
but most software projects have unique aspects.)
In terms of development methodology and tools, in fact i claim that
the kernel workflow and style of development can be applied to most
user-space software projects with great success.
--
Ingo Molnar
And I'd like to point out that largely *because* NetworkManager
usually doesn't work around stupid drivers and bad infrastructure,
but instead encourages developers (including myself) to fix that
infrastructure and drivers, we've come quite a long way in driver
quality over the past few years.
NetworkManager is both the carrot and the stick. If NM just worked
around broken stuff and proprietary drivers, it would be a
hacktower of doom and we may still be stuck largely in
2006-wireless land.
--
Dan Williams
Where's your bravery, man? :-)
I've been using [ext4] on my laptop since July, and haven't lost
significant amounts of data yet.
--
Ted Ts'o
Trying to play God by fsync'ing your file descriptor is first of
all, a very selfish thing, and second of all, rather less effective
than you seem to think it is.
--
Tom Christiansen is back
By Jake Edge
March 25, 2009
An in-kernel tracing infrastructure for user-space code, utrace, has long
been in a kind of pending state; it has shipped in every Fedora kernel
since Fedora Core 6, and has done some time in the -mm tree, but it has
never gotten into the mainline. That may now be changing, given a recent
push for inclusion of the
core utrace code. There are some lingering questions about including
utrace, at least for 2.6.30, because the patchset doesn't add any
in-kernel user of the interface.
Utrace grew out of Roland McGrath's work on maintaining the
ptrace() system call. That call is used by user-space programs
to do things like trace system calls using strace, but it is also
used in less obvious ways—to implement user-mode-linux (UML) for
example. While ptrace() has generally sufficed,
it is, by all accounts, a rather ugly and flawed interface both for kernel
hackers to maintain and for developers to use. McGrath described the genesis of utrace in a recent
linux-kernel post:
I hatched the essential design of utrace when I'd recently spent a whole
lot of time fixing the innards of ptrace and a whole lot of time helping
userland implementors of debuggers and the like figure out how to work
with ptrace (and hearing their complaints about it). At the same time,
the group I'm in (still) was contemplating both the implementation
issues of a generic debugger, how to make it tractable to work up to far
smarter debuggers, and also the design of what became systemtap.
Basically, utrace implements a framework for controlling user-space tasks.
It provides an interface that can be used by various tracing "engines",
implemented
as loadable kernel modules, that wish to be notified of events that occur
on threads
of interest. As might be expected, engines register callback functions for
specific events, then attach to whichever thread they wish to trace.
The
callbacks are made from "safe" places in the kernel, which allows the
functions great leeway in the kinds of processing they can do.
No locks are held when the callbacks are made, so they can block for a short
time (in calls like
kmalloc()), but they shouldn't block for long periods. Doing so
risks preventing the SIGKILL signal from working properly. If the
callback needs to wait for I/O or block on some other long-running
activity, it should stop the execution of the thread and return, then
resume the thread when the operation completes.
There are various events that can be watched via utrace: system call entry
and exit, fork(), signals being sent to the task, etc.
Single-stepping through a task being traced can also be handled via
utrace. One of the benefits that utrace provides, which ptrace()
lacks, is the ability to have multiple engines tracing the same task.
Utrace is well documented in a DocBook manual
included with the patch.
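The engine model described above can be pictured with a small user-space sketch. This is only a toy model of the idea (engines register callbacks for events, attach to a task, and several engines can watch the same task); the class names, event names, and methods here are invented for illustration and are not the utrace kernel API.

```python
class Engine:
    """Toy tracing engine: a named bag of event callbacks (hypothetical)."""
    def __init__(self, name):
        self.name = name
        self.callbacks = {}

    def on(self, event, callback):
        # Register interest in one kind of event.
        self.callbacks[event] = callback
        return self


class Task:
    """Toy stand-in for a traced thread; not a kernel structure."""
    def __init__(self, name):
        self.name = name
        self.engines = []      # several engines may attach to one task

    def attach(self, engine):
        self.engines.append(engine)

    def report(self, event, **details):
        # Deliver the event to every attached engine that asked for it.
        for engine in self.engines:
            callback = engine.callbacks.get(event)
            if callback is not None:
                callback(self, **details)


syscall_log = []
signal_log = []

tracer = Engine("syscall-tracer").on(
    "syscall_entry", lambda task, nr: syscall_log.append((task.name, nr)))
auditor = Engine("auditor").on(
    "syscall_entry", lambda task, nr: syscall_log.append(("audit", nr)))
auditor.on("signal", lambda task, signo: signal_log.append(signo))

task = Task("worker")
task.attach(tracer)
task.attach(auditor)
task.report("syscall_entry", nr=1)   # both engines see this event
task.report("signal", signo=15)      # only the auditor registered for signals
```

Both engines observe the same system call entry, which is the multiple-engine property that ptrace() lacks.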
LWN first looked at utrace
just over two years ago, but, since then, it has largely disappeared from
view. Reimplementing ptrace() using utrace is
certainly one of the goals, but the current patches do not do that. But,
there is a fundamental disagreement between McGrath and other kernel
hackers about whether utrace can be merged without it. The problem is that
there is no in-tree user of the new interface, and, as Ted Ts'o put it, "we need
to have a user for the kernel interface along with the new kernel
interface".
The proposed utrace patchset consists of a small patch to clean up some of
the tracehook functionality, a large 4000 line patch that implements the
utrace core, and another patch that adds an ftrace tracer that is based on
utrace event handling. The latter, implemented by SystemTap
developer Frank Eigler, would provide an in-tree user of the new utrace
code, but received a rather chilly response
from Ingo Molnar: "[...] without the
ftrace plugin the
whole utrace machinery is just something that provides a _ton_ of
hooks to something entirely external: SystemTap mainly."
Therein lies one of the main concerns expressed about utrace. The
utrace-ftrace interface is not seen as a real user of utrace, more of a
"big distraction", as Andrew Morton called it. The worry is that adding utrace
just makes it easier to keep SystemTap out of the mainline. While the
kernel hackers have some serious reservations about the specifics of the
SystemTap implementation, they would like to see it head towards the
mainline. The fear is that by merging things like utrace, it may enable
SystemTap to stay out of the mainline that much longer. Molnar posted his take on the issue, concluding:
Putting utrace upstream now will just make it more
convenient to have SystemTap as a separate entity - without any of
the benefits. Do we want to do that? Maybe, but we could do better i
think.
In addition, Molnar is not pleased that the utrace changes haven't been
reviewed by the ftrace developers and were submitted just as the merge
window for 2.6.30 is about to open. He believes that McGrath, Eigler, and
the other utrace developers should be working with the ftrace team:
kernel/utrace.c should probably be introduced as
kernel/trace/utrace.c not kernel/utrace.c. It also overlaps pending
work in the tracing tree and cooperation would be nice and desired.
The ftrace/utrace plugin is the only real connection utrace has to
the mainline kernel, so proper review by the tracing folks and
cooperation with the tracing folks is very much needed for the whole
thing.
But McGrath sees things rather differently. From his perspective, utrace
has enough usefulness in its own right—not primarily as just a piece
of SystemTap—to be considered for the mainline. Several different
uses for utrace, in addition to the ptrace() cleanup, were
mentioned in the thread: kmview, a kernel
module for virtualization; uprobes for DTrace-style user-space probing;
changing UML to use utrace directly, rather than ptrace(); and
more. Eigler also defended utrace as a
standalone feature:
utrace is a better way to perform user thread management than what is
there now, and the utrace-ftrace widget shows how to *hook* thread
events such as syscalls in a lighter weight / more managed way than
the first one proposed.
Molnar would like to see the "rewrite-ptrace-via-utrace" patch included
before merging utrace. That would give the facility a solid in-kernel
user, which could be used by other kernel developers to test and debug
utrace. But, McGrath is not yet ready to
submit that code:
The utrace-ptrace code there today is really not
very nice to look at, and it's not ready for prime time. As has been
mentioned, it is a "pure clean-up exercise". As such, it's not the top
priority. It also didn't seem to me like much of an argument for merging
utrace: "Look, more code and now it still does the same thing!"
In some ways, the association with SystemTap is unfairly coloring the
reaction to utrace. Molnar posted an excellent summary of the issues that stop him (and other
kernel hackers) from using SystemTap—along with some possible
solutions—but utrace and SystemTap aren't equivalent. It may not
make sense to merge utrace without a serious in-kernel user of the
interface, but most of the rest of the arguments have been about SystemTap,
not utrace. As McGrath puts it:
This ptrace work really buys nothing with immediate pay-off at all. It's a
real shame if its lack keeps people from actually looking at utrace itself.
(This has been a long conversation so far with zero discussion of the code.)
A collaboration with focus on what new things can be built, rather than on
reasons not to let the foundations be poured, would be a lovely thing.
It remains to be seen whether utrace will make its way into 2.6.30 or not.
Linus Torvalds was unimpressed with
utrace dominating Fedora kerneloops.org reports, as relayed by Molnar—though the bug that
caused those problems has been long fixed. McGrath sees value in
merging utrace before the ptrace() rewrite is ready, while other
kernel developers do not. If utrace misses this merge window, it would
seem likely that it will return for 2.6.31, along with the rewrite; at that
point merging would seem quite likely.
March 25, 2009
This article was contributed by Valerie Aurora (formerly Henson)
In
last week's article,
I reviewed the use cases, basic concepts, and common design problems
of unioning file systems. This week, I'll describe several
implementations of unioning file systems in technical detail. The
unioning file systems I'll cover in this article are Plan 9 union
directories, BSD union mounts, and Linux union mounts. The next article
will cover unionfs, aufs, and possibly one or two other unioning file
systems, and wrap up the series.
For each file system, I'll describe its basic architecture, features,
and implementation. The discussion of the implementation will focus
in particular on whiteouts and directory reading. I'll wrap up with
a look at the software engineering aspects of each implementation;
e.g., code size and complexity, invasiveness, and burden on file system
developers.
Before reading this article, you might want to check out Andreas
Gruenbacher's just published write-up of
the union mount workshop
held last November. It's a good summary of the unioning file systems
features which are most pressing for distribution developers. From
the introduction: "All of the use cases we are interested in basically
boil down to the same thing: having an image or filesystem that is
used read-only (either because it is not writable, or because writing
to the image is not desired), and pretending that this image or
filesystem is writable, storing changes somewhere else."
Plan 9 union directories
The Plan 9 operating system (browseable source code here) implements
unioning in its own special Plan 9
way. In Plan 9 union directories, only the top-level directory
namespace is merged, not any subdirectories. Unconstrained by UNIX
standards, Plan 9 union directories don't implement whiteouts and
don't even screen out duplicate entries - if the same file name
appears in two file systems, it is simply returned twice in directory
listings.
A Plan 9 union directory is created like so:
bind -a /home/val/bin/ /bin
This would cause the directory /home/val/bin to be union mounted "after"
(the -a option) /bin; other options are to place the new directory before
the existing directory, or to replace the existing directory entirely.
(This seems an odd ordering to me, since I like commands in my personal
bin/ to take precedence over the system-wide commands, but that's the
example from the Plan 9 documentation.) Brian Kernighan explains one of
the uses of union directories: "This mechanism of union directories
replaces the search path of conventional UNIX shells. As far as you are
concerned, all executable programs are in /bin." Union directories can
theoretically replace many uses of the fundamental UNIX building blocks of
symbolic links and search paths.
Without whiteouts or duplicate elimination, readdir() on
union directories is trivial to implement. Directory entry offsets
from the underlying file system correspond directly to the offset in
bytes of the directory entry from the beginning of the directory. A
union directory is treated as though the contents of the underlying
directories are concatenated together.
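As a rough illustration of how little work this involves, here is a sketch of a listing built by plain concatenation. The function name and the sample directory contents are made up for the example; it models only the behavior described above, not Plan 9 code.

```python
def union_listing(*directories):
    """Concatenate directory listings in bind order.

    No whiteouts and no duplicate elimination: a name present in two
    member directories simply shows up twice.
    """
    listing = []
    for entries in directories:
        listing.extend(entries)
    return listing


# Hypothetical contents of /bin and of /home/val/bin bound "after" it.
bin_entries = ["ls", "cat", "rc"]
home_bin_entries = ["ls", "mytool"]

listing = union_listing(bin_entries, home_bin_entries)
```

The duplicated "ls" entry is exactly the case that a UNIX-style union would have to filter out.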
Plan 9 implements an alternative to readdir() worth
noting, dirread().
dirread() returns structures of type Dir,
described in the stat()
man page. The important part of the Dir is
the Qid member. A Qid is:
...a structure
containing path and vers fields: path is
guaranteed to be unique among
all path names currently on the file server, and vers changes each
time the file is modified. The path is a long long (64 bits, vlong)
and the vers is an unsigned long (32 bits, ulong).
So why is this interesting? One of the
reasons readdir() is such a pain to implement is that it
returns the d_off member of struct dirent, a
single off_t (32 bits unless the application is compiled
with large file support), to mark the directory entry where an
application should continue reading on the next readdir()
call. This works fine as long as d_off is a simple byte
offset into a flat file of less than 2^32 bytes and existing directory
entries are never moved around - not the case for many modern file
systems (XFS, btrfs, ext3 with htree indexes). The
96-bit Qid is a much more useful place marker than the 32
or 64-bit off_t. For a good summary of the issues involved in
implementing readdir(),
read Theodore
Y. Ts'o's excellent post on the topic to the btrfs mailing list.
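The size comparison can be checked with a few lines of arithmetic. The packed layout below is only a convenient way to count bits for the two fields quoted above; it is an assumption for illustration and not a claim about how Plan 9 lays out a Qid in memory or on the wire.

```python
import struct

# 64-bit path followed by a 32-bit vers, little-endian, no padding.
QID_FORMAT = "<QI"
qid_bits = struct.calcsize(QID_FORMAT) * 8

# A 32-bit off_t can distinguish at most this many positions.
positions_32bit = 2 ** 32

# Example values are arbitrary; they just show that a path too large
# for 32 bits still round-trips through the packed form.
example = struct.pack(QID_FORMAT, 2 ** 40 + 7, 3)
path, vers = struct.unpack(QID_FORMAT, example)
```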
From a software engineering standpoint, Plan 9 union directories are
heavenly. Without whiteouts, duplicate entry elimination, complicated
directory offsets, or merging of namespaces beyond the top-level
directory, the implementation is simple and easy to maintain.
However, any practical implementation of unioning file systems for
Linux (or any other UNIX) would have to solve these problems. For our
purposes, Plan 9 union directories serve primarily as inspiration.
BSD union mounts
BSD implements two forms of unioning: the "-o union" option to the mount
command, which produces a union directory similar to Plan 9's, and the
mount_unionfs command, which implements a more full-featured unioning file
system with whiteouts and merging of the entire namespace. We will focus
on the latter.
For this article, we use two sources for specific implementation
details: the original BSD union mount implementation as described in
the 1995 USENIX paper
Union
mounts in 4.4BSD-Lite [PS], and
the FreeBSD
7.1 mount_unionfs man page and source code. Other
BSDs may vary.
A directory can be union mounted either "below" or "above" an existing
directory or union mount, as long as the top branch of a writable
union is writable. Two modes of whiteouts are supported: either a
whiteout is always created when a directory is removed, or it is only
created if another directory entry with that name currently exists in
a branch below the writable branch. Three modes for setting the
ownership and mode of copied-up files are supported. The simplest is
transparent, in which the new file keeps the same owner
and mode of the original. The masquerade mode makes
copied-up files owned by a particular user and supports a set of
mount options for determining the new file mode.
The traditional mode sets the owner to the user who ran
the union mount command, and sets the mode according to the umask at
the time of the union mount.
Whenever a directory is opened, a directory of the same name is
created on the top writable layer if it doesn't already exist. From
the paper:
By creating shadow directories aggressively during lookup the union
filesystem avoids having to check for and possibly create the chain of
directories from the root of the mount to the point of a copy-up.
Since the disk space consumed by a directory is negligible, creating
directories when they were first traversed seemed like a better
alternative.
As a result, a "find /union" will result in copying every
directory (but not directory entries pointing to non-directories) to
the writable layer. For most file system images, this will use a
negligible amount of space (less than, e.g., the space reserved for
the root user, or that taken up by unused inodes in an FFS-style file
system).
A file is copied up to the top layer when it is opened with write
permission or the file attributes are changed. (Since directories are
copied over when they are opened, the containing directory is
guaranteed to already exist on the writable layer.) If the file to be
copied up has multiple hard links, the other links are ignored and the
new file has a link count of one. This may break applications that
use hard links and expect modifications through one link name to show
up when referenced through a different hard link. Such applications
are relatively uncommon, but no one has done a systematic study to see
which applications will fail in this situation.
Whiteouts are implemented with a special directory entry
type, DT_WHT. Whiteout directory entries don't refer to
any real inode, but for easy compatibility with existing file system
utilities such as fsck, each whiteout directory entry
includes a faux inode number, the WINO reserved whiteout
inode number. The underlying file system must be modified to support
the whiteout directory entry type. New directories that replace a
whiteout entry are marked as opaque via a new "opaque" inode attribute
so that lookups don't travel through them (again requiring minimal
support from the underlying file system).
Duplicate directory entries and whiteouts are handled in the userspace
readdir() implementation. At opendir()
time, the C library reads the directory all at once, removes
duplicates, applies whiteouts, and caches the results.
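The merge step can be sketched as follows. This is a simplified model, assuming each layer is presented as a list of (name, kind) pairs with the top layer first; the data layout and the function name are invented for the example, and opaque directories are not modeled.

```python
WHITEOUT = "whiteout"


def merged_readdir(layers):
    """Return the visible names of a union directory.

    layers: list of directory listings, topmost layer first; each
    listing is a list of (name, kind) tuples. The first layer to
    mention a name decides its fate, so lower duplicates are dropped
    and a whiteout hides every lower entry of the same name.
    """
    decided = set()
    visible = []
    for entries in layers:
        for name, kind in entries:
            if name in decided:
                continue          # duplicate, or hidden by a whiteout above
            decided.add(name)
            if kind != WHITEOUT:
                visible.append(name)
    return visible


top = [("Makefile", "file"), ("old.c", WHITEOUT)]
bottom = [("Makefile", "file"), ("old.c", "file"), ("main.c", "file")]
entries = merged_readdir([top, bottom])
```

Doing this once at opendir() time and caching the result is what keeps the later readdir() calls cheap.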
BSD union mounts don't attempt to deal with changes to branches below
the writable top branch (although they are permitted). The
way rename() is handled is not described.
An example from the mount_unionfs man page:
The commands
mount -t cd9660 -o ro /dev/cd0 /usr/src
mount -t unionfs -o noatime /var/obj /usr/src
mount the CD-ROM drive /dev/cd0 on /usr/src and then attaches /var/obj on
top. For most purposes the effect of this is to make the source tree
appear writable even though it is stored on a CD-ROM. The -o noatime
option is useful to avoid unnecessary copying from the lower to the upper
layer.
Another example (noting that I believe source control is best
implemented outside of the file system):
The command
mount -t unionfs -o noatime -o below /sys $HOME/sys
attaches the system source tree below the sys directory in the user's
home directory. This allows individual users to make private changes to
the source, and build new kernels, without those changes becoming visible
to other users.
Linux union mounts
Like BSD union mounts, Linux union mounts implement file system
unioning in the VFS layer, with some minor support from underlying
file systems for whiteouts and opaque directory tags. Several
versions of these patches exist, written and modified by Jan Blunck,
Bharata B. Rao, and Miklos Szeredi, among others.
One version of this code merges the top-level directories only,
similar to Plan 9 union directories and the BSD -o union
mount option. This version of union mounts, which I refer to as union
directories, is described in some detail in a
recent LWN article by
Goldwyn Rodrigues and
in Miklos Szeredi's recent
post of an updated patch set. For the remainder of this article,
we will focus on versions of union mount that merge the full
namespace.
Linux union mounts are currently under active development. This
article describes the version released by Jan Blunck against Linux
2.6.25-mm1, util-linux 2.13, and e2fsprogs 1.40.2. The patch sets, as
quilt series, can be downloaded from Jan's ftp site:
Kernel patches: ftp://ftp.suse.com/pub/people/jblunck/patches/
Utilities: ftp://ftp.suse.com/pub/people/jblunck/union-mount/
I have created a web page with links to git versions of the above
patches and some HOWTO-style documentation
at http://valerieaurora.org/union.
A union is created by mounting a file system with
the MS_UNION flag
set. (The MS_BEFORE, MS_AFTER,
and MS_REPLACE flags are defined in the mount code
base but not currently used.) If the MS_UNION flag is
specified, then the mounted file system must either be read-only or
support whiteouts. In this version of union mounts, the union mount
flag is specified by the "-o union" option
to mount. For example, to create a union of two loopback
device file systems, /img/ro and /img/rw, you would run:
# mount -o loop,ro,union /img/ro /mnt/union/
# mount -o loop,union /img/rw /mnt/union/
Each union mount creates a
struct union_mount:
struct union_mount {
    atomic_t u_count;           /* reference count */
    struct mutex u_mutex;
    struct list_head u_unions;  /* list head for d_unions */
    struct hlist_node u_hash;   /* list head for searching */
    struct hlist_node u_rhash;  /* list head for reverse searching */
    struct path u_this;         /* this is me */
    struct path u_next;         /* this is what I overlay */
};
As described
in
Documentation/filesystems/union-mounts.txt, "All
union_mount structures are cached in two hash tables, one for lookups
of the next lower layer of the union stack and one for reverse lookups
of the next upper layer of the union stack."
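The two-table arrangement can be modeled with a pair of dictionaries. This is a user-space sketch only: the class, its methods, and the use of (mount, path) tuples as keys are assumptions made for the example, standing in for the u_this and u_next paths in the structure above.

```python
class UnionTable:
    """Toy model of the forward and reverse union lookup tables."""

    def __init__(self):
        self.lower_of = {}   # upper layer -> layer it overlays
        self.upper_of = {}   # lower layer -> layer stacked on top of it

    def add(self, upper, lower):
        self.lower_of[upper] = lower
        self.upper_of[lower] = upper

    def stack_from(self, top):
        """Walk downward from the top layer and return the whole stack."""
        stack = [top]
        while stack[-1] in self.lower_of:
            stack.append(self.lower_of[stack[-1]])
        return stack


table = UnionTable()
table.add(("rw", "/"), ("ro", "/"))
stack = table.stack_from(("rw", "/"))
above_ro = table.upper_of[("ro", "/")]
```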
Whiteouts and opaque directories are implemented in much the same way
as in BSD. The underlying file system must explicitly support whiteouts
by defining the .whiteout inode operation for directories
(currently, whiteouts are only implemented for ext2, ext3, and tmpfs).
The ext2 and ext3 implementations use the whiteout directory entry
type, DT_WHT, which has been defined
in include/linux/fs.h for years but not used outside of
the Coda file system until now. A reserved whiteout inode
number, EXT3_WHT_INO, is defined but not yet used;
whiteout entries currently allocate a normal inode. A new inode
flag, S_OPAQUE, is defined to mark directories as opaque.
As in BSD, directories are only marked opaque when they replace a
whiteout entry.
Files are copied up when the file is opened for writing. If
necessary, each directory in the path to the file is copied to the top
branch (copy-on-demand of directories). Currently, copy up is only
supported for regular files and directories.
readdir() is one of the weakest points of the current
implementation. It is implemented the same way as BSD union mount
readdir(), but in the kernel. The d_off
field is set to the offset within the current underlying directory,
minus the sizes of the previous directories. Directory entries from
directories underneath the top layer must be checked against previous
entries for duplicates or whiteouts. As currently implemented,
each readdir() (technically, getdents())
system call reads all of the previous directory entries into an
in-kernel cache, then compares each entry to be returned with those
already in the cache before copying it to the user buffer. The end
result is that readdir() is complex, slow, and
potentially allocates a great deal of kernel memory.
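A back-of-the-envelope cost model shows why this hurts on large directories. The sketch below assumes that every call rebuilds the cache of all earlier entries before returning its own batch; the function name and the batch sizes are illustrative, not measurements of the actual patches.

```python
def entries_scanned(total_entries, per_call):
    """Count directory entries touched while listing a whole directory.

    Each simulated getdents() call first re-reads everything returned
    so far (to rebuild the duplicate/whiteout cache), then reads its
    own batch.
    """
    scanned = 0
    position = 0
    while position < total_entries:
        batch = min(per_call, total_entries - position)
        scanned += position + batch
        position += batch
    return scanned


# 1000 entries returned 100 at a time, versus in a single call.
cost_small_batches = entries_scanned(1000, 100)
cost_single_call = entries_scanned(1000, 1000)
```

With a persistent cache the cost would stay at one pass over the directory regardless of how many calls the application makes.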
One solution is to take the BSD approach and do the caching, whiteout,
and duplicate processing in userspace. Bharata B. Rao
is designing
support for union mount readdir() in glibc.
(The POSIX standard permits readdir() to be implemented
at the libc level if the bare kernel system call does not fulfill all
the requirements.) This would move the memory usage into the
application and make the cache persistent. Another solution would be
to make the in-kernel cache persistent in some way.
My suggestion is to take a technique from BSD union mounts and extend
it: proactively copy up not just directory entries for directories,
but all of the directory entries from lower file systems, process
duplicates and whiteouts, make the directory opaque, and write it out
to disk. In effect, you are processing the directory entries for
whiteouts and duplicates on the first open of the directory, and then
writing the resulting "cache" of directory entries to disk. The
directory entries pointing to files on the underlying file systems
need to signify somehow that they are "fall-through" entries (the
opposite of a whiteout - it explicitly requests looking up an object
in a lower file system). A side effect of this approach is that
whiteouts are no longer needed at all.
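Here is a minimal sketch of that first-open step, assuming the top layer is given as a mapping from name to entry kind and each lower layer as a plain list of names. The representation and the "fallthrough" marker are invented for illustration; how such an entry would be stored on disk is a separate question.

```python
WHITEOUT = "whiteout"
FALLTHROUGH = "fallthrough"


def materialize(top, lower_layers):
    """Build the complete, opaque top-level directory.

    Entries already on the top layer are kept, whiteouts are applied
    and then dropped, and every remaining lower-layer name becomes a
    fall-through entry.
    """
    hidden = {name for name, kind in top.items() if kind == WHITEOUT}
    result = {name: kind for name, kind in top.items() if kind != WHITEOUT}
    for layer in lower_layers:
        for name in layer:
            if name in hidden or name in result:
                continue          # whited out, or already provided above
            result[name] = FALLTHROUGH
    return result


top = {"Makefile": "file", "old.c": WHITEOUT}
lower = [["Makefile", "old.c", "main.c"]]
directory = materialize(top, lower)
```

The resulting directory contains no whiteout entries at all: a name that is absent is simply absent.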
One problem that needs to be solved with this approach is how to
represent directory entries pointing to lower file systems. A number
of solutions present themselves: the entry could point to a reserved
inode number, the file system could allocate an inode for each entry
but mark it with a new S_LOOKOVERTHERE inode attribute,
it could create a symlink to a reserved target, etc. This approach
would use more space on the overlying file system, but all other
approaches require allocating the same space in memory, and generally
memory is more dear than disk.
A less pressing issue with the current implementation is that inode
numbers are not stable across boot
(see the previous unioning
file systems article for details on why this is a problem).
If "fall-through" directories are implemented by allocating an inode
for each directory entry on underlying file systems, then stable inode
numbers will be a natural side effect. Another option is to store a
persistent inode map somewhere - in a file in the top-level directory,
or in an external file system, perhaps.
Hard links are handled - or, more accurately, not handled - in the
same way as BSD union mounts. Again, it is not clear how many
applications depend on modifying a file via one hard-linked path and
seeing the changes via another hard-linked path (as opposed to symbolic
link). The only method I can come up with to handle this correctly is
to keep a persistent cache somewhere on disk of the inodes we have
encountered with multiple hard links.
Here's an example of how it would work: Say we start a copy up for
inode 42 and find that it has a link count of three. We would create an
entry for the hard link database that includes the file system id, the
inode number, the link count, and the inode number of the new copy on
the top level file system. It could be stored in a file in CSV
format, or as a symlink in a reserved directory in the root directory
(e.g., "/.hardlink_hack/<fs_id>/42", which is a
link to "<new_inode_num> 3"), or in a real
database. Each time we open an inode on an underlying file system, we
look it up in our hard link database; if an entry exists, we decrement
the link count and create a hard link to the correct inode on the new
file system. When all of the paths are found, the link count drops to
one and the entry can be deleted from the database. The nice thing
about this approach is that the amount of overhead is bounded and will
disappear entirely when all the paths to the relevant inodes have been
looked up. However, this still introduces a significant amount of
possibly unnecessary complexity; the BSD implementation shows that
many applications will happily run with not-quite-POSIXLY-correct hard
link behavior.
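The bookkeeping in this scheme is small enough to sketch in a few lines. The class below is a hypothetical in-memory stand-in for the on-disk database; its names and the inode numbers in the example are made up, and it follows the counting described above (the entry starts at the original link count and is deleted when the count drops to one).

```python
class HardLinkDB:
    """Toy hard link database keyed by (file system id, inode number)."""

    def __init__(self):
        self.entries = {}

    def record_copy_up(self, fs_id, ino, link_count, new_ino):
        # Only inodes with more than one link need tracking.
        if link_count > 1:
            self.entries[(fs_id, ino)] = {"new_ino": new_ino,
                                          "links": link_count}

    def lookup_on_open(self, fs_id, ino):
        """Return the top-layer inode to hard-link to, or None."""
        entry = self.entries.get((fs_id, ino))
        if entry is None:
            return None
        entry["links"] -= 1
        if entry["links"] <= 1:
            del self.entries[(fs_id, ino)]   # every path has been found
        return entry["new_ino"]


db = HardLinkDB()
db.record_copy_up("fs0", 42, 3, 1001)      # first path triggers the copy up
second = db.lookup_on_open("fs0", 42)      # second path: link to inode 1001
pending_after_second = len(db.entries)
third = db.lookup_on_open("fs0", 42)       # third path: last one, entry removed
pending_after_third = len(db.entries)
```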
Currently, rename() of directories across branches
returns EXDEV, the error for trying to rename a file
across different file systems. User space usually handles this
transparently (since it already has to handle this case for
directories from different file systems) and falls back to copying the
contents of the directory over one by one. Implementing
recursive rename() of directories across branches in the
kernel is not a bright idea for the same reasons as rename across
regular file systems; probably returning EXDEV is the
best solution.
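The user-space fallback looks roughly like the sketch below. The move_tree() helper is hypothetical, and the rename function is passed in as a parameter purely so that the EXDEV path can be exercised on a single file system; Python's own shutil.move() performs a similar fallback.

```python
import errno
import os
import shutil
import tempfile


def move_tree(src, dst, rename=os.rename):
    """Rename a directory, copying it if the kernel returns EXDEV."""
    try:
        rename(src, dst)
    except OSError as err:
        if err.errno != errno.EXDEV:
            raise
        # Cross-branch (or cross-device) rename: copy, then remove.
        shutil.copytree(src, dst, symlinks=True)
        shutil.rmtree(src)


def fake_rename(src, dst):
    # Simulates what rename() does across union branches.
    raise OSError(errno.EXDEV, os.strerror(errno.EXDEV))


with tempfile.TemporaryDirectory() as base:
    src = os.path.join(base, "src")
    dst = os.path.join(base, "dst")
    os.mkdir(src)
    with open(os.path.join(src, "file.txt"), "w") as handle:
        handle.write("data")
    move_tree(src, dst, rename=fake_rename)
    moved = os.path.exists(os.path.join(dst, "file.txt"))
    source_gone = not os.path.exists(src)
```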
From a software engineering point of view, union mounts seem to be a
reasonable compromise between features and ease of maintenance. Most
of the VFS changes are isolated into fs/union.c, a file
of about 1000 lines. About 1/3 of this file is the
in-kernel readdir() implementation, which will almost
certainly be replaced by something else before any possible merge.
The changes to underlying file systems are fairly minimal and only
needed for file systems mounted as writable branches. The main
obstacle to merging this code is the readdir()
implementation. Otherwise, file system maintainers have been
noticeably more positive about union mounts than any other unioning
implementation.
A nice summary of union mounts can be found in
Bharata
B. Rao's union mount slides for FOSS.IN [PDF].
Coming next
In the next article, we'll review unionfs and aufs, and compare the
various implementations of unioning file systems for Linux. Stay
tuned!
By Jonathan Corbet
March 24, 2009
Packet filtering and firewalling have a long history in Linux. The first
filtering mechanism, called "ipfwadm," was released in 1995 for
the 1.2.1 kernel. This code was used until the 2.2.0 stable release
(January, 1999), when the new "ipchains" module took over. While ipchains
was useful, it only lasted until 2.4.0 (January, 2001), when it, too, was
replaced by iptables/netfilter, which remains in the kernel now. If
netfilter maintainer Patrick McHardy has his way, though, iptables, too, will be
gone in the future, replaced by yet another mechanism called
"nftables." This article will give an overview of how nftables works,
followed by a discussion of the motivations behind this change.
The first public nftables
release came out on March 18. This code has been in the works for
a while, though, and the ideas were discussed at the 2008 Netfilter Workshop.
So nftables is not quite as new as it might seem.
The current iptables code has a lot of protocol awareness built into it.
There is, for example, a module dedicated to extracting port numbers from
UDP packets which is different from the module concerned with TCP packets.
The nftables implementation is entirely different; there is no protocol
knowledge built into it at all. Instead, nftables is implemented as a
simple virtual machine which interprets code loaded from user space. So
nftables has no operation which says anything like "compare the IP
destination address to 192.168.0.1"; instead, it would execute code which
looks like:
payload load 4 offset network header + 16 => reg 1
compare reg 1 192.168.0.1
(Patrick presents the code in mnemonic form, and your editor will do the
same; the actual code loaded into the kernel uses opcodes
instead). The first line loads four bytes from the packet,
located 16 bytes past the beginning of the network header, into
register 1. The second line then compares that register against the
given network address.
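To make the two-instruction program concrete, here is a toy interpreter for it. This is a sketch under stated assumptions: the tuple encoding of instructions, the run() function, and the idea that the packet buffer begins at the IPv4 header are all invented for the example and say nothing about the real nftables opcodes.

```python
import ipaddress

NETWORK_HEADER = 0  # assumption: the buffer starts at the IPv4 header


def run(program, packet):
    """Interpret a tiny protocol-agnostic rule; True means it matched."""
    registers = {}
    for instruction in program:
        opcode = instruction[0]
        if opcode == "payload_load":
            _, length, offset, register = instruction
            start = NETWORK_HEADER + offset
            registers[register] = packet[start:start + length]
        elif opcode == "compare":
            _, register, expected = instruction
            if registers[register] != expected:
                return False
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return True


def ipv4_header(source, destination):
    """Build a minimal 20-byte IPv4 header with only addresses filled in."""
    header = bytearray(20)
    header[0] = 0x45
    header[12:16] = ipaddress.IPv4Address(source).packed
    header[16:20] = ipaddress.IPv4Address(destination).packed
    return bytes(header)


program = [
    ("payload_load", 4, 16, 1),   # 4 bytes at network header + 16 => reg 1
    ("compare", 1, ipaddress.IPv4Address("192.168.0.1").packed),
]
matched = run(program, ipv4_header("10.0.0.1", "192.168.0.1"))
missed = run(program, ipv4_header("10.0.0.1", "192.168.0.2"))
```

Note that the interpreter itself knows nothing about IP; the offset 16 and the length 4 carry all of the protocol knowledge.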
The language can do a lot more than just comparing addresses, of course.
There is, for example, a set lookup feature. Consider the following:
payload load 4 offset network header + 16 => reg 1
set lookup reg 1 load result in verdict register
{ "192.168.0.1" : jump chain1,
"192.168.0.2" : drop,
"192.168.0.3" : jump chain2 }
This code will cause packets aimed at 192.168.0.2 to be dropped; for the
other two listed addresses, control will be sent to specific rule chains.
This set feature allows for multi-branch rules in a way which cannot be
done with the current iptables implementation (though the ipset mechanism helps in that
regard).
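The set lookup above can be thought of as a single dictionary lookup whose result is the verdict. The following Python sketch shows the idea; the VERDICTS table, the lookup_verdict() function, and the "continue" default for addresses not in the set are all assumptions made for illustration, not details taken from the nftables code.

```python
import socket

# Hypothetical verdict map mirroring the set shown above.
VERDICTS = {
    socket.inet_aton("192.168.0.1"): ("jump", "chain1"),
    socket.inet_aton("192.168.0.2"): ("drop", None),
    socket.inet_aton("192.168.0.3"): ("jump", "chain2"),
}

def lookup_verdict(packet, network_offset=0):
    """Load the IPv4 destination address and look it up in the set.

    Addresses not in the set yield ("continue", None); that default is an
    assumption of this sketch.
    """
    start = network_offset + 16
    daddr = packet[start:start + 4]
    return VERDICTS.get(daddr, ("continue", None))

# A 20-byte stand-in for an IPv4 header: only the destination field is filled.
packet = bytes(16) + socket.inet_aton("192.168.0.2")
verdict = lookup_verdict(packet)
```

One lookup replaces what would otherwise be three separate address tests, which is the multi-branch behavior described above.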
The above code also introduces the "verdict register," which records an
action to be performed on a packet. In nftables, more than one verdict can
be rendered on a packet; it is possible to add a packet to a specific counter,
log it, and drop it all in a single chain without the need (as seen in
iptables) to repeat tests.
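To make the "test once, act several times" idea concrete, a rule can be modeled as one match followed by a list of actions. This Python sketch is only a model of the behavior described above; apply_rule(), the action names, and the state dictionary are hypothetical.

```python
def apply_rule(packet_daddr, match_daddr, actions, state):
    """Test the packet once, then apply every listed action in order.

    Returns the final verdict: "drop" if a drop action was reached,
    otherwise "continue".
    """
    if packet_daddr != match_daddr:
        return "continue"
    for action in actions:
        if action == "counter":
            state["packets"] += 1
        elif action == "log":
            state["log"].append(packet_daddr)
        elif action == "drop":
            return "drop"
        else:
            raise ValueError(f"unknown action: {action}")
    return "continue"

state = {"packets": 0, "log": []}
verdict = apply_rule("192.168.0.2", "192.168.0.2",
                     ["counter", "log", "drop"], state)
```

The address comparison happens exactly once, yet the packet is counted, logged, and dropped.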
There are a number of other capabilities built into the nftables virtual
machine. There's a set of operations for communicating with the
connection-tracking mechanism, allowing connection information to be used
in deciding the fate of specific packets. Other operators deal with
various bits of packet metadata known to the networking subsystem; these
include the length, the protocol type, security mark information, and
more. Operators exist for logging packets and incrementing counters.
There's also a full set of comparison operations, of course.
Network administrators are unlikely to be impressed by the idea of
programming a low-level virtual machine for their future firewalling
needs. The good news is that there will be no need for them to do so.
Instead, they'll write higher-level rules which will then be compiled into
virtual machine code before being loaded into the kernel. The nftables
utility does this work, implementing a human-readable language
encapsulating most of the needed information about how packets are put
together. So, if we look back to the first test described above:
payload load 4 offset network header + 16 => reg 1
compare reg 1 192.168.0.1
The administrator would simply write "ip daddr 192.168.0.1" and
let nftables turn that into the above code. A full (if simple)
rule looks something like this:
rule add ip filter output ip daddr 192.168.0.1 counter
This rule will count packets sent to 192.168.0.1.
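The translation from "ip daddr 192.168.0.1" to virtual machine code can be sketched as a table lookup followed by code generation. The field table, the compile_match() function, and the tuple-based instruction format below are assumptions made for this example; the real nftables utility has its own grammar and emits binary opcodes.

```python
import socket

# Hypothetical field table: name -> (offset from network header, length).
# The offsets are those of the standard IPv4 header.
IPV4_FIELDS = {
    "saddr": (12, 4),
    "daddr": (16, 4),
}

def compile_match(expr):
    """Compile a match such as "ip daddr 192.168.0.1" into VM instructions."""
    proto, field, value = expr.split()
    if proto != "ip":
        raise ValueError(f"unsupported protocol in this sketch: {proto}")
    offset, length = IPV4_FIELDS[field]
    return [
        ("payload_load", length, offset, 1),
        ("compare", 1, socket.inet_aton(value)),
    ]

code = compile_match("ip daddr 192.168.0.1")
```

All of the protocol knowledge sits in the user-space table; supporting another header field means adding a table entry rather than changing the kernel.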
The new nftables API is based on netlink, naturally. Unlike the current
iptables API, it has the ability to modify individual rules without the
need to reload the entire configuration. There is also a decompilation
facility built into nftables that allows the recreation of
human-readable rules from the current in-kernel configuration.
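Decompilation is essentially that translation run in reverse: recognize a load/compare pair and map the offset back to a field name. The sketch below assumes the same hypothetical tuple-based instruction format, with decompile_match() and FIELDS_BY_LOCATION invented for illustration.

```python
import socket

# Hypothetical reverse table: (offset, length) -> IPv4 field name.
FIELDS_BY_LOCATION = {
    (12, 4): "saddr",
    (16, 4): "daddr",
}

def decompile_match(program):
    """Turn a payload_load/compare pair back into a readable match."""
    load, compare = program
    load_op, length, offset, load_reg = load
    cmp_op, cmp_reg, value = compare
    if load_op != "payload_load" or cmp_op != "compare" or load_reg != cmp_reg:
        raise ValueError("unrecognized instruction sequence")
    field = FIELDS_BY_LOCATION[(offset, length)]
    return f"ip {field} {socket.inet_ntoa(value)}"

text = decompile_match([
    ("payload_load", 4, 16, 1),
    ("compare", 1, socket.inet_aton("192.168.0.1")),
])
```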
All told, it looks like a nicely-designed packet filtering mechanism, but the
merging of nftables is likely to be controversial. The iptables
mechanism works well, and is widely used; replacing it with code which
breaks the user-space API and breaks all existing iptables
configurations is guaranteed to raise some eyebrows. This could be a
disruptive and expensive transition, even if, as seems necessary, the
developers commit to maintaining both iptables and nftables in the mainline for an extended
period of time. The kernel development community will
want to see some very good reasons for inflicting this pain on its users.
There are some good reasons, but one should start by noting that it should
be possible to create a tool which reads current iptables configurations
and converts them to the nftables language - or even directly to kernel
virtual machine code. Patrick seems to expect to create such a tool One Of
These Days, but it does not exist at this time.
Some of the reasons for replacing iptables have already been hinted at above. The protocol
knowledge built into the iptables code has turned out to be a problem over
time; there is a lot of duplicated code doing the same thing (extracting
port numbers, say) for different protocols. Even worse, the capabilities
and syntax tend to vary from one protocol to the next. By moving all of
that knowledge out to user space, nftables greatly simplifies the in-kernel
code and allows for much more consistent treatment of all protocols.
There are a lot of optimization possibilities built into the new system.
Some expensive operations (incrementing counters, for example) can be
skipped unless the user really needs them.
Features like set lookups and range mapping can collapse a whole set of
iptables rules into a single nftables operation. Since filtering rules are
now compiled, there is also potential for the compiler to optimize the
rules further. Traditional firewall configurations tend to perform the same
tests repeatedly; a smart nftables compiler could eliminate much of
that duplicated work. Unsurprisingly, this optimization remains on the "to
do" list for now, but the fact that all of this work is done in user space
will make it easy to add such features in the future.
The nftables tool will also be able to perform a higher level of validation
on the rules it is given, and it will be able to provide more useful
diagnostics than can be had from the iptables code.
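The kind of duplicate-work elimination mentioned above could, in its simplest form, look like the following Python sketch. It assumes the hypothetical tuple-based instruction format used for illustration here, a straight-line instruction sequence, and that only payload_load writes to registers; the compare_ge and compare_le operations are invented for the example. The article notes that no such optimizer exists yet, so this is purely a sketch of the idea.

```python
import socket

def eliminate_redundant_loads(program):
    """Drop a payload_load when its register already holds the same bytes.

    Assumes straight-line code in which only payload_load modifies registers.
    """
    held = {}  # register -> (length, offset) currently loaded there
    optimized = []
    for insn in program:
        if insn[0] == "payload_load":
            _, length, offset, reg = insn
            if held.get(reg) == (length, offset):
                continue  # same data is already in the register
            held[reg] = (length, offset)
        optimized.append(insn)
    return optimized

# A naively generated range test that loads the destination address twice.
naive = [
    ("payload_load", 4, 16, 1),
    ("compare_ge", 1, socket.inet_aton("192.168.0.1")),
    ("payload_load", 4, 16, 1),
    ("compare_le", 1, socket.inet_aton("192.168.0.9")),
]
optimized = eliminate_redundant_loads(naive)
```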
But, arguably, the most important motivation is the ability to dump the
current ABI.
The iptables ABI has become an increasing impediment to development over
time. It includes protocol-specific fields which have made it hard to
extend; that is part of why there are actually three copies of the iptables
code in the kernel. When developers wanted to implement arptables and
ebtables, they essentially had to copy the code and bang it into a new,
protocol-specific shape. Patrick estimates that, even after four years of
unification work, the kernel contains some 10,000 lines of duplicated
filtering code. Beyond that, the structures used in the ABI are also used
directly in the kernel's internal representation, making that
implementation even harder to change. Separating the two would be possible
through the addition of a translation layer, but the details involved
(including the need to translate in both directions) increase the risk of
adding subtle problems. In summary, the iptables ABI has
become a serious impediment to further progress in packet filtering.
Nftables is a chance to dump all of that code and replace it with a much
smaller filtering core which should prove to be quite a bit more
flexible. With any luck, nftables should last a long time; the virtual
machine can be extended in unexpected ways without the need to break
the user-space ABI (again). Its smaller size should make it well suited
to small router deployments, while its lockless design should appeal to
administrators of high-end systems. All told, chances are good that the
larger community will eventually see this change as being worthwhile. But
not for a while: there are some unfinished pieces in nftables, and the
larger discussion has not yet begun.
(For more information, see this weblog
posting from August, 2008 and the slides
from Patrick's presentation [ODF] at the Netfilter Workshop.)
Patches and updates
Kernel trees
Build system
Core kernel code
Development tools
- Roland McGrath: utrace .
(March 21, 2009)
Device drivers
Filesystems and block I/O
Janitorial
Memory management
Networking
Security-related
Virtualization and containers
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet