Brief items
The current development kernel is 3.2-rc3,
released on November 23. "Anyway,
whether you will be stuffing yourself with turkey tomorrow or not, there's
a new -rc out. I'd love to say that things have been calming down, and that
the number of commits just keep shrinking, but I'd be lying. -rc3 is
actually bigger than -rc2, mainly due to a network update (none in -rc2)
and with Greg doing his normal usb/driver-core/tty/staging thing."
Linus appears to be back on the Wednesday release schedule, so expect -rc4
sometime shortly after this page is published.
Stable updates: the 2.6.32.49, 3.0.11, and 3.1.3 stable kernel updates were released on
November 28; they contained a long list of fixes and (for 3.x) one bit
of USB driver breakage. The 3.0.12 and 3.1.4 updates came out shortly thereafter with
one patch to fix that problem.
Modules are evil. They are a security issue, and they encourage a
"distro kernel" approach that takes forever to compile. Just say
no. Build a lean and mean kernel that actually has what you need,
and nothing more. And don't spend stupid time compiling modules you
won't need.
--
Linus Torvalds
I listen to songs like "Believe" by Cher sometimes. In my head,
it's a song about the lonely life of a NMI watchdog handler
(seriously). That is, if an NMI watchdog handler could express the
sheer and utter loneliness of its existence.
--
Jon Masters
DM-Steg is a kernel module
that adds steganographic encryption to the device mapper. "Steganographic"
means that the encrypted data is hidden to the point that its very
existence can be denied. "Steg works with substrates (devices
containing ciphertext) to export plaintext-containing block devices, known
as aspects, to the user. Without having the key(s), there is no way of
determining how many aspects a substrate contains, or if it contains any
aspects at all." The initial release of this module has just been
announced.
"The code has only ever been tested on my PC, but it works very
nicely for me and has stopped eating my data, so I figure it's ready for
public consumption!" See
this document [PDF] for details.
Kernel development news
By Jonathan Corbet
November 30, 2011
Visitors to the features page on the Open vSwitch web site
may be forgiven if they do not immediately come away with a good
understanding of what this package does. The feature list is full of
enlightening bullet points like "LACP (IEEE 802.1AX-2008)", "802.1ag link
monitoring", and "Multi-table forwarding pipeline with flow-caching
engine". Behind the acronyms, Open vSwitch is a virtual switch that has
already seen a lot of use in the Xen community and which is applicable to
most other virtualization schemes as well. After some years as an
out-of-tree project, Open vSwitch has recently made a push for inclusion
into the mainline kernel.
Open vSwitch is a network switch; at its lowest level, it is concerned with
routing packets between interfaces. It is aimed at virtualization users,
so, naturally, it is used in the creation of virtual networks. A switch
can be set up with a number of virtual network interfaces, most of which
are used by virtual machines to communicate with each other and the wider
world. These virtual networks can be connected across hosts and across
physical networks. One of the key features of Open vSwitch appears to be
the ability to easily migrate virtual machines between physical hosts and
have their network configuration (addresses, firewall rules, open
connections, etc.) seamlessly follow.
Needless to say, there is no shortage of features beyond making it easier
to move guests around. Open vSwitch offers a long list of options for access
control, quality-of-service control, network bridging, traffic monitoring,
and more. The OpenFlow protocol is
supported, allowing the integration of interesting protocols and
controllers into the network. Open vSwitch has been shipped as part of a
number of products and it shows; it has the look of a polished, finished
offering.
Most of Open vSwitch is implemented in user space, but there is one kernel
module that makes the whole thing work; that module was submitted for review in mid-November. Open
vSwitch tries to make use of existing networking features to the greatest
extent possible; the kernel module mostly implements a control interface
allowing the user-space code to make routing decisions. Routing packets
through user space would slow things down considerably, so the interface is
set up to avoid the user-space round trip whenever possible.
When the Open vSwitch module receives a packet on one of its interfaces, it
generates a "flow key" describing the packet in general terms. An example
key from the submission is:
in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4),
eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=17, tos=0,
frag=no), tcp(src=49163, dst=80)
Most of the fields should be fairly self-explanatory; this key describes a
packet that arrived on port (interface) 1, aimed at TCP port 80
on host 172.18.0.52.
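To make the structure of such a key concrete, here is a minimal sketch that splits the textual form shown above into its fields. The parser is purely illustrative: the function name and the dictionary layout are assumptions made for this example, and the kernel module works with a binary representation rather than with this printed form.

```python
import re

# The example key from the text, joined onto one line.
KEY = ("in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4), "
       "eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=17, "
       "tos=0, frag=no), tcp(src=49163, dst=80)")

def parse_flow_key(text):
    """Split 'name(args), name(args)' into {name: value-or-dict}."""
    key = {}
    for name, body in re.findall(r"(\w+)\(([^)]*)\)", text):
        if "=" in body:
            # Fields such as eth(...) carry several name=value pairs.
            key[name] = dict(pair.strip().split("=", 1)
                             for pair in body.split(","))
        else:
            # Fields such as in_port(1) carry a single bare value.
            key[name] = body.strip()
    return key

flow = parse_flow_key(KEY)
print(flow["in_port"], flow["ipv4"]["dst"], flow["tcp"]["dst"])
```

The three values printed are the ones called out in the text: the input port, the destination host, and the destination TCP port.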
If Open vSwitch does not know how to process the packet, it will pass it to
the user-space daemon, along with the generated flow key. The daemon can
then decide what should be done; it will also, normally, pass a rule back
to the kernel describing how to handle related packets in the future.
These rules start with the flow key, which may be generalized somewhat, and
include a set of associated actions. Possible actions include:
- Output the packet to a specific port, forwarding it on its way
to its final destination.
- Send the packet to user space for further consideration. The
destination process may or may not be the main Open vSwitch control
daemon.
- Make changes to the packet header on its way through; network address
translation could be implemented this way, for example.
- Add an 802.1Q virtual LAN header in preparation for tunneling the
packet to another host; there is also an action for stripping such
headers at the receiving end.
- Record attributes of the packet for statistics generation.
Once a rule for a given type of packet has been installed into the kernel,
future packets can be routed quickly without the need for further
user-space intervention. If the switch is working properly, most packets
should never need to go through the control daemon.
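The division of labor described above can be sketched as a small cache in front of a slower decision function. This is a toy model under stated assumptions, not the Open vSwitch code: the class, the tuple-shaped key, the action tuples, and the daemon's policy are all hypothetical, and the table matches keys exactly where real rules may generalize them.

```python
class FlowTable:
    """Toy model of the kernel-side rule cache; all names are illustrative."""

    def __init__(self, upcall):
        self.rules = {}        # flow key -> list of actions
        self.upcall = upcall   # stand-in for the user-space daemon
        self.misses = 0

    def handle(self, key):
        actions = self.rules.get(key)
        if actions is None:
            # Unknown flow: ask "user space" and cache the answer.
            self.misses += 1
            actions = self.upcall(key)
            self.rules[key] = actions
        return actions

def daemon(key):
    """Hypothetical policy: web traffic goes out port 2, the rest is dropped."""
    in_port, dst_ip, dst_port = key
    return [("output", 2)] if dst_port == 80 else []

table = FlowTable(daemon)
key = (1, "172.18.0.52", 80)
first = table.handle(key)    # miss: goes through the daemon
second = table.handle(key)   # hit: answered from the cached rule
print(first, second, table.misses)
```

Only the first packet of a flow pays for the trip through the daemon; every later packet with the same key is handled from the installed rule.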
Open vSwitch, by all appearances, is a useful and powerful mechanism; the
networking developers seem to agree that it would be a good addition to the
kernel. There is, however, some disagreement over the implementation. In
particular, the patch adds a new packet classification and control
mechanism, but the kernel already has a traffic control system of its own;
duplicating that infrastructure is not a popular idea. As Jamal Hadi Salim
put it:
You are replicating a lot of code and semantic that exist (not just
on classifier actions). You could improve the existing
infrastructure instead.
Jamal suggested that Open vSwitch could add a special-purpose classifier
for its own needs, but that classifier should fit into the existing traffic
control subsystem.
That said, there seems to be some awareness within the networking community
that the kernel's traffic controller may not quite be up to the task. Eric
Dumazet noted that its scalability is not
what it could be and that the code reflects its age; he said: "Maybe
its time to redesign a new model, based on modern techniques."
Others seemed to agree with this assessment. The traffic controller, it
appears, is in need of serious improvements or replacement regardless of
what happens with Open vSwitch.
The fact that the traffic controller is not everything Open vSwitch needs
will not normally be considered an adequate justification for duplicating
its infrastructure, though. The obvious options available to the Open
vSwitch developers will be to (1) improve the traffic controller to
the point that it does work, or (2) position the Open vSwitch
controller as a plausible long-term replacement. Neither task is likely to
be easy. The outcome of this discussion may well be that developers who
were hoping to merge their existing code will find themselves tasked with a
fair amount of infrastructural work.
That can be the point where those developers take option (3): go away and
continue to maintain their code out of tree. Requiring extra work from
contributors can cause them to simply give up. But if the networking
maintainers accept duplicated subsystems, the likely outcome is a lot of
wasted work and multiple implementations of the same functionality, none of
which is as good as it should be. There are solid reasons behind the
maintainers' tendency to push back against that kind of contribution;
without that pushback, the long-term maintainability of the kernel will
suffer.
How things will be resolved in the case of Open vSwitch remains to be
seen; the discussion is ongoing as of this writing. Open vSwitch is a
healthy and active project; it may well have the resources and the desire
to perform the necessary work to get into the mainline and ease its own
long-term maintenance burden. Meanwhile, as was discussed at the 2011
Kernel Summit, code that is being shipped and used has value; sometimes it
is best to get it into the mainline and focus on improving it afterward.
Some developers (such as Herbert Xu) seem
to think that may be the best approach to take in this case. So Open
vSwitch may yet find its way into the mainline in the near future with the
idea that its internals can be fixed up thereafter.
By Jonathan Corbet
November 29, 2011
Once upon a time, a "system on chip" (SOC) was a package containing a
processor and some number of I/O controllers. While SOCs still have all
that, manufacturers have been busy adding hardware support for all kinds of
interesting functionality. For example, OMAP4 processors have an onboard
face detection module that can be used for camera focus control, "face
unlock" features, and more. Naturally, there is interest in making use of
such features in Linux; a recent driver submission shows that the question
of just how to do that has not yet been answered, though.
The OMAP4 face detection driver was
submitted by Tom Leiming, but was apparently written by Ming Lei. Upon
initialization, the driver allocates a memory area which is made available
to an application via mmap(). The application places an image in
that area (it seems that a 320x240 grayscale PGM image is the only supported
option), then uses a number of ioctl() operations to specify the
area of interest and to start and stop the face detection process. A
read() on the device will, once detection is complete, yield a
number of structures describing the locations of the faces in the image as
rectangles.
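For a sense of how small that input is, the sketch below builds a 320x240 grayscale image in the binary ("P5") PGM layout. It is only an illustration of the format named above: whether the driver expects the PGM header or just the raw pixels is not stated here, the helper name is made up for this example, and nothing in it talks to the device.

```python
WIDTH, HEIGHT, MAXVAL = 320, 240, 255

def make_pgm(pixels):
    """Wrap WIDTH*HEIGHT 8-bit gray values in a binary (P5) PGM header."""
    if len(pixels) != WIDTH * HEIGHT:
        raise ValueError("expected %d pixels" % (WIDTH * HEIGHT))
    header = b"P5\n%d %d\n%d\n" % (WIDTH, HEIGHT, MAXVAL)
    return header + bytes(pixels)

# A flat mid-gray test frame; a real application would use camera data.
image = make_pgm([128] * (WIDTH * HEIGHT))
print(len(image))
```

At one byte per pixel the whole frame comes to about 75KB, which helps explain why a single preallocated buffer shared via mmap() is workable.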
Face detection functionality is clearly welcome, but this particular
driver has a lot of problems and will not get into the mainline in anything
resembling its current state. The most significant criticism, though, came
from Alan Cox, who asked that, rather than
being implemented as a standalone device, face detection be integrated
into the Video4Linux2 framework.
In truth, V4L2 is probably the right place for this feature. Face
detection is
generally meant to be used with the camera controller integrated into the
same SOC and the face detection hardware may be tightly tied to that
controller. The media controller subsystem was designed for
just this kind of functionality; it provides a mechanism by which camera
data may (or may not) be routed to the face detection module as needed.
Integration into V4L2 would bring the face detection module under the same
umbrella as the rest of the video processing hardware and export the
necessary data routing capabilities to user space.
The design of the user-space interface for this functionality seems likely to
pose challenges of its own, though. The OMAP4 hardware is
relatively simple in its operation; it appears to lack the ability to
work with multiple image formats, even moderately high-resolution images,
or color data. Future hardware will certainly not be so limited. It is
also not hard to imagine a shift from detection of any face to
recognition of specific faces - or, at least, the generation of metrics to
ease the association of faces and the identities of their owners. The
hardware could become capable of blink detection, distinguishing real faces
from pictures of faces, or determining when a face belongs to a poker
player who is bluffing. Designing an API that can handle this kind of
functionality is going to be an interesting task.
But it does not stop there. There is a discouragingly large
market out there for devices capable of reading automobile license plates,
for example. There is money in meeting the needs of the contemporary
surveillance state, so manufacturers will certainly compete to provide the
needed capabilities. In general, the world is filled with interesting
things that are not faces; it is not hard to imagine that people will be
able to do useful things with devices that can pick all kinds of high-level
objects out of image data.
More broadly, we may be seeing a shift in what kinds of peripherals are
attached to our processors. There will always be plenty of devices that
serve essentially (from the CPU's point of view) as channels moving chunks
of data in one direction or the other. But there will be more and more
devices that offload some type of processing, and that is going to present
some interesting ABI challenges.
Hardware-based offload engines are nothing new, of course. But, once upon
a time, offload
devices mostly performed tasks otherwise handled by the operating system
kernel. Integrated controllers and network protocol offload functionality
are a couple of obvious examples. More recently, though, hardware has
provided functionality that needs to be made available to user space. And
that changes the game somewhat.
If one looks for examples of this kind of functionality, one almost
certainly needs to start at the GPU found in most graphics cards. Creating
a workable (and stable) user-space ABI providing access to the GPU has
taken many years, and it is not clear that the job is done yet. The media
controller ABI controls routing of data among the numerous interesting
functional units in contemporary video processors, but writing a
hardware-independent application using the media controller is hard.
Creating a workable interface for the wide variety of available industrial
sensors has also been a multi-year project.
Trying to anticipate where this kind of hardware will go in an attempt to
create the perfect ABI from the outset seems like an exercise in futility.
Most likely it will have to be done the way we've always done it: come up
with something that seems reasonable, learn (the hard way) what its
shortcomings are, then begin the long process of replacing it with
something better. It is not an ideal way to create an operating system,
but it seems to be better than the alternatives. Figuring out the best way
to support face detection will just be another step in this ongoing
process.
By Jonathan Corbet
November 29, 2011
It may be tempting to see ext4 as last year's filesystem. It is solid and
reliable, but it is based on an old design; all the action is to be found
in next-generation filesystems like Btrfs. But it may be a while until
Btrfs earns the necessary level of confidence in the wider user community;
meanwhile, ext4's growing user base has not lost its appetite for
improvement. A few recently-posted patch sets show that the addition
of new features to ext4 has not stopped, even as that filesystem settles in
for a long period of stable deployments.
Bigalloc
In the early days of Linux, disk drives were still measured in megabytes
and filesystems worked with blocks of 1KB to 4KB in size. As this article
is being written,
terabyte disk drives are not quite as cheap as they recently were, but the
fact remains: disk drives have gotten a lot larger, as have the files
stored on them. But the ext4 filesystem still deals in 4KB blocks of data.
As a
result, there are a lot of blocks to keep track of, the associated
allocation bitmaps have grown, and the overhead of managing all those
blocks is significant.
Raising the filesystem block size in the kernel is a dauntingly difficult
task involving major changes to memory management, the page cache, and
more. It is not something anybody expects to see happen anytime soon. But
there is nothing preventing filesystem implementations from using larger
blocks on disk. As of the 3.2 kernel, ext4 will be capable of doing
exactly that. The "bigalloc" patch set adds the concept of "block clusters"
to the filesystem; rather than allocate single blocks, a filesystem using
clusters will allocate them in larger groups. Mapping between these larger
blocks and the 4KB blocks seen by the core kernel is handled entirely
within the filesystem.
The cluster size to use is set by the system administrator at filesystem
creation time (using a development version of e2fsprogs), but it must be a
power of two. A 64KB cluster size may make sense in a lot of situations;
for a filesystem that holds only very large files, a 1MB cluster size might
be the right choice. Needless to say, selecting a large cluster size for a
filesystem dominated by small files may lead to a substantial amount of
wasted space.
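The tradeoff can be put into rough numbers. The following back-of-the-envelope sketch assumes one bitmap bit per allocation unit and ignores block groups and every other on-disk structure, so it should be read as an estimate of scale rather than a description of the actual ext4 layout; the function names are invented for the example.

```python
KB, MB, TB = 1 << 10, 1 << 20, 1 << 40

def bitmap_bytes(fs_size, unit_size):
    """Bytes of allocation bitmap at one bit per allocation unit."""
    units = fs_size // unit_size
    return (units + 7) // 8

def wasted_bytes(file_size, unit_size):
    """Slack left in the last allocation unit of a file."""
    return -file_size % unit_size

for unit in (4 * KB, 64 * KB, 1 * MB):
    print(unit // KB, "KB units:",
          bitmap_bytes(1 * TB, unit) // KB, "KB of bitmap per TB,",
          wasted_bytes(1000, unit), "bytes lost on a 1000-byte file")
```

Under these assumptions, moving from 4KB blocks to 64KB clusters shrinks the bitmap for a terabyte from 32MB to 2MB, while a 1000-byte file goes from wasting about 3KB to wasting about 63KB.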
Clustering reduces the space overhead of the block bitmaps and other
management data structures. But, as Ted Ts'o documented back in July, it can also increase
performance in situations where large files are in use. Block allocation
times drop significantly, but file I/O performance also improves in general
as the result of reduced on-disk fragmentation. Expect this feature to
attract a lot of interest once the 3.2 kernel (and e2fsprogs 1.42) make
their way to users.
Inline data
An inode is a data structure describing a single file within a filesystem.
For most filesystems, there are actually two types of inode: the
filesystem-independent in-kernel
variety (represented by struct inode), and the
filesystem-specific on-disk version. As a general rule, the kernel cannot
manipulate a file in any way until it has a copy of the inode, so inodes,
naturally, are the focal point for a lot of block I/O.
In the ext4 filesystem, the size of on-disk inodes can be set when a
filesystem is created. The default size is 256 bytes, but the on-disk
structure (struct ext4_inode) only requires about half of that
space. The remaining space after the ext4_inode structure is
normally used to hold extended attributes. Thus, for example, SELinux
labels can be found there. On systems where extended attributes are not
heavily used, the space between on-disk inode structures may simply go to
waste.
Meanwhile, space for file data is allocated in units of blocks, separately
from the inode. If a file
is very small (and, even on current systems, there are a lot of small
files), much of the block used to hold that file will be wasted. If the
filesystem is using clustering, the amount of lost space will grow even
further, to the point that users may start to complain.
Tao Ma's ext4 inline data patches may
change that situation. The idea is quite simple: very small files can be
stored directly in the space between inodes without the need to allocate a
separate data block at all. On filesystems with 256-byte on-disk inodes,
the entire remaining space will be given over to the storage of small
files. If the filesystem is built with larger on-disk inodes, only half of
the leftover space will be used in this way, leaving space for
late-arriving extended attributes that would otherwise be forced out of the
inode.
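The sizing rule in the previous paragraph can be written down as a short sketch. The 128-byte figure for the fixed structure is an assumption standing in for the "about half" of a 256-byte inode mentioned above, the function names are hypothetical, and the real patches will lose some of this space to extended-attribute bookkeeping.

```python
# Assumption: the fixed on-disk structure takes 128 bytes, i.e. the
# "about half" of a 256-byte inode mentioned in the text.
STRUCT_SIZE = 128

def inline_capacity(inode_size):
    """Bytes available for inline file data under the rule described above."""
    leftover = inode_size - STRUCT_SIZE
    if inode_size <= 256:
        return leftover          # all remaining space holds file data
    return leftover // 2         # keep half for extended attributes

def fits_inline(file_size, inode_size=256):
    """True if a file of this size could live inside the inode."""
    return file_size <= inline_capacity(inode_size)

print(inline_capacity(256), inline_capacity(512), fits_inline(100))
```

With these numbers, a default filesystem could hold files of up to roughly 128 bytes inline, and doubling the inode size to 512 bytes would raise that only to about 192 bytes because half of the leftover space stays reserved.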
Tao says that, with this patch set applied, the space required to store a
kernel tree drops by about 1%, and /usr gets about 3% smaller.
The savings on filesystems where clustering is enabled should be somewhat
larger, but those have not yet been quantified. There are a number of
details to be worked out yet - including e2fsck support and the potential
cost of forcing extended attributes to be stored outside of the inode - so
this feature is unlikely to be ready for inclusion before 3.4 at the
earliest.
Metadata checksumming
Storage devices are not always as reliable as we would like them to be;
stories of data corrupted by the hardware are not uncommon. For this
reason, people who care about their data make use of technologies like RAID
and/or filesystems like Btrfs which can maintain checksums of data and
metadata and ensure that nothing has been mangled by the drive. The ext4
filesystem, though, lacks this capability.
Darrick Wong's checksumming patch set does
not address the entire problem. Indeed, it risks reinforcing the old jest
that filesystem developers don't really care about the data they store as
long as the filesystem metadata is correct. This patch set seeks to
achieve that latter goal by attaching checksums to the various data
structures found on an ext4 filesystem - superblocks, bitmaps, inodes,
directory indexes, extent trees, etc. - and verifying that the checksums
match the data read from the filesystem later on. A checksum failure can
cause the filesystem to fail to mount or, if it happens on a mounted
filesystem, remount it read-only and issue pleas for help to the system
log.
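The general technique is easy to sketch: store a checksum alongside each structure when it is written, then recompute and compare when it is read back. The example below uses the CRC-32 from Python's zlib module purely as a stand-in; the algorithm, the field placement, and the function names are assumptions for illustration and are not taken from the patch set.

```python
import struct
import zlib

def seal(payload):
    """Append a 32-bit checksum of payload (CRC-32 chosen for illustration)."""
    return payload + struct.pack("<I", zlib.crc32(payload))

def verify(block):
    """Return the payload if its stored checksum matches, else raise."""
    payload, stored = block[:-4], struct.unpack("<I", block[-4:])[0]
    if zlib.crc32(payload) != stored:
        raise IOError("metadata checksum mismatch")
    return payload

good = seal(b"example inode contents")
bad = bytes([good[0] ^ 0x01]) + good[1:]   # flip one bit "on disk"
print(verify(good))
```

Calling verify(bad) raises an error instead of returning corrupted contents, which is the point at which a filesystem would refuse to mount or fall back to read-only operation.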
Darrick makes no mention of any plans to add checksums for data as well.
In a number of ways, that would be a bigger set of changes; checksums are
relatively easy to add to existing metadata structures, but an entirely new
data structure would have to be added to the filesystem to hold data block
checksums. The performance impact of full-data checksumming would also be
higher. So, while somebody might attack that problem in the future, it
does not appear to be on anybody's list at the moment.
The changes to the filesystem are
significant, even for metadata-only checksums,
but the bulk of the work
actually went into e2fsprogs. In particular, e2fsck gains the ability to
check all of those checksums and, in some cases, fix things when the
checksum indicates that there is a problem. Checksumming can be
enabled with mke2fs and toggled with tune2fs. All told, it is a lot of
work, but it should help to improve confidence in the filesystem's
structure. According to Darrick, the overhead of the checksum calculation
and verification is not measurable in most situations. This feature has
not drawn a lot of comments this time around, and may be close to ready for
inclusion, but nobody has yet said when that might happen.
Page editor: Jonathan Corbet