Brief items
The current development kernel is 3.2-rc1, released on November 7. "Have
fun, give it a good testing. There shouldn't be anything hugely scary in
there, but there *is* a lot of stuff. The fact that 3.1 dragged out did mean
that this ended up being one of the bigger merge windows, but I'm not feeling
*too* nervous about it." There's a new code name - Saber-toothed squirrel -
to go with it.
Stable updates: the 2.6.32.47 and 2.6.33.20 stable updates were released on
November 7; both contain a long list of important fixes. 2.6.32.48 was released on November 8 to
fix some build problems introduced in 2.6.32.47. 2.6.33 users
should note that 2.6.33.20 is the final planned update for that kernel.
Crash test dummy folds.
KVM mafia wins.
Innovation cries.
--
Dan Magenheimer
Seriously, if someone gave me a tools/term/ tool that has
rudimentary xterm functionality with tabbing support, written in
pure libdri and starting off a basic fbcon console and taking over
the full screen, i'd switch to it within about 0.5 nanoseconds and
would do most of my daily coding there and would help out with
extending it to more apps (starting with a sane mail client
perhaps).
--
Ingo Molnar
By Jonathan Corbet
November 9, 2011
The second version of the plumber's wish list for Linux included a request
for support for usage quotas on the tmpfs filesystem. Current kernels have
no such support, making it easy for local users to execute
denial-of-service attacks by filling up /tmp or /dev/shm. Davidlohr Bueso
answered that call with a patch providing that support. But it turns out
that there is a disagreement over how tmpfs use limits should be managed.
Davidlohr's patch does not actually implement quotas; instead, it adds a
new resource limit (RLIMIT_TMPFSQUOTA) controlling how much space
a user can occupy on all
mounted tmpfs systems. This is the approach requested in the wish list; it
has some appeal because tmpfs is not a persistent filesystem. Normal
filesystem implementations store quotas on the filesystem itself, but tmpfs
cannot do that. So use of quotas would require that user space, in some
fashion, reload the quota database on every boot (or, depending on the
implementation, for every tmpfs mount). Resource limits look like a
simpler situation.
Even so, there is opposition to the resource limit approach. Developers
would rather see tmpfs behave like other filesystems. More to the point,
perhaps, users and applications have some clue, some of the time, about how to
respond to "quota exceeded" errors. The handling of exceeded resource limits,
by contrast, is on less solid ground. As Alan Cox pointed
out, loading the quotas need not be a big problem; it could be as
simple as a mount option setting a default quota for all users.
In the end, it seems unlikely that an implementation based on anything
other than disk quotas will be merged, so this patch will need to be
reworked.
Kernel development news
By Jonathan Corbet
November 8, 2011
Linus
announced the 3.2-rc1 release and
closed the merge window on November 7. During the two-week window,
some 10,214 non-merge changesets were pulled into the mainline kernel.
That is the most active merge window ever, surpassing the previous record
holder (2.6.30, at 9,603 changesets) by a fair amount. The delay in the
start of this development cycle will certainly have caused more work to
pile up, but there was also, clearly, just a lot of work going on.
User-visible changes merged since last week's
summary include:
- The device mapper has a new "thin provisioning" capability which,
among other things, offers improved snapshot support. This feature is
considered experimental in 3.2. See Documentation/device-mapper/thin-provisioning.txt
for information on how it works. Also added to the device mapper is a
"bufio" module that adds another layer of buffering between the system
and a block device; the thin provisioning code is the main user of
this feature.
- There is a new memory-mapped virtio device intended to allow
virtualized guests to use virtio-based block and network devices in
the absence of PCI support.
- It is now possible for a process to use poll() on files under
/proc/sys; the result is the ability to get a notification
when a specific sysctl parameter changes.
- The btrfs filesystem now records a number of previous tree roots which
can be useful in recovering damaged filesystems; see this article for more information. Btrfs
has also gained improved readahead support.
- The I/O-less dirty throttling patch
set has been merged; that should improve writeback performance for a
number of workloads.
- New drivers include:
  - Processors and systems: Freescale P3060 QDS boards and
    non-virtualized PowerPC systems.
  - Block: M-Systems Disk-On-Chip G3 MTD controllers.
  - Media: MaxLinear MXL111SF DVB-T demodulators,
    Abilis AS102 DVB receivers, and
    Samsung S5K6AAFX sensors.
  - Miscellaneous: Intel Sandybridge integrated memory controllers,
    Intel Medfield MSIC (audio/battery/GPIO...) controllers,
    IDT Tsi721 PCI Express SRIO (RapidIO) controllers,
    GPIO-based pulse-per-second clients, and
    STE hardware semaphores.
  - Graduations: the Conexant cx25821 V4L2 driver has
    moved from staging into the mainline.
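The poll() support for files under /proc/sys mentioned in the list above can be sketched from user space roughly as follows. This is a minimal illustration, not a tested recipe: the helper name wait_for_sysctl_change() is hypothetical, and the assumption that a change is reported through POLLERR and POLLPRI should be checked against the kernel in use.

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms milliseconds for the sysctl file at path to change.
 * Returns 1 if a change was signalled, 0 on timeout, -1 on error.
 * Assumption: changes are reported as POLLERR | POLLPRI, and only entries
 * whose kernel code supports polling will ever signal anything.
 */
int wait_for_sysctl_change(const char *path, int timeout_ms)
{
    struct pollfd pfd;
    int ret;

    assert(path != NULL);

    pfd.fd = open(path, O_RDONLY);
    if (pfd.fd < 0)
        return -1;
    pfd.events = POLLERR | POLLPRI;
    pfd.revents = 0;

    /* A negative timeout blocks until the kernel reports an event. */
    ret = poll(&pfd, 1, timeout_ms);
    close(pfd.fd);

    if (ret < 0)
        return -1;
    return (pfd.revents & (POLLERR | POLLPRI)) ? 1 : 0;
}
```

A caller might pass a path such as /proc/sys/kernel/hostname with a timeout of -1 to sleep until the value is rewritten; whether a given entry supports notification at all depends on the kernel code behind it.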
Changes visible to kernel developers include:
- The new GENHD_FL_NO_PART_SCAN device flag suppresses the
normal partition
scan when a new block device is added to the system.
- The venerable block layer function __make_request() has been
renamed to blk_queue_bio() and exported to modules.
- The TAINT_OOT_MODULE taint flag is now set when out-of-tree
modules are inserted into the kernel. Naturally, the module itself
tells the kernel about its provenance, so this mechanism can be
circumvented, but anybody trying to do that would certainly be caught
and publicly shamed sooner or later.
- A few macros (EXPORT_SYMBOL_* and THIS_MODULE) have
been split out of <linux/module.h> and placed in
<linux/export.h>. Code that only needs to export
symbols can now use the latter include file; the result is a reduction
in kernel compile time.
Despite the size of this development cycle, a number of trees ended up not
being pulled. Linus explicitly avoided those that were controversial
(FrontSwap and the KVM tool, for example);
others seem to have simply been passed over. Some may slip in for -rc2,
but, for the most part, the time has come to stabilize all of this code.
If the usual pattern holds, the 3.2 release can be expected sometime around
mid-January.
By Jonathan Corbet
November 8, 2011
The Linux kernel has long had the ability to regulate the CPU's voltage and
frequency for optimal behavior, where "optimal" is a function of both
performance and power consumption. But a system is more than just a CPU,
and there are many other components which are able to run at multiple
performance levels. It is unsurprising that a proper infrastructure for
managing device operating points has lagged that for the CPU, since the
amount of power to be saved is usually smaller. But now that CPU power
behavior is fairly well optimized, the power infrastructure is growing to
encompass the rest of the system. The 3.2 kernel will have a new set of
APIs intended to allow drivers to let the system find the best operating
level for the devices they manage.
There are three separate pieces to the dynamic voltage and frequency
scaling (DVFS) API, the first of which was
actually merged for the 2.6.37 release. The "operating performance points"
(OPP) module simply tracks the various operating levels available to a given
device; the API is declared in <linux/opp.h>. Briefly,
operating points are managed with:
    int opp_add(struct device *dev, unsigned long freq, unsigned long u_volt);
    int opp_enable(struct device *dev, unsigned long freq);
    int opp_disable(struct device *dev, unsigned long freq);
Operating points are enabled by default; a driver may disable specific
points to reflect temperature or performance concerns. There is a set of
functions for retrieving operating points above or below a given frequency,
useful for moving up or down the power/performance scale.
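The bookkeeping described above can be pictured with a small user-space model. Nothing here is kernel code: the model_opp names, the fixed-size table, and the lookup helper are all hypothetical, and the sketch only mirrors the behavior stated in the text (points are enabled when added, can be disabled individually, and can be searched for the nearest point at or above a frequency).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_OPPS 8

/* Hypothetical stand-in for one operating point. */
struct model_opp {
    unsigned long freq;    /* Hz */
    unsigned long u_volt;  /* microvolts */
    bool enabled;
};

struct model_opp_table {
    struct model_opp opps[MODEL_MAX_OPPS];
    size_t count;
};

/* Add an operating point, enabled by default; -1 if the table is full. */
int model_opp_add(struct model_opp_table *t, unsigned long freq,
                  unsigned long u_volt)
{
    assert(t != NULL);
    if (t->count >= MODEL_MAX_OPPS)
        return -1;
    t->opps[t->count].freq = freq;
    t->opps[t->count].u_volt = u_volt;
    t->opps[t->count].enabled = true;
    t->count++;
    return 0;
}

/* Enable or disable the point whose frequency matches exactly. */
int model_opp_set_enabled(struct model_opp_table *t, unsigned long freq,
                          bool enabled)
{
    size_t i;

    for (i = 0; i < t->count; i++) {
        if (t->opps[i].freq == freq) {
            t->opps[i].enabled = enabled;
            return 0;
        }
    }
    return -1;
}

/* Lowest enabled point with a frequency of at least freq, or NULL. */
const struct model_opp *model_opp_find_ceil(const struct model_opp_table *t,
                                            unsigned long freq)
{
    const struct model_opp *best = NULL;
    size_t i;

    for (i = 0; i < t->count; i++) {
        const struct model_opp *o = &t->opps[i];

        if (o->enabled && o->freq >= freq &&
            (best == NULL || o->freq < best->freq))
            best = o;
    }
    return best;
}
```

Disabling a point simply hides it from the lookup, which is how a driver in this model would rule out a frequency that has become unsafe because of temperature.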
A driver wanting to support DVFS on a specific device would start by
filling in one of these
structures (declared, along with the rest of the API, in
<linux/devfreq.h>):
    struct devfreq_dev_profile {
        unsigned long initial_freq;
        unsigned int polling_ms;
        int (*target)(struct device *dev, unsigned long *freq);
        int (*get_dev_status)(struct device *dev,
                              struct devfreq_dev_status *stat);
        void (*exit)(struct device *dev);
    };
Here initial_freq is, unsurprisingly, the original operating
frequency of the device. Almost everything else in this structure is there
to help frequency governors do their jobs. If polling_ms is
non-zero, it tells the governor how often to poll the device to get its
usage information; that polling will take the form of a call to
get_dev_status(). That function should fill the stat
structure with the relevant information:
    struct devfreq_dev_status {
        /* both since the last measure */
        unsigned long total_time;
        unsigned long busy_time;
        unsigned long current_frequency;
        void *private_data;
    };
The governor will use this information to decide whether the current
operating frequency should be changed or not. Should a change be needed,
the target() callback will be called to change the operating point
accordingly. This function should pick a frequency at least as high as the
passed-in *freq, then update *freq to reflect the actual
frequency chosen. The exit() callback gives the driver a chance
to clean things up if the DVFS layer decides to forget about the device.
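The division of labor between a governor and the target() callback can be illustrated with a toy user-space model. The names (model_status, model_governor(), model_target()), the frequency table, and the 90% and 30% load thresholds are all assumptions made for this sketch; they are not taken from any in-kernel governor.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the usage data a driver reports. */
struct model_status {
    unsigned long total_time;
    unsigned long busy_time;
    unsigned long current_frequency;
};

/* Frequencies the imaginary device supports, in ascending order (Hz). */
static const unsigned long model_freqs[] = {
    100000000UL, 200000000UL, 400000000UL
};
#define MODEL_NFREQS (sizeof(model_freqs) / sizeof(model_freqs[0]))

/*
 * Toy governor: request the highest frequency when the device was busy
 * more than 90% of the time, the lowest when it was busy less than 30%,
 * and otherwise leave the frequency alone. Thresholds are illustrative.
 */
unsigned long model_governor(const struct model_status *st)
{
    unsigned long load;

    assert(st != NULL);
    if (st->total_time == 0)
        return model_freqs[MODEL_NFREQS - 1];

    load = (unsigned long)(((unsigned long long)st->busy_time * 100) /
                           st->total_time);
    if (load > 90)
        return model_freqs[MODEL_NFREQS - 1];
    if (load < 30)
        return model_freqs[0];
    return st->current_frequency;
}

/*
 * Toy target(): pick the lowest supported frequency that is at least
 * *freq (or the highest one if none is), then report the choice back
 * through *freq.
 */
int model_target(unsigned long *freq)
{
    size_t i;

    for (i = 0; i < MODEL_NFREQS; i++) {
        if (model_freqs[i] >= *freq) {
            *freq = model_freqs[i];
            return 0;
        }
    }
    *freq = model_freqs[MODEL_NFREQS - 1];
    return 0;
}
```

The point of the split is that the governor only reasons about load, while the driver-side callback knows which frequencies the hardware can actually run at and rounds the request up accordingly.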
Once the devfreq_dev_profile structure is filled in, the driver
registers it with:
    struct devfreq *devfreq_add_device(struct device *dev,
                                       struct devfreq_dev_profile *profile,
                                       const struct devfreq_governor *governor,
                                       void *data);
If need be, a driver can supply its own governor to manage frequencies, but
the kernel supplies a few of its own: devfreq_powersave (keeps
the frequency as low as possible), devfreq_performance (keeps the
frequency as high as possible), devfreq_userspace (allows control
of the frequency through sysfs), and devfreq_simple_ondemand
(tries to strike a balance between performance and power consumption).
The notifier mechanism built into the operating performance points code can be
used to automatically invoke the governor should the set of available power
points change. There are a number of ways in which that change could come
about; one of those is a change in expectations regarding how quickly the
device can respond. For this case, 3.2 also gained an enhancement to the
quality-of-service (pm_qos) code to handle
per-device QOS requirements. Kernel code can express its QOS expectations
for a device using these functions (all from
<linux/pm_qos.h>):
    int dev_pm_qos_add_request(struct device *dev,
                               struct dev_pm_qos_request *req, s32 value);
    int dev_pm_qos_update_request(struct dev_pm_qos_request *req,
                                  s32 new_value);
    int dev_pm_qos_remove_request(struct dev_pm_qos_request *req);
The dev_pm_qos_request structure is used as a handle for managing
requests, but calling code does not need to access its internals. The
passed value describes the desired quality of service; the
documentation is surprisingly vague on just what the units of
value are. It would appear to describe the desired latency, but
the desired precision is unclear.
On the driver side, the notifier interface is used:
    int dev_pm_qos_add_notifier(struct device *dev,
                                struct notifier_block *notifier);
    int dev_pm_qos_remove_notifier(struct device *dev,
                                   struct notifier_block *notifier);
When a device's quality-of-service requirements are changed, the notifier
will be called with the new value. The driver can then adjust the
available operating performance points, disabling any that would render the
device unable to meet the specified QOS requirement.
It is worth noting that none of the new code has any in-tree users as of
this writing. That suggests that the interface might be more than usually
volatile; once developers try to make use of this facility, they are likely
to find things that can be improved. But, then, internal interfaces are
always subject to change; regardless of any evolution here, the underlying
capability should prove useful.
November 9, 2011
This article was contributed by Neil Brown
Slightly over a year ago, LWN reported on a
couple of different kernel patches aimed at providing fast, or at least
faster, interprocess communication (IPC): Cross Memory Attach (CMA) and
kernel-dbus (kdbus). In one of the related email threads on
the linux-kernel list, a third (pre-existing) kernel patch called KNEM was discussed.
Meanwhile, yet another kernel module - "binder", used by the Android
platform - is in use in millions of devices worldwide to provide fast IPC,
and Linus recently observed that code that is actually used is the code
that is actually worth something, so maybe more of the Android code should
be merged despite objections from some corners. Binder wasn't explicitly
mentioned in that discussion but could reasonably be assumed to be included.
This article is not about whether any of these should be merged or not. That
is largely an engineering and political decision in which this author claims no
expertise, and in any case one of them - CMA - has already been
merged.
Rather, we start with the observation that the existence of this many
attempts to solve essentially the same problem suggests that something is
lacking in Linux. There is, in other words, a real need for fast IPC that
Linux doesn't address. The current approaches to filling this gap seem to be
piecemeal: each patchset clearly addresses the needs of a specific IPC model
without obvious consideration for the others. While this may solve current
problems, it may not solve future problems, and one of the strengths of the
design of Unix, and hence of Linux, is the full exploitation of a few key
ideas rather than the ad hoc accumulation of many distinct (though related)
ideas.
So, motivated by that observation, we will explore these various implementations
to try to discover and describe the commonality they share and to highlight the
key design decisions each one makes. Hopefully this will lead to a greater
understanding of both the problem space and the solution space. Such
understanding may be our best weapon against chaos in the kernel.
What's your address?
One of the interesting differences between the different IPC schemes is
their mechanism for specifying the destination for a message.
CMA uses a process id (PID) number combined with offsets in the address space
of that process - a message is simply copied to that location. This has the
advantage of being very simple and efficient. PIDs are already managed by the
kernel and piggy-backing on that facility is certainly attractive. The obvious
disadvantage is that there is no room for any sophistication in access control,
so messages can only be sent to processes with exactly the same credentials.
This will not suit every context, but it is not a problem for the target area
(MPI, the Message Passing Interface), which is aimed at massively parallel
implementations in which all the processes are working together on one task.
In that case having uniform credentials is an obvious choice.
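CMA went into the mainline as the process_vm_readv() and process_vm_writev() system calls, so the "PID plus address" style of addressing can be sketched from user space. This is a minimal sketch assuming a C library that provides the process_vm_readv() wrapper; the helper names cma_read() and cma_demo() are hypothetical, and the demonstration copies within a single process only to stay self-contained.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Copy len bytes from address remote_addr in process pid into local_buf.
 * Returns the number of bytes copied, or -1 with errno set.
 */
ssize_t cma_read(pid_t pid, void *local_buf, void *remote_addr, size_t len)
{
    struct iovec local = { .iov_base = local_buf, .iov_len = len };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

    assert(local_buf != NULL && remote_addr != NULL);
    return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}

/*
 * Demonstration: address a buffer by (PID, pointer) within this process.
 * Returns 1 if the copy matched, 0 if the kernel refused the call (for
 * example in a sandbox), and -1 if the copied data did not match.
 */
int cma_demo(void)
{
    char src[] = "hello";
    char dst[sizeof(src)] = { 0 };
    ssize_t n = cma_read(getpid(), dst, src, sizeof(src));

    if (n < 0)
        return 0;
    if (n == (ssize_t)sizeof(src) && memcmp(src, dst, sizeof(src)) == 0)
        return 1;
    return -1;
}
```

In real use the PID and the remote address would name a buffer in a different, cooperating process, which must have communicated that address by some other channel first.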
KNEM uses a "cookie" which is a byte string provided by the kernel and
which can be
copied between processes. One process registers a region of memory with KNEM
and receives a cookie in return. It can then pass this cookie to other
processes as a simple byte string; the recipients can then copy to or from the
registered region using that cookie as an address. Here again there is an
assumption that the processes are co-operating and not a threat to each other
(KNEM is also used for MPI). KNEM does not actually check process credentials
directly, so any process that registers a region with KNEM is effectively
allowing any other process that is able to use KNEM (i.e. able to open a
specific character device file) to freely access that memory.
Kdbus follows the model of D-Bus and uses simple strings to direct messages. It
monitors all D-Bus traffic to find out which endpoints own which names and then,
when it sees a message sent to a particular name, it routes it accordingly
rather than letting it go through the D-Bus daemon for routing.
Binder takes a very different approach from the other three. Rather than using
names that appear the same to all processes, binder uses a kernel-internal
object for which different processes see different object descriptors: small
integers much like file descriptors. Each object is owned by a particular
process (which can create new objects quite cheaply) and a message sent to an
object is routed to the owning process. As each process is likely to have a
different descriptor (or none at all) for the one object, descriptors cannot be
passed as byte strings. However they can be passed along with binder messages
much like file descriptors can be passed using Unix-domain sockets.
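For readers unfamiliar with the comparison, descriptor passing over a Unix-domain socket looks roughly like the following. This sketch shows the standard SCM_RIGHTS mechanism, not anything binder-specific; the helper names send_fd() and recv_fd() are chosen for the example.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send descriptor fd over the Unix-domain socket sock; 0 or -1. */
int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;   /* forces suitable alignment */
    } ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(fd >= 0);
    memset(&msg, 0, sizeof(msg));
    memset(&ctl, 0, sizeof(ctl));
    msg.msg_iov = &iov;         /* at least one byte of real data */
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor from sock; returns the new descriptor or -1. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The number that arrives is generally not the number that was sent: the kernel installs a fresh descriptor in the receiver that refers to the same underlying object, which is the property binder's object descriptors share.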
The main reason for using descriptors rather than names appears to
involve reference counting. Binder is designed to work in an object-oriented
system which (unsurprisingly) involves passing messages to objects,
where the messages can contain references to other objects.
This is exactly the pattern seen in the kernel module. Any such system
needs some way
of determining when an object is no longer referenced, the typical
approaches being garbage collection and reference counting. Garbage collection
across multiple different processes is unlikely to be practical, so reference
counting is the natural choice. As binder allows communication between
mutually suspicious processes, there needs to be some degree of enforcement: a
process should not be able to send a message when it doesn't own a reference to
the target, and when a process dies, all its references should be released. To
ensure these rules are met it is hard to come up with any scheme much simpler
than the one used by binder.
Possibly the most interesting observation here is that two addressing schemes
used widely in Linux are completely missing in these implementations: file
descriptors and socket addresses (struct sockaddr).
File descriptors are used for pipes (the original UNIX IPC), for socket pairs
and other connected sockets, for talking to devices, and much more. It is not
hard to imagine them being used by CMA, and binder too. They are
appealing as they can be used with simple read() and write()
calls and similar standard interfaces. The most likely reason that they are
regularly avoided is their cost - they are not exactly lightweight. On
an x86_64 system a struct file - the minimum needed for each
file descriptor - is 288 bytes. Of these, maybe 64 are relevant to many novel
use cases; the rest is dead weight. This weight could possibly be reduced by a
more object-oriented approach to struct
file but such a change would be very intrusive and is unlikely to happen.
So finding other approaches is likely to become common. We see that already
in the inotify subsystem which has "watch descriptors"; we see it here
in binder too.
The avoidance of socket addresses does not seem to admit such a neat
answer. In the
cases of CMA, kdbus, and binder it doesn't seem to fit the need for various
different reasons. For KNEM it seems best explained as arbitrary choice. The
developer chose to write a new character device rather than a new networking
domain (aka address family) and so used ioctl() and ad
hoc addresses
rather than sendmsg()/recvmsg() and socket addresses.
The conclusion here seems to be that there is a constant tension between
protection and performance. Every step we take to control what one process can
do to another by building meaning into an address adds extra setup cost and
management cost. Possibly the practical approach is not to try to choose
between them but to unify them and allow each client to choose. So a
client could register itself with an external address that any other process
can use if it knows it, or with an internal address (like the binder objects)
which can only be used by a process that has explicitly been given it. Further,
a registered address may only accept explicit messages, or may be bound to a
memory region that other processes can read and write directly. If such
addresses and messages could be used interchangeably in the one domain it might
allow a lot more flexibility for innovation.
Publish and subscribe
One area where kdbus stands out from the rest is in support for a
publish/subscribe interface.
Each of the higher-level IPC services (MPI, Binder, D-Bus) has some sort of
multicast or broadcast facility, but only kdbus tries to bring it into the
kernel. This could simply reflect the fact that multicast
does not need to be optimized and can be adequately handled in user space.
Alternatively, it could mean that implementing it in the kernel is too hard,
so few people try.
There are two ways we can think about implementing a publish/subscribe
mechanism. The first follows the example of IP multicast where a certain class
of addresses is defined to be multicast addresses and sockets can
request to receive multicasts to selected addresses. Binder does actually
have a very
limited form of this. Any binder client can ask to be notified when a
particular object dies; when a client closes its handle on the binder
(e.g. when it exits) all the objects it owns die and messages are accordingly
published for all clients who have subscribed to that object. It would be
tempting to turn this into a more general publish/subscribe scheme.
The second way to implement publish/subscribe is through a mechanism like
the Berkeley packet filter that the networking layer
provides. This allows a socket to request to receive all messages, but the
filter removes some of them based on content following an almost arbitrary
program (which can now be JIT compiled). This
is more in line with the approach that kdbus uses. D-Bus allows clients to
present "match" rules such that they receive all messages with content that
matches the rules. kdbus extracts those rules by monitoring D-Bus traffic and
uses them to perform multicast routing in the kernel.
Alban Crequy, the author of kdbus, appears to have been
exploring
both of these approaches. It would be well worth considering this effort
in any new
fast-IPC mechanism introduced into Linux to ensure it meets all use cases well.
Single copy
A recurring goal in many efforts at improving communication speed is to reduce
the number of times that message data is copied in transit. "Zero-copy" is
sometimes seen as the holy grail and, while it is usually
impractical to reach that, single-copy can be attained; three of our four
examples do achieve it.
The fourth,
kdbus, doesn't really try to achieve single-copy. The standard D-Bus mechanism
is four copies - sender to kernel to daemon to kernel to receiver. Kdbus
reduces this to two copies (and more particularly reduces context-switches
to one) which
is quite an improvement. The others all aim for single-copy operation.
CMA and KNEM achieve single-copy performance by providing a system call
which simply copies
from one address space to the other with various restrictions as we have
already seen. This is simple, but not secure in a hostile environment.
Binder is, again, quite different. With binder, part of the address space of
each process is managed by the binder module through the process calling
mmap()
on the binder file descriptor. Binder then allocates pages and places them in
the address space as required.
This mapped memory is read-only to the process; all writing is performed by the
kernel. When a message is sent from one process to another the kernel
allocates some space in the destination process's mapped area, copies the
message directly
from the sending process, and then queues a short message to the receiving
process telling it where the received message is. The recipient can then access
that message directly and will ultimately tell the binder module that it is
finished with the message and that the memory can be reused.
While this approach may seem a little complex - having the kernel effectively
provide a malloc() implementation (best fit as it happens) for the
receiving process - it has the particular benefit that it requires no
synchronization between the sender and the recipient. The copy happens
immediately for the sender and it can then move on assuming it is complete.
The receiver doesn't need to know anything about the message until it is all
there ready and waiting (much better to have the message waiting than
the processes waiting).
This asynchronous behavior is common to all the single-copy mechanisms, which
makes one wonder if using Linux's AIO (Asynchronous Input/Output) subsystem
might provide
another possible approach. The sender could submit an asynchronous write, the
recipient an asynchronous read, and when the second of the two arrives the copy
is performed and each is notified. One unfortunate, though probably minor,
issue with this approach is that, while Linux AIO can submit multiple read and
write requests in a single system call and can receive multiple completion
notifications in another system call, it cannot do both in one. This contrasts
with binder, which has a WRITE_READ ioctl() command
that sends messages and then
waits for the reply, allowing an entire transaction to happen in a single system
call. As we have seen with the addition of
recvmmsg() and, more recently,
sendmmsg(), doing multiple things in a single
system call has real advantages. As Dave Miller
once observed:
The old adage about syscalls being cheap no longer holds when
we're talking about traversing all the way into the protocol
stack socket code every call, taking the socket lock every
time, etc.
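The batching offered by sendmmsg() and recvmmsg() can be shown with a short user-space sketch. It assumes a C library that exposes both calls under _GNU_SOURCE; the helper names send_batch() and recv_batch(), the limit of eight messages, and the 64-byte buffers are arbitrary choices for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH_MAX 8

/* Send count strings as separate datagrams with one system call. */
int send_batch(int sock, const char *const *texts, unsigned int count)
{
    struct mmsghdr msgs[BATCH_MAX];
    struct iovec iovs[BATCH_MAX];
    unsigned int i;

    assert(texts != NULL);
    if (count > BATCH_MAX)
        return -1;
    memset(msgs, 0, sizeof(msgs));
    for (i = 0; i < count; i++) {
        iovs[i].iov_base = (void *)texts[i];
        iovs[i].iov_len = strlen(texts[i]);
        msgs[i].msg_hdr.msg_iov = &iovs[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    /* Returns the number of messages sent, or -1 on error. */
    return sendmmsg(sock, msgs, count, 0);
}

/*
 * Receive up to count queued datagrams with one system call, storing
 * each as a NUL-terminated string of at most 63 bytes. Does not block.
 */
int recv_batch(int sock, char bufs[][64], unsigned int count)
{
    struct mmsghdr msgs[BATCH_MAX];
    struct iovec iovs[BATCH_MAX];
    unsigned int i;
    int n;

    if (count > BATCH_MAX)
        return -1;
    memset(msgs, 0, sizeof(msgs));
    for (i = 0; i < count; i++) {
        iovs[i].iov_base = bufs[i];
        iovs[i].iov_len = 63;
        msgs[i].msg_hdr.msg_iov = &iovs[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    n = recvmmsg(sock, msgs, count, MSG_DONTWAIT, NULL);
    for (i = 0; n > 0 && i < (unsigned int)n; i++)
        bufs[i][msgs[i].msg_len] = '\0';
    return n;
}
```

Even so, sending and receiving remain two separate calls here, which is exactly the gap that binder's combined write-then-read command closes.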
Tracking transactions
All of the high-level APIs for IPC make a distinction between requests and
replies, connecting them in some way to form a single transaction. Most of the
in-kernel support for messaging doesn't preserve this distinction with any real
clarity. Messages are just messages and it is up to user space to
determine how they are
interpreted. The binder module is again an exception; understanding why
helps expose an important aspect of the binder approach.
Though the code and the API do not present it exactly like this, the easiest
way to think about the transaction tracking in binder is to imagine that each
message has a "transaction ID" label. A request and its reply will
have the same label. Further, if the recipient of the message finds that it
needs to make another IPC before it can generate a final reply, it will use the
same label on this intermediate IPC, and will obviously expect it on the
intermediate reply.
With this labeling in place, binder allows (and in fact requires) a thread
which has sent a message, and which is waiting for a reply to that message,
to only receive further messages with the same transaction ID. This rule
allows a thread to respond to recursive calls and, thus, allows that
thread's own original request to progress, but causes it to ignore any new
calls until the current one is complete. If a process is multithreaded,
each thread can work on independent transactions separately, but a single
thread is tied to one complex transaction at a time.
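The delivery rule can be expressed as a toy model. As the text notes, binder's code and API do not present transactions as explicit ID labels, so everything below (the model_msg and model_thread structures and both functions) is a hypothetical illustration of the rule rather than a description of the implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A message carrying the imagined "transaction ID" label. */
struct model_msg {
    unsigned int txn_id;
    const char *body;
};

/* Per-thread state: is it blocked on a reply, and for which transaction? */
struct model_thread {
    bool waiting;
    unsigned int txn_id;
};

/* May message m be handed to thread t right now? */
bool model_may_deliver(const struct model_thread *t,
                       const struct model_msg *m)
{
    if (!t->waiting)
        return true;                /* an idle thread takes any new call */
    return m->txn_id == t->txn_id;  /* only replies and recursive calls */
}

/* Index of the first deliverable message in the queue, or -1 if none. */
int model_next_message(const struct model_thread *t,
                       const struct model_msg *queue, size_t n)
{
    size_t i;

    assert(t != NULL && (queue != NULL || n == 0));
    for (i = 0; i < n; i++) {
        if (model_may_deliver(t, &queue[i]))
            return (int)i;
    }
    return -1;
}
```

A waiting thread thus skips over unrelated new calls, which stay queued for other threads or for later, while anything belonging to its own transaction is let through.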
Apart from possibly simplifying the user-space programming model, this allows
the transaction as a whole to have a single CPU scheduling priority inherited
from the originating process. Binder presents a model that there is just one
thread of control involved in a method call, but that thread may wander
from one address
space to another to carry out different parts of the task. This migration of
process priority allows that model to be more fully honored.
While many of the things that binder does are "a bit different", this is
probably the most unusual. Having the same open file descriptor behave
differently in different threads is not what most of us would expect. Yet it
seems to be a very effective way to implement an apparently useful feature.
Whether this feature is truly generally useful, and whether or not there is a
more idiomatic way to provide it in Linux, are difficult questions. However,
they are questions that need to be addressed if we want the best possible
high-speed IPC in our kernel of choice.
Inter-Programmer Communication
There is certainly no shortage of interesting problems to solve in the Linux
kernel, and equally no shortage of people with innovative and creative
solutions. Here we have seen four quite different approaches to one particular
problem and how each brings value of one sort or another. However each could
probably be improved by incorporating ideas and approaches from one of the
others, or by addressing needs that others present.
My hope is that by exposing and contrasting the different solutions and the
problems they address, we can take a step closer to finding unifying solutions
that address both today's needs and the needs of our grandchildren.
Page editor: Jonathan Corbet