Stable updates: the 2.6.32.47 and 2.6.33.20 stable updates were released on November 7; both contain a long list of important fixes. 2.6.32.48 was released on November 8 to fix some build problems introduced in 2.6.32.47. 2.6.33 users should note that 2.6.33.20 is the final planned update for that kernel.
Davidlohr's patch does not actually implement quotas; instead, it adds a new resource limit (RLIMIT_TMPFSQUOTA) controlling how much space a user can occupy on all mounted tmpfs systems. This is the approach requested in the wish list; it has some appeal because tmpfs is not a persistent filesystem. Normal filesystem implementations store quotas on the filesystem itself, but tmpfs cannot do that. So use of quotas would require that user space, in some fashion, reload the quota database on every boot (or, depending on the implementation, for every tmpfs mount). Resource limits look like a simpler situation.
Even so, there is opposition to the resource limit approach. Developers would rather see tmpfs behave like other filesystems. More to the point, perhaps, users and applications have some clue, some of the time, on how to respond to "quota exceeded" errors. Blown resource limits, instead, are on less solid ground. As Alan Cox pointed out, loading the quotas need not be a big problem; it could be as simple as a mount option setting a default quota for all users.
In the end, it seems unlikely that an implementation based on anything other than disk quotas will be merged, so this patch will need to be reworked.
User-visible changes merged since last week's summary include:
Changes visible to kernel developers include:
Despite the size of this development cycle, a number of trees ended up not being pulled. Linus explicitly avoided those that were controversial (FrontSwap and the KVM tool, for example); others seem to have simply been passed over. Some may slip in for -rc2, but, for the most part, the time has come to stabilize all of this code. If the usual pattern holds, the 3.2 release can be expected sometime around mid-January.
There are three separate pieces to the dynamic voltage and frequency scaling (DVFS) API, the first of which was actually merged for the 2.6.37 release. The "operating power points" module simply tracks the various operating levels available to a given device; the API is declared in <linux/opp.h>. Briefly, operating points are managed with:
int opp_add(struct device *dev, unsigned long freq, unsigned long u_volt);
int opp_enable(struct device *dev, unsigned long freq);
int opp_disable(struct device *dev, unsigned long freq);
Operating points are enabled by default; a driver may disable specific points to reflect temperature or performance concerns. There is a set of functions for retrieving operating points above or below a given frequency, useful for moving up or down the power/performance scale.
A driver wanting to support DVFS on a specific device would start by filling in one of these structures (declared, along with the rest of the API, in <linux/devfreq.h>):
struct devfreq_dev_profile {
unsigned long initial_freq;
unsigned int polling_ms;
int (*target)(struct device *dev, unsigned long *freq);
int (*get_dev_status)(struct device *dev,
struct devfreq_dev_status *stat);
void (*exit)(struct device *dev);
};
Here initial_freq is, unsurprisingly, the original operating frequency of the device. Almost everything else in this structure is there to help frequency governors do their jobs. If polling_ms is non-zero, it tells the governor how often to poll the device to get its usage information; that polling will take the form of a call to get_dev_status(). That function should fill the stat structure with the relevant information:
struct devfreq_dev_status {
/* both since the last measure */
unsigned long total_time;
unsigned long busy_time;
unsigned long current_frequency;
void *private_data;
};
The governor will use this information to decide whether the current operating frequency should be changed or not. Should a change be needed, the target() callback will be called to change the operating point accordingly. This function should pick a frequency at least as high as the passed in *freq, then update *freq to reflect the actual frequency chosen. The exit() callback gives the driver a chance to clean things up if the DVFS layer decides to forget about the device.
Once the devfreq_dev_profile structure is filled in, the driver registers it with:
struct devfreq *devfreq_add_device(struct device *dev,
struct devfreq_dev_profile *profile,
const struct devfreq_governor *governor,
void *data);
If need be, a driver can supply its own governor to manage frequencies, but the kernel supplies a few of its own: devfreq_powersave (keeps the frequency as low as possible), devfreq_performance (keeps the frequency as high as possible), devfreq_userspace (allows control of the frequency through sysfs), and devfreq_simple_ondemand (tries to strike a balance between performance and power consumption).
The notifier mechanism built into the operating power points code can be used to automatically invoke the governor should the set of available power points change. There are a number of ways in which that change could come about; one of those is a change in expectations regarding how quickly the device can respond. For this case, 3.2 also gained an enhancement to the quality-of-service (pm_qos) code to handle per-device QOS requirements. Kernel code can express its QOS expectations for a device using these functions (all from <linux/pm_qos.h>):
int dev_pm_qos_add_request(struct device *dev, struct dev_pm_qos_request *req,
s32 value);
int dev_pm_qos_update_request(struct dev_pm_qos_request *req, s32 new_value);
int dev_pm_qos_remove_request(struct dev_pm_qos_request *req);
The dev_pm_qos_request structure is used as a handle for managing requests, but calling code does not need to access its internals. The passed value describes the desired quality of service; the documentation is surprisingly vague on just what the units of value are. It would appear to describe the desired latency, but the desired precision is unclear.
On the driver side, the notifier interface is used:
int dev_pm_qos_add_notifier(struct device *dev,
struct notifier_block *notifier);
int dev_pm_qos_remove_notifier(struct device *dev,
struct notifier_block *notifier);
When a device's quality-of-service requirements are changed, the notifier will be called with the new value. The driver can then adjust the available operating power points, disabling any that would render the device unable to meet the specified QOS requirement.
It is worth noting that none of the new code has any in-tree users as of this writing. That suggests that the interface might be more than usually volatile; once developers try to make use of this facility, they are likely to find things that can be improved. But, then, internal interfaces are always subject to change; regardless of any evolution here, the underlying capability should prove useful.
Slightly over a year ago, LWN reported on a couple of different kernel patches aimed at providing fast, or at least faster, interprocess communication (IPC): Cross Memory Attach (CMA) and kernel-dbus (kdbus). In one of the related email threads on the linux-kernel list, a third (pre-existing) kernel patch called KNEM was discussed. Meanwhile, yet another kernel module - "binder" used by the Android platform - is in use in millions of devices worldwide to provide fast IPC, and Linus recently observed that code that actually is used is the code that is actually worth something so maybe more of the Android code should be merged despite objections from some corners. Binder wasn't explicitly mentioned in that discussion but could reasonably be assumed to be included.
This article is not about whether any of these should be merged or not. That is largely an engineering and political decision in which this author claims no expertise, and in any case one of them - CMA - has already been merged. Rather we start with the observation that this many attempts to solve essentially the same problem suggests that something is lacking in Linux. There is, in other words, a real need for fast IPC that Linux doesn't address. The current approaches to filling this gap seem to be piecemeal attempts: Each patchset is clearly addressing the needs of a specific IPC model without obvious consideration for others. While this may solve current problems, it may not solve future problems, and one of the strengths of the design of Unix and hence Linux is the full exploitation of a few key ideas rather than the ad hoc accumulation of many distinct (though related) ideas.
So, motivated by that observation we will explore these various implementations to try to discover and describe the commonality they share and to highlight the key design decisions each one makes. Hopefully this will lead to a greater understanding of both the problem space and the solution space. Such understanding may be our best weapon against chaos in the kernel.
One of the interesting differences between the different IPC schemes is their mechanism for specifying the destination for a message.
CMA uses a process id (PID) number combined with offsets in the address space of that process - a message is simply copied to that location. This has the advantage of being very simple and efficient. PIDs are already managed by the kernel and piggy-backing on that facility is certainly attractive. The obvious disadvantage is that there is no room for any sophistication in access control, so messages can only be sent to processes with exactly the same credentials. This will not suit every context, but it is not a problem for the target area (the MPI message passing interface) which is aimed at massively parallel implementations in which all the processes are working together on one task. In that case having uniform credentials is an obvious choice.
KNEM uses a "cookie" which is a byte string provided by the kernel and which can be copied between processes. One process registers a region of memory with KNEM and receives a cookie in return. It can then pass this cookie to other processes as a simple byte string; the recipients can then copy to or from the registered region using that cookie as an address. Here again there is an assumption that the processes are co-operating and not a threat to each other (KNEM is also used for MPI). KNEM does not actually check process credentials directly, so any process that registers a region with KNEM is effectively allowing any other process that is able to use KNEM (i.e. able to open a specific character device file) to freely access that memory.
Kdbus follows the model of D-Bus and uses simple strings to direct messages. It monitors all D-Bus traffic to find out which endpoints own which names and then, when it sees a message sent to a particular name, it routes it accordingly rather than letting it go through the D-Bus daemon for routing.
Binder takes a very different approach from the other three. Rather than using names that appear the same to all processes, binder uses a kernel-internal object for which different processes see different object descriptors: small integers much like file descriptors. Each object is owned by a particular process (which can create new objects quite cheaply) and a message sent to an object is routed to the owning process. As each process is likely to have a different descriptor (or none at all) for the one object, descriptors cannot be passed as byte strings. However they can be passed along with binder messages much like file descriptors can be passed using Unix-domain sockets.
The main reason for using descriptors rather than names appears to involve reference counting. Binder is designed to work in an object-oriented system which (unsurprisingly) involves passing messages to objects, where the messages can contain references other objects. This is exactly the pattern seen in the kernel module. Any such system needs some way of determining when an object is no longer referenced, the typical approaches being garbage collection and reference counting. Garbage collection across multiple different processes is unlikely to be practical, so reference counting is the natural choice. As binder allows communication between mutually suspicious processes, there needs to be some degree of enforcement: a process should not be able to send a message when it doesn't own a reference to the target, and when a process dies, all its references should be released. To ensure these rules are met it is hard to come up with any scheme much simpler than the one used by binder.
Possibly the most interesting observation here is that two addressing schemes used widely in Linux are completely missing in these implementations: file descriptors and socket addresses (struct sockaddr).
File descriptors are used for pipes (the original UNIX IPC), for socket pairs and other connected sockets, for talking to devices, and much more. It is not hard to imagine them being used by CMA, and binder too. They are appealing as they can be used with simple read() and write() calls and similar standard interfaces. The most likely reason that they are regularly avoided is their cost - they are not exactly lightweight. On an x86_64 system a struct file - the minimum needed for each file descriptor - is 288 bytes. Of these, maybe 64 are relevant to many novel use cases, the rest is dead weight. This weight could possibly be reduced by a more object-oriented approach to struct file but such a change would be very intrusive and is unlikely to happen. So finding other approaches is likely to become common. We see that already in the inotify subsystem which has "watch descriptors"; we see it here in binder too.
The avoidance of socket addresses does not seem to admit such a neat answer. In the cases of CMA, kdbus, and binder it doesn't seem to fit the need for various different reasons. For KNEM it seems best explained as arbitrary choice. The developer chose to write a new character device rather than a new networking domain (aka address family) and so used ioctl() and ad hoc addresses rather than sendmsg()/recvmsg() and socket addresses.
The conclusion here seems to be that there is a constant tension between protection and performance. Every step we take to control what one process can do to another by building meaning into an address adds extra setup cost and management cost. Possibly the practical approach is not to try to choose between them but to unify them and allow each client to choose. So a client could register itself with an external address that any other process can use if it knows it, or with an internal address (like the binder objects) which can only be used by a process that has explicitly been given it. Further, a registered address may only accept explicit messages, or may be bound to a memory region that other processes can read and write directly. If such addresses and messages could be used interchangeably in the one domain it might allow a lot more flexibility for innovation.
One area where kdbus stands out from the rest is in support for a publish/subscribe interface. Each of the higher level IPC services (MPI, Binder, D-Bus) have some sort of multicast or broadcast facility, but only kdbus tries to bring it into the kernel. This could simply reflect the fact that multicast does not need to be optimized and can be adequately handled in user space. Alternately it could mean that implementing it in the kernel is too hard so few people try.
There are two ways we can think about implementing a publish/subscribe mechanism. The first follows the example of IP multicast where a certain class of addresses is defined to be multicast addresses and sockets can request to receive multicasts to selected addresses. Binder does actually have a very limited form of this. Any binder client can ask to be notified when a particular object dies; when a client closes its handle on the binder (e.g. when it exits) all the objects it owns die and messages are accordingly published for all clients who have subscribed to that object. It would be tempting to turn this into a more general publish/subscribe scheme.
The second way to implement publish/subscribe is through a mechanism like the Berkeley packet filter that the networking layer provides. This allows a socket to request to receive all messages, but the filter removes some of them based on content following an almost arbitrary program (which can now be JIT compiled). This is more in line with the approach that kdbus uses. D-Bus allows clients to present "match" rules such that they receive all messages with content that matches the rules. kdbus extracts those rules by monitoring D-Bus traffic and uses them to perform multicast routing in the kernel.
Alban Crequy, the author of kdbus, appears to have been exploring both of these approaches. It would be well worth considering this effort in any new fast-IPC mechanism introduced into Linux to ensure it meets all use cases well.
A recurring goal in many efforts at improving communication speed is to reduce the number of times that message data is copied in transit. "Zero-copy" is sometime seen as the holy-grail and, while it is usually impractical to reach that, single-copy can be attained; three of our four examples do achieve it. The fourth, kdbus, doesn't really try to achieve single-copy. The standard D-Bus mechanism is four copies - sender to kernel to daemon to kernel to receiver. Kdbus reduces this to two copies (and more particularly reduces context-switches to one) which is quite an improvement. The others all aim for single-copy operation.
CMA and KNEM achieve single-copy performance by providing a system call which simply copies from one address space to the other with various restrictions as we have already seen. This is simple, but not secure in a hostile environment. Binder is, again, quite different. With binder, part of the address space of each process is managed by the binder module through the process calling mmap() on the binder file descriptor. Binder then allocates pages and places them in the address space as required.
This mapped memory is read-only to the process, all writing is performed by the kernel. When a message is sent from one process to another the kernel allocates some space in the destination process's mapped area, copies the message directly from the sending process, and then queues a short message to the receiving process telling it where the received message is. The recipient can then access that message directly and will ultimately tell the binder module that it is finished with the message and that the memory can be reused.
While this approach may seem a little complex - having the kernel effectively provide a malloc() implementation (best fit as it happens) for the receiving process - it has the particular benefit that it requires no synchronization between the sender and the recipient. The copy happens immediately for the sender and it can then move on assuming it is complete. The receiver doesn't need to know anything about the message until it is all there ready and waiting (much better to have the message waiting than the processes waiting).
This asynchronous behavior is common to all the single-copy mechanisms, which makes one wonder if using Linux's AIO (Asynchronous Input/Output) subsystem might provide another possible approach. The sender could submit an asynchronous write, the recipient an asynchronous read, and when the second of the two arrives the copy is performed and each is notified. One unfortunate, though probably minor, issue with this approach is that, while Linux-aio can submit multiple read and write requests in a single system call and can receive multiple completion notifications in another system call, it cannot do both in one. This contrasts with the binder which has a WRITE_READ ioctl() command that sends messages and then waits for the reply, allowing an entire transaction to happen in a single system call. As we have seen with addition of recvmmsg() and, more recently, sendmmsg(), doing multiple things in a single system call has real advantages. As Dave Miller once observed:
All of the high-level APIs for IPC make a distinction between requests and replies, connecting them in some way to form a single transaction. Most of the in-kernel support for messaging doesn't preserve this distinction with any real clarity. Messages are just messages and it is up to user space to determine how they are interpreted. The binder module is again an exception; understanding why helps expose an important aspect of the binder approach.
Though the code and the API do not present it exactly like this, the easiest way to think about the transaction tracking in binder is to imagine that each message has a "transaction ID" label. A request and its reply will have the same label. Further, if the recipient of the message finds that it needs to make another IPC before it can generate a final reply, it will use the same label on this intermediate IPC, and will obviously expect it on the intermediate reply.
With this labeling in place, Binder allows (and in fact requires) a thread which has sent a message, and which is waiting for a reply to that message, to only receive further messages with the same transaction ID. This rule allows a thread to respond to recursive calls and, thus, allow that thread's own original request to progress, but causes it to ignore any new calls until the current one is complete. If a process is multithreaded, each thread can work on independent transactions separately, but a single thread is tied to one complex transaction at a time.
Apart from possibly simplifying the user-space programming model, this allows the transaction as a whole to have a single CPU scheduling priority inherited from the originating process. Binder presents a model that there is just one thread of control involved in a method call, but that thread may wander from one address space to another to carry out different parts of the task. This migration of process priority allows that model to be more fully honored.
While many of the things that binder does are "a bit different", this is probably the most unusual. Having the same open file descriptor behave differently in different threads is not what most of us would expect. Yet it seems to be a very effective way to implement an apparently useful feature. Whether this feature is truly generally useful, and whether or not there is a more idiomatic way to provide it in Linux are difficult questions. However they are questions that need to be addressed if we want the best possible high-speed IPC in our kernel of choice.
There is certainly no shortage of interesting problems to solve in the Linux kernel, and equally no shortage of people with innovative and creative solutions. Here we have seen four quite different approaches to one particular problem and how each brings value of one sort or another. However each could probably be improved by incorporating ideas and approaches from one of the others, or by addressing needs that others present.
My hope is that by exposing and contrasting the different solutions and the problems they address, we can take a step closer to finding unifying solutions that address both today's needs and the needs for our grandchildren.
Page editor: Jonathan Corbet
Next page:
Distributions>>
Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds