
Kernel development

Brief items

Kernel release status

The current development kernel is 3.2-rc1, released on November 7. "Have fun, give it a good testing. There shouldn't be anything hugely scary in there, but there *is* a lot of stuff. The fact that 3.1 dragged out did mean that this ended up being one of the bigger merge windows, but I'm not feeling *too* nervous about it." There's a new code name - Saber-toothed squirrel - to go with it.

Stable updates: the 2.6.32.47 and 2.6.33.20 stable updates were released on November 7; both contain a long list of important fixes. 2.6.32.48 was released on November 8 to fix some build problems introduced in 2.6.32.47. 2.6.33 users should note that 2.6.33.20 is the final planned update for that kernel.


Quotes of the week

Crash test dummy folds.
KVM mafia wins.
Innovation cries.
-- Dan Magenheimer

Seriously, if someone gave me a tools/term/ tool that has rudimentary xterm functionality with tabbing support, written in pure libdri and starting off a basic fbcon console and taking over the full screen, i'd switch to it within about 0.5 nanoseconds and would do most of my daily coding there and would help out with extending it to more apps (starting with a sane mail client perhaps).
-- Ingo Molnar


Quotas for tmpfs

By Jonathan Corbet
November 9, 2011
The second version of the plumber's wish list for Linux included a request for support for usage quotas on the tmpfs filesystem. Current kernels have no such support, making it easy for local users to execute denial-of-service attacks by filling up /tmp or /dev/shm. Davidlohr Bueso answered that call with a patch providing that support. But it turns out that there is a disagreement over how tmpfs use limits should be managed.

Davidlohr's patch does not actually implement quotas; instead, it adds a new resource limit (RLIMIT_TMPFSQUOTA) controlling how much space a user can occupy on all mounted tmpfs filesystems. This is the approach requested in the wish list; it has some appeal because tmpfs is not a persistent filesystem. Normal filesystem implementations store quotas on the filesystem itself, but tmpfs cannot do that. So use of quotas would require that user space, in some fashion, reload the quota database on every boot (or, depending on the implementation, for every tmpfs mount). Resource limits thus look like the simpler approach.

Even so, there is opposition to the resource limit approach. Developers would rather see tmpfs behave like other filesystems. More to the point, perhaps, users and applications have some clue, some of the time, of how to respond to "quota exceeded" errors; exceeded resource limits are on less solid ground. As Alan Cox pointed out, loading the quotas need not be a big problem; it could be as simple as a mount option setting a default quota for all users.

In the end, it seems unlikely that an implementation based on anything other than disk quotas will be merged, so this patch will need to be reworked.


Kernel development news

The second half of the 3.2 merge window

By Jonathan Corbet
November 8, 2011
Linus announced the 3.2-rc1 release and closed the merge window on November 7. During the two-week window, some 10,214 non-merge changesets were pulled into the mainline kernel. That makes this the most active merge window ever, surpassing the previous record holder (2.6.30, at 9,603 changesets) by a fair margin. The delay in the start of this development cycle certainly caused more work to pile up, but there was also, clearly, just a lot of work going on.

User-visible changes merged since last week's summary include:

  • The device mapper has a new "thin provisioning" capability which, among other things, offers improved snapshot support. This feature is considered experimental in 3.2. See Documentation/device-mapper/thin-provisioning.txt for information on how it works. Also added to the device mapper is a "bufio" module that adds another layer of buffering between the system and a block device; the thin provisioning code is the main user of this feature.

  • There is a new memory-mapped virtio device intended to allow virtualized guests to use virtio-based block and network devices in the absence of PCI support.

  • It is now possible for a process to use poll() on files under /proc/sys; the result is the ability to get a notification when a specific sysctl parameter changes.

  • The btrfs filesystem now records a number of previous tree roots which can be useful in recovering damaged filesystems; see this article for more information. Btrfs has also gained improved readahead support.

  • The I/O-less dirty throttling patch set has been merged; that should improve writeback performance for a number of workloads.

  • New drivers include:

    • Processors and systems: Freescale P3060 QDS boards and non-virtualized PowerPC systems.

    • Block: M-Systems Disk-On-Chip G3 MTD controllers.

    • Media: MaxLinear MXL111SF DVB-T demodulators, Abilis AS102 DVB receivers, and Samsung S5K6AAFX sensors.

    • Miscellaneous: Intel Sandybridge integrated memory controllers, Intel Medfield MSIC (audio/battery/GPIO...) controllers, IDT Tsi721 PCI Express SRIO (RapidIO) controllers, GPIO-based pulse-per-second clients, and STE hardware semaphores.

    • Graduations: the Conexant cx25821 V4L2 driver has moved from staging into the mainline.
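
One of the items above, poll() support for files under /proc/sys, is easy to exercise from user space. Here is a minimal sketch; it assumes a Linux system, and kernel/hostname is one of the few sysctls wired up for change notification:

```c
/* Sketch: wait for a sysctl parameter to change using poll().
 * Assumes a Linux system; only some sysctls support notification. */
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Returns 1 if the parameter changed, 0 on timeout, -1 on error. */
int watch_sysctl(const char *path, int timeout_ms)
{
    char buf[128];
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* An initial read is required before poll() will report changes. */
    if (read(fd, buf, sizeof(buf)) < 0) {
        close(fd);
        return -1;
    }

    struct pollfd pfd = { .fd = fd, .events = POLLERR | POLLPRI };
    int ret = poll(&pfd, 1, timeout_ms);
    close(fd);
    return ret > 0 ? 1 : ret;
}
```

A configuration daemon could call watch_sysctl("/proc/sys/kernel/hostname", -1) to block until the hostname is changed, then re-read the file to pick up the new value.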

Changes visible to kernel developers include:

  • The new GENHD_FL_NO_PART_SCAN device flag suppresses the normal partition scan when a new block device is added to the system.

  • The venerable block layer function __make_request() has been renamed to blk_queue_bio() and exported to modules.

  • The TAINT_OOT_MODULE taint flag is now set when out-of-tree modules are inserted into the kernel. Naturally, the module itself tells the kernel about its provenance, so this mechanism can be circumvented, but anybody trying to do that would certainly be caught and publicly shamed sooner or later.

  • A few macros (EXPORT_SYMBOL_* and THIS_MODULE) have been split out of <linux/module.h> and placed in <linux/export.h>. Code that only needs to export symbols can now use the latter include file; the result is a reduction in kernel compile time.

Despite the size of this development cycle, a number of trees ended up not being pulled. Linus explicitly avoided those that were controversial (FrontSwap and the KVM tool, for example); others seem to have simply been passed over. Some may slip in for -rc2, but, for the most part, the time has come to stabilize all of this code. If the usual pattern holds, the 3.2 release can be expected sometime around mid-January.


Better device power management for 3.2

By Jonathan Corbet
November 8, 2011
The Linux kernel has long had the ability to regulate the CPU's voltage and frequency for optimal behavior, where "optimal" is a function of both performance and power consumption. But a system is more than just a CPU, and there are many other components which are able to run at multiple performance levels. It is unsurprising that a proper infrastructure for managing device operating points has lagged that for the CPU, since the amount of power to be saved is usually smaller. But now that CPU power behavior is fairly well optimized, the power infrastructure is growing to encompass the rest of the system. The 3.2 kernel will have a new set of APIs intended to allow drivers to let the system find the best operating level for the devices they manage.

There are three separate pieces to the dynamic voltage and frequency scaling (DVFS) API, the first of which was actually merged for the 2.6.37 release. The "operating power points" module simply tracks the various operating levels available to a given device; the API is declared in <linux/opp.h>. Briefly, operating points are managed with:

    int opp_add(struct device *dev, unsigned long freq, unsigned long u_volt);
    int opp_enable(struct device *dev, unsigned long freq);
    int opp_disable(struct device *dev, unsigned long freq);

Operating points are enabled by default; a driver may disable specific points to reflect temperature or performance concerns. There is a set of functions for retrieving operating points above or below a given frequency, useful for moving up or down the power/performance scale.
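
As a sketch of how a driver might use this API at probe time (the frequencies, voltages, and the example_probe() name here are invented for illustration):

```c
/* Sketch: register three operating points for a device at probe time. */
static int example_probe(struct device *dev)
{
	int ret;

	ret = opp_add(dev, 300000000, 1000000);		/* 300 MHz at 1.00 V */
	if (!ret)
		ret = opp_add(dev, 600000000, 1100000);	/* 600 MHz at 1.10 V */
	if (!ret)
		ret = opp_add(dev, 800000000, 1200000);	/* 800 MHz at 1.20 V */
	if (ret)
		return ret;

	/* Disable the top point, e.g. until thermal conditions allow it. */
	return opp_disable(dev, 800000000);
}
```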

A driver wanting to support DVFS on a specific device would start by filling in one of these structures (declared, along with the rest of the API, in <linux/devfreq.h>):

    struct devfreq_dev_profile {
	unsigned long initial_freq;
	unsigned int polling_ms;

	int (*target)(struct device *dev, unsigned long *freq);
	int (*get_dev_status)(struct device *dev,
			      struct devfreq_dev_status *stat);
	void (*exit)(struct device *dev);
    };

Here initial_freq is, unsurprisingly, the original operating frequency of the device. Almost everything else in this structure is there to help frequency governors do their jobs. If polling_ms is non-zero, it tells the governor how often to poll the device to get its usage information; that polling will take the form of a call to get_dev_status(). That function should fill the stat structure with the relevant information:

    struct devfreq_dev_status {
	/* both since the last measure */
	unsigned long total_time;
	unsigned long busy_time;
	unsigned long current_frequency;
	void *private_data;
    };

The governor will use this information to decide whether the current operating frequency should be changed. Should a change be needed, the target() callback will be called to change the operating point accordingly. This function should pick a frequency at least as high as the passed-in *freq, then update *freq to reflect the actual frequency chosen. The exit() callback gives the driver a chance to clean things up if the DVFS layer decides to forget about the device.

Once the devfreq_dev_profile structure is filled in, the driver registers it with:

    struct devfreq *devfreq_add_device(struct device *dev,
				       struct devfreq_dev_profile *profile,
				       const struct devfreq_governor *governor,
				       void *data);

If need be, a driver can supply its own governor to manage frequencies, but the kernel supplies a few of its own: devfreq_powersave (keeps the frequency as low as possible), devfreq_performance (keeps the frequency as high as possible), devfreq_userspace (allows control of the frequency through sysfs), and devfreq_simple_ondemand (tries to strike a balance between performance and power consumption).
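
Putting the pieces together, a driver's registration might look roughly like the following sketch. The example_* names, the frequency and utilization numbers, and the example_pick_and_set_freq() helper are all invented for illustration:

```c
/* Sketch: report utilization and register with the simple_ondemand governor. */
static int example_get_dev_status(struct device *dev,
				  struct devfreq_dev_status *stat)
{
	/* These numbers would normally come from hardware counters. */
	stat->total_time = 100;
	stat->busy_time = 40;			/* 40% busy since last poll */
	stat->current_frequency = 600000000;
	return 0;
}

static int example_target(struct device *dev, unsigned long *freq)
{
	/* Pick the lowest supported frequency >= *freq, program the
	 * hardware, then report the frequency actually chosen. */
	*freq = example_pick_and_set_freq(dev, *freq);
	return 0;
}

static struct devfreq_dev_profile example_profile = {
	.initial_freq	= 600000000,
	.polling_ms	= 50,
	.target		= example_target,
	.get_dev_status	= example_get_dev_status,
};

/* In probe():
 *	devfreq_add_device(dev, &example_profile,
 *			   &devfreq_simple_ondemand, NULL);
 */
```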

The notifier mechanism built into the operating power points code can be used to automatically invoke the governor should the set of available power points change. There are a number of ways in which that change could come about; one of those is a change in expectations regarding how quickly the device can respond. For this case, 3.2 also gained an enhancement to the quality-of-service (pm_qos) code to handle per-device QOS requirements. Kernel code can express its QOS expectations for a device using these functions (all from <linux/pm_qos.h>):

    int dev_pm_qos_add_request(struct device *dev, struct dev_pm_qos_request *req,
			       s32 value);
    int dev_pm_qos_update_request(struct dev_pm_qos_request *req, s32 new_value);
    int dev_pm_qos_remove_request(struct dev_pm_qos_request *req);

The dev_pm_qos_request structure is used as a handle for managing requests, but calling code does not need to access its internals. The passed value describes the desired quality of service; the documentation is surprisingly vague on just what the units of value are. It would appear to describe the desired latency, but the precise meaning is unclear.
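
A subsystem that needs a device to respond quickly might express that along these lines (a sketch; the values and the example_* names are invented):

```c
/* Sketch: express, loosen, and withdraw a QOS requirement for a device. */
static struct dev_pm_qos_request example_req;

static void example_constrain(struct device *dev)
{
	/* Ask that the device be able to respond within 100 units. */
	dev_pm_qos_add_request(dev, &example_req, 100);
}

static void example_relax(void)
{
	/* Loosen the constraint, then withdraw it entirely. */
	dev_pm_qos_update_request(&example_req, 500);
	dev_pm_qos_remove_request(&example_req);
}
```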

On the driver side, the notifier interface is used:

    int dev_pm_qos_add_notifier(struct device *dev,
			    	struct notifier_block *notifier);
    int dev_pm_qos_remove_notifier(struct device *dev,
			           struct notifier_block *notifier);

When a device's quality-of-service requirements are changed, the notifier will be called with the new value. The driver can then adjust the available operating power points, disabling any that would render the device unable to meet the specified QOS requirement.
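
On the driver side, hooking in could look something like this sketch; the example_disable_slow_opps() helper is hypothetical:

```c
/* Sketch: react to per-device QOS changes by pruning operating points. */
static int example_qos_notify(struct notifier_block *nb,
			      unsigned long value, void *ignored)
{
	/* 'value' is the new QOS requirement; disable any operating
	 * point whose response latency would exceed it. */
	example_disable_slow_opps(value);
	return NOTIFY_OK;
}

static struct notifier_block example_nb = {
	.notifier_call = example_qos_notify,
};

/* In probe():  dev_pm_qos_add_notifier(dev, &example_nb);  */
```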

It is worth noting that none of the new code has any in-tree users as of this writing. That suggests that the interface might be more than usually volatile; once developers try to make use of this facility, they are likely to find things that can be improved. But, then, internal interfaces are always subject to change; regardless of any evolution here, the underlying capability should prove useful.


Fast interprocess communication revisited

November 9, 2011

This article was contributed by Neil Brown

Slightly over a year ago, LWN reported on a couple of different kernel patches aimed at providing fast, or at least faster, interprocess communication (IPC): Cross Memory Attach (CMA) and kernel-dbus (kdbus). In one of the related email threads on the linux-kernel list, a third (pre-existing) kernel patch called KNEM was discussed. Meanwhile, yet another kernel module - "binder", used by the Android platform - is in use in millions of devices worldwide to provide fast IPC, and Linus recently observed that code that actually is used is the code that is actually worth something, so maybe more of the Android code should be merged despite objections from some corners. Binder wasn't explicitly mentioned in that discussion but could reasonably be assumed to be included.

This article is not about whether any of these should be merged or not. That is largely an engineering and political decision in which this author claims no expertise, and in any case one of them - CMA - has already been merged. Rather, we start with the observation that this many attempts to solve essentially the same problem suggests that something is lacking in Linux. There is, in other words, a real need for fast IPC that Linux doesn't address. The current approaches to filling this gap seem to be piecemeal attempts: each patchset clearly addresses the needs of a specific IPC model without obvious consideration for others. While this may solve current problems, it may not solve future problems, and one of the strengths of the design of Unix, and hence Linux, is the full exploitation of a few key ideas rather than the ad hoc accumulation of many distinct (though related) ideas.

So, motivated by that observation we will explore these various implementations to try to discover and describe the commonality they share and to highlight the key design decisions each one makes. Hopefully this will lead to a greater understanding of both the problem space and the solution space. Such understanding may be our best weapon against chaos in the kernel.

What's your address?

One of the interesting differences between the different IPC schemes is their mechanism for specifying the destination for a message.

CMA uses a process ID (PID) combined with offsets into the address space of that process - a message is simply copied to that location. This has the advantage of being very simple and efficient. PIDs are already managed by the kernel and piggy-backing on that facility is certainly attractive. The obvious disadvantage is that there is no room for any sophistication in access control, so messages can only be sent to processes with exactly the same credentials. This will not suit every context, but it is not a problem for the target area (the MPI message passing interface) which is aimed at massively parallel implementations in which all the processes are working together on one task. In that case having uniform credentials is an obvious choice.

KNEM uses a "cookie" which is a byte string provided by the kernel and which can be copied between processes. One process registers a region of memory with KNEM and receives a cookie in return. It can then pass this cookie to other processes as a simple byte string; the recipients can then copy to or from the registered region using that cookie as an address. Here again there is an assumption that the processes are co-operating and not a threat to each other (KNEM is also used for MPI). KNEM does not actually check process credentials directly, so any process that registers a region with KNEM is effectively allowing any other process that is able to use KNEM (i.e. able to open a specific character device file) to freely access that memory.

Kdbus follows the model of D-Bus and uses simple strings to direct messages. It monitors all D-Bus traffic to find out which endpoints own which names and then, when it sees a message sent to a particular name, it routes it accordingly rather than letting it go through the D-Bus daemon for routing.

Binder takes a very different approach from the other three. Rather than using names that appear the same to all processes, binder uses a kernel-internal object for which different processes see different object descriptors: small integers much like file descriptors. Each object is owned by a particular process (which can create new objects quite cheaply) and a message sent to an object is routed to the owning process. As each process is likely to have a different descriptor (or none at all) for a given object, descriptors cannot be passed as byte strings. However, they can be passed along with binder messages, much as file descriptors can be passed using Unix-domain sockets.

The main reason for using descriptors rather than names appears to involve reference counting. Binder is designed to work in an object-oriented system which (unsurprisingly) involves passing messages to objects, where the messages can contain references to other objects. This is exactly the pattern seen in the kernel module. Any such system needs some way of determining when an object is no longer referenced, the typical approaches being garbage collection and reference counting. Garbage collection across multiple different processes is unlikely to be practical, so reference counting is the natural choice. As binder allows communication between mutually suspicious processes, there needs to be some degree of enforcement: a process should not be able to send a message when it doesn't own a reference to the target, and when a process dies, all its references should be released. To ensure these rules are met it is hard to come up with any scheme much simpler than the one used by binder.

Possibly the most interesting observation here is that two addressing schemes used widely in Linux are completely missing in these implementations: file descriptors and socket addresses (struct sockaddr).

File descriptors are used for pipes (the original UNIX IPC), for socket pairs and other connected sockets, for talking to devices, and much more. It is not hard to imagine them being used by CMA and binder too. They are appealing as they can be used with simple read() and write() calls and similar standard interfaces. The most likely reason that they are regularly avoided is their cost - they are not exactly lightweight. On an x86_64 system a struct file - the minimum needed for each file descriptor - is 288 bytes. Of these, maybe 64 are relevant to many novel use cases; the rest is dead weight. This weight could possibly be reduced by a more object-oriented approach to struct file but such a change would be very intrusive and is unlikely to happen. So finding other approaches is likely to become common. We see that already in the inotify subsystem which has "watch descriptors"; we see it here in binder too.

The avoidance of socket addresses does not seem to admit such a neat answer. In the cases of CMA, kdbus, and binder it doesn't seem to fit the need for various different reasons. For KNEM it seems best explained as an arbitrary choice. The developer chose to write a new character device rather than a new networking domain (aka address family) and so used ioctl() and ad hoc addresses rather than sendmsg()/recvmsg() and socket addresses.

The conclusion here seems to be that there is a constant tension between protection and performance. Every step we take to control what one process can do to another by building meaning into an address adds extra setup cost and management cost. Possibly the practical approach is not to try to choose between them but to unify them and allow each client to choose. So a client could register itself with an external address that any other process can use if it knows it, or with an internal address (like the binder objects) which can only be used by a process that has explicitly been given it. Further, a registered address may only accept explicit messages, or may be bound to a memory region that other processes can read and write directly. If such addresses and messages could be used interchangeably in the one domain it might allow a lot more flexibility for innovation.

Publish and subscribe

One area where kdbus stands out from the rest is in support for a publish/subscribe interface. Each of the higher level IPC services (MPI, Binder, D-Bus) has some sort of multicast or broadcast facility, but only kdbus tries to bring it into the kernel. This could simply reflect the fact that multicast does not need to be optimized and can be adequately handled in user space. Alternatively, it could mean that implementing it in the kernel is too hard so few people try.

There are two ways we can think about implementing a publish/subscribe mechanism. The first follows the example of IP multicast where a certain class of addresses is defined to be multicast addresses and sockets can request to receive multicasts to selected addresses. Binder does actually have a very limited form of this. Any binder client can ask to be notified when a particular object dies; when a client closes its handle on the binder (e.g. when it exits) all the objects it owns die and messages are accordingly published for all clients who have subscribed to that object. It would be tempting to turn this into a more general publish/subscribe scheme.

The second way to implement publish/subscribe is through a mechanism like the Berkeley packet filter that the networking layer provides. This allows a socket to request to receive all messages, but the filter removes some of them based on content following an almost arbitrary program (which can now be JIT compiled). This is more in line with the approach that kdbus uses. D-Bus allows clients to present "match" rules such that they receive all messages with content that matches the rules. kdbus extracts those rules by monitoring D-Bus traffic and uses them to perform multicast routing in the kernel.

Alban Crequy, the author of kdbus, appears to have been exploring both of these approaches. It would be well worth considering this effort in any new fast-IPC mechanism introduced into Linux to ensure it meets all use cases well.

Single copy

A recurring goal in many efforts at improving communication speed is to reduce the number of times that message data is copied in transit. "Zero-copy" is sometimes seen as the holy grail and, while it is usually impractical to reach that, single-copy can be attained; three of our four examples do achieve it. The fourth, kdbus, doesn't really try to achieve single-copy. The standard D-Bus mechanism is four copies - sender to kernel to daemon to kernel to receiver. Kdbus reduces this to two copies (and more particularly reduces context-switches to one) which is quite an improvement. The others all aim for single-copy operation.

CMA and KNEM achieve single-copy performance by providing a system call which simply copies from one address space to the other with various restrictions as we have already seen. This is simple, but not secure in a hostile environment. Binder is, again, quite different. With binder, part of the address space of each process is managed by the binder module through the process calling mmap() on the binder file descriptor. Binder then allocates pages and places them in the address space as required.

This mapped memory is read-only to the process; all writing is performed by the kernel. When a message is sent from one process to another the kernel allocates some space in the destination process's mapped area, copies the message directly from the sending process, and then queues a short message to the receiving process telling it where the received message is. The recipient can then access that message directly and will ultimately tell the binder module that it is finished with the message and that the memory can be reused.

While this approach may seem a little complex - having the kernel effectively provide a malloc() implementation (best fit as it happens) for the receiving process - it has the particular benefit that it requires no synchronization between the sender and the recipient. The copy happens immediately for the sender and it can then move on assuming it is complete. The receiver doesn't need to know anything about the message until it is all there ready and waiting (much better to have the message waiting than the processes waiting).
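
For readers unfamiliar with the strategy, here is a tiny user-space sketch of best-fit allocation over a fixed arena. It is purely illustrative; binder's real allocator also handles freeing, coalescing, and page management:

```c
/* Best-fit allocation over a fixed arena, as an illustration of the
 * strategy binder uses for per-process receive buffers. */
#include <stddef.h>

#define NSLOTS 8
static struct { size_t off, len; int used; } slots[NSLOTS];

void arena_init(size_t size)
{
    slots[0].off = 0; slots[0].len = size; slots[0].used = 0;
    for (int i = 1; i < NSLOTS; i++) {
        slots[i].len = 0;       /* len == 0 marks an unused slot */
        slots[i].used = 0;
    }
}

/* Return the offset of a best-fit block of 'want' bytes, or -1. */
long arena_alloc(size_t want)
{
    int best = -1;

    /* Find the smallest free block that is still big enough. */
    for (int i = 0; i < NSLOTS; i++)
        if (!slots[i].used && slots[i].len >= want &&
            (best < 0 || slots[i].len < slots[best].len))
            best = i;
    if (best < 0 || want == 0)
        return -1;

    /* Split off the remainder into a spare slot, if one is available. */
    if (slots[best].len > want) {
        for (int i = 0; i < NSLOTS; i++)
            if (slots[i].len == 0) {
                slots[i].off = slots[best].off + want;
                slots[i].len = slots[best].len - want;
                slots[i].used = 0;
                break;
            }
        slots[best].len = want;
    }
    slots[best].used = 1;
    return (long)slots[best].off;
}
```

Best fit keeps large free blocks intact for large messages at the cost of a linear search, which is a reasonable trade-off for the small per-process buffers involved.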

This asynchronous behavior is common to all the single-copy mechanisms, which makes one wonder if using Linux's AIO (Asynchronous Input/Output) subsystem might provide another possible approach. The sender could submit an asynchronous write, the recipient an asynchronous read, and when the second of the two arrives the copy is performed and each is notified. One unfortunate, though probably minor, issue with this approach is that, while Linux-aio can submit multiple read and write requests in a single system call and can receive multiple completion notifications in another system call, it cannot do both in one. This contrasts with the binder which has a WRITE_READ ioctl() command that sends messages and then waits for the reply, allowing an entire transaction to happen in a single system call. As we have seen with the addition of recvmmsg() and, more recently, sendmmsg(), doing multiple things in a single system call has real advantages. As Dave Miller once observed:

The old adage about syscalls being cheap no longer holds when we're talking about traversing all the way into the protocol stack socket code every call, taking the socket lock every time, etc.
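
The multi-message calls mentioned above look like this in practice. The sketch below sends and receives two datagrams over a loopback UDP socket with one sendmmsg() and one recvmmsg() call; it assumes a Linux system with those system calls available:

```c
/* Sketch: batch two UDP datagrams into single sendmmsg()/recvmmsg() calls. */
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns the number of messages received, or -1 on error. */
int demo_mmsg(void)
{
    int rx = socket(AF_INET, SOCK_DGRAM, 0);
    int tx = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET };
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    socklen_t alen = sizeof(addr);

    /* Bind the receiver to an ephemeral port and connect the sender. */
    if (rx < 0 || tx < 0 || bind(rx, (struct sockaddr *)&addr, alen) < 0 ||
        getsockname(rx, (struct sockaddr *)&addr, &alen) < 0 ||
        connect(tx, (struct sockaddr *)&addr, alen) < 0)
        return -1;

    char out[2][8] = { "one", "two" }, in[2][8];
    struct iovec ov[2] = { { out[0], 4 }, { out[1], 4 } };
    struct iovec iv[2] = { { in[0], 8 }, { in[1], 8 } };
    struct mmsghdr snd[2] = {
        { .msg_hdr = { .msg_iov = &ov[0], .msg_iovlen = 1 } },
        { .msg_hdr = { .msg_iov = &ov[1], .msg_iovlen = 1 } },
    };
    struct mmsghdr rcv[2] = {
        { .msg_hdr = { .msg_iov = &iv[0], .msg_iovlen = 1 } },
        { .msg_hdr = { .msg_iov = &iv[1], .msg_iovlen = 1 } },
    };

    if (sendmmsg(tx, snd, 2, 0) != 2)
        return -1;
    int n = recvmmsg(rx, rcv, 2, 0, NULL);
    close(rx);
    close(tx);
    return n;
}
```

Each batched call crosses the user/kernel boundary once, where the equivalent sendmsg()/recvmsg() loop would cross it four times.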

Tracking transactions

All of the high-level APIs for IPC make a distinction between requests and replies, connecting them in some way to form a single transaction. Most of the in-kernel support for messaging doesn't preserve this distinction with any real clarity. Messages are just messages and it is up to user space to determine how they are interpreted. The binder module is again an exception; understanding why helps expose an important aspect of the binder approach.

Though the code and the API do not present it exactly like this, the easiest way to think about the transaction tracking in binder is to imagine that each message has a "transaction ID" label. A request and its reply will have the same label. Further, if the recipient of the message finds that it needs to make another IPC before it can generate a final reply, it will use the same label on this intermediate IPC, and will obviously expect it on the intermediate reply.

With this labeling in place, Binder allows (and in fact requires) a thread which has sent a message, and which is waiting for a reply to that message, to only receive further messages with the same transaction ID. This rule allows a thread to respond to recursive calls and, thus, allow that thread's own original request to progress, but causes it to ignore any new calls until the current one is complete. If a process is multithreaded, each thread can work on independent transactions separately, but a single thread is tied to one complex transaction at a time.

Apart from possibly simplifying the user-space programming model, this allows the transaction as a whole to have a single CPU scheduling priority inherited from the originating process. Binder presents a model in which there is just one thread of control involved in a method call, but that thread may wander from one address space to another to carry out different parts of the task. This migration of process priority allows that model to be more fully honored.

While many of the things that binder does are "a bit different", this is probably the most unusual. Having the same open file descriptor behave differently in different threads is not what most of us would expect. Yet it seems to be a very effective way to implement an apparently useful feature. Whether this feature is truly generally useful, and whether there is a more idiomatic way to provide it in Linux, are difficult questions. However, they are questions that need to be addressed if we want the best possible high-speed IPC in our kernel of choice.

Inter-Programmer Communication

There is certainly no shortage of interesting problems to solve in the Linux kernel, and equally no shortage of people with innovative and creative solutions. Here we have seen four quite different approaches to one particular problem and how each brings value of one sort or another. However, each could probably be improved by incorporating ideas and approaches from the others, or by addressing needs that the others expose.

My hope is that by exposing and contrasting the different solutions and the problems they address, we can take a step closer to finding unifying solutions that address both today's needs and the needs for our grandchildren.


Patches and updates

Kernel trees

Linus Torvalds Linux 3.2-rc1 Nov 08
Con Kolivas 3.1.0-ck1 Nov 03
Thomas Gleixner 3.0.8-rt22 Nov 08
Thomas Gleixner 3.0.8-rt23 Nov 08
Greg KH Linux 2.6.33.20 Nov 07
Greg KH Linux 2.6.32.47 Nov 07
Greg KH Linux 2.6.32.48 Nov 09

Architecture-specific

Core kernel code

Development tools

Device drivers

Documentation

Christoph Hellwig XFS status update for October 2011 Nov 07

Filesystems and block I/O

Davidlohr Bueso tmpfs: support user quotas Nov 07

Memory management

Jerome Marchand Enforce RSS+Swap rlimit Nov 07
Johannes Weiner memcg naturalization -rc5 Nov 08

Networking

Glauber Costa per-cgroup tcp memory pressure Nov 07
Johannes Berg monitor-less AP mode part 1 Nov 07
Ian Campbell skb paged fragment destructors Nov 09

Virtualization and containers

Miscellaneous

Pekka Enberg Linux KVM tool for v3.2 Nov 04
Stefan Behrens Btrfs: runtime integrity check tool Nov 09

Page editor: Jonathan Corbet


Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds