
Kernel development

Release status

Kernel release status

The current 2.6 prepatch is 2.6.24-rc3, released by Linus on November 16. Along with a lot of fixes it contains support for newer I/OAT devices and a patch marking the PID namespace feature as "experimental." See the short-form changelog for a list of patches, or the full changelog for the details.

As of this writing, a very small number of post-rc3 fixes has been merged into the mainline git repository.

The current stable 2.6 release is 2.6.23.8, released on November 16. A couple of days earlier, Greg Kroah-Hartman had started a new stable update review with this note:

Ok, I've been slacking on the -stable front for a bit here, and didn't realize how far behind I've gotten. Everyone has been sending patches in, which is great, but now we are facing a HUGE 114 patch release.

As a way of making life easier for reviewers, he split those patches into several distinct chunks, each of which has now come out as a stable release. So we have 2.6.23.2 (core kernel changes), 2.6.23.3 (architecture-specific fixes), 2.6.23.4 (networking), 2.6.23.5 (network drivers), 2.6.23.6 (other drivers), 2.6.23.7 (filesystems), and 2.6.23.8 (security fixes - but note that there are security-related fixes in the other updates too). The 2.6.23.9 update, featuring 29 patches, is in the review process currently.

For older kernels: 2.6.22.13 was released on November 16 with (only) security fixes. The 2.6.22.14 release, with a couple dozen fixes, is in the review process as of this writing.

For ancient kernels: 2.4.35.4 was released on November 17 with a handful of fixes. 2.4.36-pre2 was also released with many of the same fixes.


Kernel development news

Quotes of the week

And along the 802.11n front, I'm _this_ close to getting a major chipset's specs, so hopefully we might have some more work for you to do soon. Now of only the lawyers would hurry up...
-- Greg Kroah-Hartman builds the suspense

I always considered HIGHMEM to just be unusable. It's ok for extending to 2-4GB (ie HIGHMEM4G, not 64G), and it's probably borderline usable for 4-8G if you are careful.

But quite frankly, I refuse to even care about anything past that. If you have 12G (or heaven forbid, even more) in your machine, and you can't be bothered to just upgrade to a 64-bit CPU, then quite frankly, *I* personally can't be bothered to care.

-- Linus Torvalds


sys_indirect()

By Jonathan Corbet
November 19, 2007
Creating user-space APIs is a hard task. Even if an interface seems complete and well designed when it is created, the future can often add new requirements which the old API is hard-put to satisfy. So, for example, Unix started with the wait() system call. As applications got more complicated, it became necessary to wait for a specific process, to get more information about exiting processes, to wait in a non-blocking manner, and so on. So now, in addition to wait(), we have waitid(), waitpid(), and wait4(). Since old versions of system calls can (almost) never go away, changing needs over time tend to cause a proliferation of new calls.

Most recently, Ulrich Drepper has been asking for the ability to add flags to system calls which create file descriptors, but which have no flags argument. Examples of these include socket() and accept(). It is possible to adjust the behavior of file descriptors created with these system calls after the fact (with fcntl()), but there will always be a period during which the file descriptors exist, but the desired behavior has not been set. When that behavior is "close on exec," and a multi-threaded program is running, one thread might run a new program with exec() before another one has managed to set the "close on exec" flag. The result of this race is a leaked file descriptor which can, in turn, be a security problem. The only efficient way to close this particular race is for the kernel to create file descriptors with the desired flags set from the outset.
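
The two-step pattern described above can be sketched as follows. This is only an illustration of the race window using standard socket() and fcntl() calls; the helper name socket_then_cloexec() is made up for this example.

```c
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Create a socket, then mark it close-on-exec in a second step.
 * Returns the new descriptor, or -1 on failure.
 * (socket_then_cloexec() is a hypothetical helper, not a real API.)
 */
int socket_then_cloexec(int domain, int type, int protocol)
{
    int fd = socket(domain, type, protocol);
    if (fd < 0)
        return -1;

    /*
     * Race window: the descriptor exists but FD_CLOEXEC is not yet set.
     * If another thread calls fork() and exec() right here, the new
     * program inherits fd.
     */

    int flags = fcntl(fd, F_GETFD);
    if (flags < 0 || fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

No amount of care in user space removes the gap between the two calls; it can only be closed if the descriptor is created with the flag already set.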

Traditionally, this sort of problem would be solved through the creation of a new system call; one could, for example, add a four-argument socket4() which has the requisite flags parameter. This solution is unsatisfying, though; as has been seen, it leads to an ever-growing list of system calls. So it would be nice to find a different solution. Ulrich thinks he has done so by adding a single system call (indirect()), which works by passing additional information to existing system calls.

It should be noted that the first sys_indirect() implementation was created by Davide Libenzi back in July. Ulrich wasn't entirely happy with that code, though:

Davide's previous implementation is IMO far more complex than warranted. This code here is trivial, as you can see. I've discussed this approach with Linus last week and for a brief moment we actually agreed on something.

The prototype for the new system call looks something like this:

    int indirect(struct indirect_registers *regs,
                 void *userparams,
                 size_t paramslen,
                 int flags);

The regs structure holds the process registers normally used in system calls; the system call number and its (normal) arguments, in other words. The extra parameters to be passed to the system call live in userparams, with a length of paramslen. The flags argument is currently unused; it's there for any sort of future expansion, since extending indirect() with itself is not allowed.

The task_struct structure has been extended with a new field:

    union indirect_params indirect_params;

This union is meant to contain fields for each sort of parameter which can be added to a system call; in Ulrich's patch it looks like this:

    union indirect_params {
        struct {
            int flags;
        } file_flags;
    };

It can, thus, be used to pass a flags argument to system calls which deal in file descriptors.

When indirect() is called, it checks the requested system call number against an internal whitelist. If the specific system call has not been marked as being extensible in this way, the call fails with EINVAL. Otherwise the application-supplied parameters are copied into the current process's task_struct structure and the system call is invoked in the usual way. Once that system call completes, the indirect_params area in the task structure is zeroed.

The kernel provides no indication to the system call that it has been invoked via indirect(); the only difference in that case is that there might be non-zero values in indirect_params. So, in a sense, this mechanism can be seen as a way to add parameters to system calls with a default value of zero. So it is not possible, without some additional work, to add a parameter to a system call where passing a value of zero has a different meaning than omitting the parameter altogether.
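
The dispatch sequence described above (check the whitelist, copy the extra parameters into the task, run the call, zero the area) can be modeled in user space. The following is a toy sketch, not Ulrich's patch: the toy_* names, the two-entry "system call table," and the choice to reject oversized parameter blocks with EINVAL are all assumptions made for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

union indirect_params {
    struct {
        int flags;
    } file_flags;
};

/* Stand-in for the per-task state; the real field lives in task_struct. */
struct toy_task {
    union indirect_params indirect_params;
};
static struct toy_task current_task;

enum { TOY_SYS_SOCKET, TOY_SYS_GETPID, TOY_NR_SYSCALLS };

/* A "system call" that consults the extra flags; they default to zero. */
long toy_socket(void)
{
    return current_task.indirect_params.file_flags.flags;
}

long toy_getpid(void)
{
    return 42;
}

typedef long (*toy_syscall_fn)(void);
static const toy_syscall_fn toy_table[TOY_NR_SYSCALLS] = { toy_socket, toy_getpid };
static const int toy_extensible[TOY_NR_SYSCALLS] = { 1, 0 };  /* the whitelist */

long toy_indirect(int nr, const void *userparams, size_t paramslen)
{
    long ret;

    if (nr < 0 || nr >= TOY_NR_SYSCALLS || !toy_extensible[nr])
        return -EINVAL;
    if (paramslen > sizeof(current_task.indirect_params))
        return -EINVAL;  /* assumed behavior for oversized blocks */

    /* Copy the caller's extra parameters into the task, then run the call. */
    memcpy(&current_task.indirect_params, userparams, paramslen);
    ret = toy_table[nr]();

    /* Zero the area so a later direct call sees the default values. */
    memset(&current_task.indirect_params, 0, sizeof(current_task.indirect_params));
    return ret;
}
```

In this model toy_socket() cannot tell a direct call from an indirect one that passed flags of zero, which is the limitation noted above.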

Should a need for yet another parameter materialize in the future, the size of the indirect_params structure can be increased as needed. As long as the kernel retains the old behavior when the new parameter has a value of zero, older applications and libraries will continue to operate as they did before. The extended system call need not (and cannot) know whether the larger indirect_params structure is being used or not.

There is a possible use for this mechanism beyond extending system calls: the syslet developers see it as a possible way of specifying asynchronous behavior. The current syslet patches are essentially an indirect wrapper layer around system calls which specifies that the call is asynchronous (and what to do with the results). Adding two separate indirect layers for system calls seems like a suboptimal solution, so there is interest in adding syslet information to indirect() instead. That is one of the intended purposes for the currently-unused flags argument.

Naturally, it would be surprising to see applications ever making calls to indirect(), well, directly. A much more likely scenario is for uses of indirect() to be buried inside the C library, which would then export a more straightforward interface to the application.

While some developers (including Linus, evidently) like this patch set, others are less enthusiastic. David Miller was blunt in his review, saying: "I think this indirect syscall stuff is the most ugly interface I've ever seen proposed for the kernel." H. Peter Anvin is also unimpressed:

I think it is a horrible kluge. It's yet another multiplexer, which we are trying desperately to avoid in the kernel. Just to make things more painful, it is a multiplexer which creates yet another ad hoc calling convention, whereas we should strive to make the kernel calling convention as uniform as possible.

So it would not be surprising if this new system call were to evolve somewhat before making its way into the mainline - it's a new and somewhat tricky API which could certainly benefit from discussion. But there are some real needs driving this work. So chances are that indirect() will eventually show up, in some form, in mainline kernels.


Supporting electronic paper

November 19, 2007

This article was contributed by Jaya Kumar

The familiar CRT monitors or backlit LCD screens on our desks continuously consume power in order to hold an image. Electronic paper (e-paper) is different: power is only needed to change the image. Just like paper, e-paper is able to hold the image permanently without consuming any power. Displays using CRT, backlit LCD, plasma and OLED technologies are all emissive, meaning that they have to produce the photons that reach the eye. This implies that they have to compete in brightness with ambient lighting, which can result in eye strain. E-paper is the opposite: it is reflective, which makes it possible to read the display using ambient light even in the brightness of a hot sunny day.

E-paper is referred to as a bistable or non-volatile technology because of its ability to hold a specific pixel state without power. There are several variations of e-paper; they differ in terms of which physical mechanism is used to achieve the non-volatility of the display. These mechanisms include interferometric modulation, bi-stable twisted nematic liquid crystal, cholesteric liquid crystal, and electrophoretic phenomena.

Interferometric modulation uses the same principle of light wave interference that results in the rainbow of colors seen with oil floating on water. Control of wave interference through bi-stable or multi-stable micro-electro-mechanical systems (MEMS) is what enables electronic control of the color of a pixel.

In standard twisted nematic liquid crystal displays (TNLCD), the liquid crystal is sandwiched between two rubbed polymer orthogonal alignment layers. Bi-stable twisted nematic implementations such as Zenithal liquid crystal replace the first, or both, alignment layers with a sub-micron relief profile that weakens anchoring to the surface and makes it possible to latch various stable orientations of a liquid crystal pixel using electrical pulses.

Cholesteric liquid crystal provides the ability to selectively reflect various ranges of wavelengths of visible light based on the pitch of the liquid crystal. The pitch can then be electronically controlled to set various pixel states.

Electrophoresis describes the fact that particles within a fluid can be kinetically affected by an electrical field. Basically, applying a voltage pulse causes pigment particles within a solvent solution to move. This concept is what is used to control whether a pixel appears black, white or a shade of gray. This article will focus on electrophoretic displays since they are relatively easy to obtain.

Controllers

Traditional display controllers are interfaced to the host using a bus such as PCI Express or AGP. These controllers have local framebuffer memory or sufficient internal line buffering to utilize shared host memory; they expose their framebuffers through memory mappable regions. Display servers like Xorg or Xfbdev that utilize the kernel's fbdev interface expect to be able to mmap() the device framebuffer. The implication is that a driver that implemented only write()/seek() access to the framebuffer would have limited usage.

Electrophoretic displays require specialized controllers that are capable of driving suitable waveforms in order to control the display media. This is because of subtle issues around the behavior of pigment particles within a solvent. The controller must drive waveforms that result in fast, reproducible and optimal movement of pigment particles. These waveforms are a key factor in minimizing pixel update latency, achieving good contrast and reducing ghosting effects in the output image. Currently, electrophoretic display updates are significantly slower than CRT or LCD display updates. For example, a grayscale update of E-Ink's most recent Vizplex display material can take up to 740ms. This latency has an effect on how hardware is interfaced with electrophoretic display controllers and how software should then interact with the display.

One of the electrophoretic display controllers for which Linux support has been posted (tarball) is a controller from E-Ink called Apollo. This controller is interfaced to the host through 8-bit data and 6-bit control over General Purpose IO (GPIO) interfaces. The implication of the use of GPIO is that it is not a memory mappable interface. Each pixel of the framebuffer has to be wiggled to the controller by turning individual GPIO lines on and off. Display updates on the Apollo with an E-Ink 6" panel with a resolution of 800x600 and 2 bits of grayscale require between 500ms and 1200ms. Given this set of circumstances, it would have been an option to implement a userspace library or support code that performed the GPIO wiggling. However, such an implementation would forfeit support from Xfbdev and other common fbdev compatible applications.

An early driver implementation has also been posted for an E-Ink controller named Metronome. This controller interfaces to the host using the Active Matrix LCD (AMLCD) bus. AMLCD is a 16-bit data bus used to interface LCDs with CPUs. Normally, the AMLCD bus is used to transfer video display data only, but, in the case of the Metronome controller, the host transfers a whole slew of things including waveform, command and image data. The Metronome becomes a secondary display controller feeding on the output of the primary display controller on the host. Since AMLCD is an output-only data path, two GPIO pins are used to retrieve status from the controller.

Many embedded processors provide a built-in LCD controller (LCDC) that is compatible with the AMLCD interface. For example, the Xscale pxa255 cpu has an LCDC that has DMA support and is able to pull data directly from host memory at specified intervals. This type of capability allows drivers to remap host memory to form an mmap-able framebuffer. However, the Metronome controller imposes an additional requirement beyond delivering image data for each display update. This is the need for a specific display update command that has to be formed and set each time the display is to be updated. This means that the framebuffer driver needs to know when the framebuffer has been updated. That is not a trivial task because the nature of a memory-mapped framebuffer is that the driver is not involved in changes to the buffer; it is therefore unaware of when the framebuffer has been written to by a userspace application.

The three problems described so far can therefore be summarized as follows:

  1. How to memory map a "non memory mappable" IO interface like GPIO.
  2. How to mitigate the latency associated with display updates.
  3. How to cheaply detect when userspace has written to a memory mapped address.

One early solution to problem 3 was to use a timer and perform framebuffer differencing to detect the changed pixels. The negative aspects of this solution are that it requires a large amount of redundant memory and significant cpu and memory bandwidth consumption every time that framebuffer differencing is done. Both of these resources are scarce on embedded systems and, therefore, that solution was not satisfactory.
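
A minimal sketch of the differencing approach makes its costs visible. The function name and the byte-per-element comparison are assumptions for illustration; the point is that a full shadow copy of the framebuffer must be kept, and that every byte of both buffers is read on every timer tick even when nothing has changed.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Compare the live framebuffer against a shadow copy of the same size.
 * Returns the number of bytes that changed and brings the shadow up to date.
 * (fb_diff_and_sync() is a hypothetical helper, not code from any driver.)
 */
size_t fb_diff_and_sync(const uint8_t *fb, uint8_t *shadow, size_t len)
{
    size_t changed = 0;

    for (size_t i = 0; i < len; i++) {
        if (fb[i] != shadow[i]) {
            shadow[i] = fb[i];  /* remember what was last sent to the device */
            changed++;
        }
    }
    return changed;
}
```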

Deferred IO

Deferred IO is an alternative method of solving these problems. The key concept behind deferred IO is that one can periodically mark an active page of host memory as read-only in order to catch writes to it. The way it works is quite straightforward: page table entries for framebuffer pages in host memory are initially marked as read-only. When the application first writes to any memory address that maps to any of those pages, a deferred IO specific page fault handler is reached. This handler schedules a delayed workqueue job. In the interval before this workqueue is executed, the application can continue to write to that page with no additional cost.

When the workqueue task executes, it re-marks the page table entry as read-only and then processes the framebuffer data stored in that page. At that point, the processed data can be delivered to the device through its native IO interface, which could be GPIO, AMLCD, USB, or anything else. Since the page has been re-marked as read-only, the sequence repeats if the application ever rewrites that page. This is somewhat similar to a writeback cache. Host memory is used as a cache for device memory or any output destined for the device. The page fault is then used as a trigger to determine when to actually "writeback" this memory to the device.
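
The same write-protect, fault, flush cycle can be imitated from user space with mprotect() and a SIGSEGV handler. The sketch below is an analogue for illustration only and is not how the kernel implements deferred IO; the fb_* names are invented, the delayed workqueue is replaced by an explicit fb_flush() call, and it assumes Linux behavior when returning from the signal handler.

```c
#define _DEFAULT_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define FB_PAGES 4

static uint8_t *fb;                            /* host-memory "framebuffer" */
static size_t page_size;
static volatile sig_atomic_t dirty[FB_PAGES];  /* pages written since last flush */
static volatile sig_atomic_t fault_count;

/* The first write to a read-only page lands here. */
static void on_write_fault(int sig, siginfo_t *info, void *ctx)
{
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)fb;
    size_t idx;

    (void)sig;
    (void)ctx;
    if (addr < base || addr >= base + FB_PAGES * page_size)
        _exit(1);                              /* an unrelated, genuine fault */

    idx = (addr - base) / page_size;
    dirty[idx] = 1;
    fault_count++;
    /* mprotect() is not formally async-signal-safe; acceptable for a demo. */
    mprotect(fb + idx * page_size, page_size, PROT_READ | PROT_WRITE);
}

/* Map the buffer read-only and install the fault handler. */
int fb_init(void)
{
    struct sigaction sa;

    page_size = (size_t)sysconf(_SC_PAGESIZE);
    fb = mmap(NULL, FB_PAGES * page_size, PROT_READ,
              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (fb == MAP_FAILED)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_write_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}

/* Volatile store so the write is not reordered around the counters. */
void fb_write(size_t offset, uint8_t value)
{
    ((volatile uint8_t *)fb)[offset] = value;
}

/* Stand-in for the delayed work: re-protect dirty pages and "send" them. */
int fb_flush(void)
{
    int flushed = 0;

    for (size_t i = 0; i < FB_PAGES; i++) {
        if (!dirty[i])
            continue;
        dirty[i] = 0;
        mprotect(fb + i * page_size, page_size, PROT_READ);
        /* A real driver would push this page to the device here. */
        flushed++;
    }
    return flushed;
}

int fb_faults(void) { return fault_count; }
size_t fb_page_size(void) { return page_size; }
```

Only the first write to each page after a flush pays for a fault; later writes to the same page proceed at normal memory speed until the next flush re-protects it.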

This technique solves problem 1 because host memory is used and can therefore be memory mapped. The output from the application intended for the device is written to host memory and, unlike hardware supported memory mapped IO, this output is not transferred to the device for each memory write. It is only after the driver-specified delay has expired that this collected data is transferred to the device. The fact that the transfer would be through GPIO or any other mechanism is transparent to the application and requires no intervention.

The delay between the page fault and the IO is what addresses problem 2. The application sees only a framebuffer which happens to be in host memory. Writes to the framebuffer are therefore as fast as writes to any other part of memory. The display update latency is therefore transparent to the application. The driver-specified interval should be selected to be appropriate for the latency of the device. For example, if the device has a one-second display update latency, then a one-second delay would be reasonable. A longer delay would make the display less interactive than it is really capable of being. A shorter delay would result in host updates building up since the device would not be able to keep up. Applications that require display synchronization primitives could use fsync() or the FBIO_WAITFORVSYNC ioctl depending on their needs.

Problem 3 is solved because the address that is the cause of the page fault is known. Internally, deferred IO uses the memory management subsystem's page_mkwrite() callback and page_mkclean() to implement the core of its functionality. The current deferred IO implementation passes a list of page structures to the framebuffer driver's deferred IO callback. The driver can then use page->index to identify which part of the framebuffer was written to. This provides PAGE_SIZE granularity in identifying the updated pixels.
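
To make the PAGE_SIZE granularity concrete, here is a small sketch of how a page index could be turned into the range of scanlines that the page overlaps. It is not taken from any driver: the 4096-byte page size, the rows_for_page() name, and the linear layout with a fixed stride in bytes are all assumptions.

```c
#include <stddef.h>

#define FB_PAGE_SIZE 4096u  /* assumed page size */

struct row_range {
    size_t first;  /* first scanline overlapping the page */
    size_t last;   /* last scanline overlapping the page */
};

/*
 * Map a framebuffer page index to the scanlines it covers.
 * stride is the length of one scanline in bytes; height is the number
 * of scanlines. The caller is assumed to pass a page inside the buffer.
 */
struct row_range rows_for_page(size_t page_index, size_t stride, size_t height)
{
    struct row_range r;
    size_t start = page_index * FB_PAGE_SIZE;
    size_t end = start + FB_PAGE_SIZE - 1;

    r.first = start / stride;
    r.last = end / stride;
    if (r.last >= height)
        r.last = height - 1;  /* the final page may extend past the image */
    return r;
}
```

With an 800-byte stride, for example, each dirty page forces roughly five or six scanlines to be treated as changed, no matter how few pixels were actually written.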

Status

This method works fine in common use cases. For example, if one were to run xpdf and use page-up to flip through pages, then xpdf would render to the framebuffer in host memory on each page-up. Then, at the end of each write induced interval, the driver would deliver the current image to the display. This would give the effect where one would be seeing the most recent page on the display rather than every single page that had been flipped through. This enables the system to be reasonably interactive. Applications like xclock (an analog clock ticking every second) as well as playback applications (displaying a slider showing playback position) behave in a similar fashion.

Deferred IO support was merged into the Linux kernel in 2.6.22; Documentation/fb/deferred_io.txt contains additional information. The driver for the Apollo controller was also merged in 2.6.22 and is in drivers/video/hecubafb.c. The driver for the Metronome controller is posted but not yet complete; it also includes necessary bugfixes for deferred IO.

The current development focus is on the Metronome controller. It is being tested with a Gumstix Connex board which has an Xscale pxa255 CPU. The display media that is being used is an E-Ink Vizplex 6" 800x600 panel with 3 bits of grayscale. The metronomefb driver for this controller uses deferred IO and is still a work-in-progress but it is capable of running Xfbdev. X clients such as xclock, xeyes, xlogo and xloadimage have been run without problems. It is not yet clear how to measure framebuffer performance on such a system; the reason for this is that most display benchmarks use the time for a drawing operation to complete as the basis for performance statistics. On this system, such a benchmark would be merely measuring time to render to host memory rather than time to deliver to the actual display. It may be necessary to develop an alternate method of measuring display system performance for e-paper displays.

All is not yet perfect. Applications that render images that affect only a small number of pixels but cross multiple pages because of the framebuffer layout (eg: a thin vertical image) result in reduced efficiency. This is because the ratio of changed pixels to the number of written pages is low.
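
The effect can be estimated with a little arithmetic. The sketch below counts the pages touched by a rectangle, assuming (purely for illustration) a linear framebuffer with one byte per pixel, an 800-byte stride and 4096-byte pages; pages_touched() is a made-up helper.

```c
#include <stddef.h>

#define FB_PAGE_SIZE 4096u  /* assumed page size */

/*
 * Count the distinct pages written when a w-by-h rectangle at (x, y) is
 * drawn into a linear framebuffer. Units are bytes; w and h must be at
 * least 1, and the rectangle must fit within one stride.
 */
size_t pages_touched(size_t x, size_t y, size_t w, size_t h, size_t stride)
{
    size_t count = 0;
    size_t last_page = (size_t)-1;  /* sentinel: no page seen yet */

    for (size_t row = y; row < y + h; row++) {
        size_t first = (row * stride + x) / FB_PAGE_SIZE;
        size_t last = (row * stride + x + w - 1) / FB_PAGE_SIZE;

        /* Offsets only grow from row to row, so pages never repeat
         * except for the one most recently counted. */
        for (size_t p = first; p <= last; p++) {
            if (p != last_page) {
                count++;
                last_page = p;
            }
        }
    }
    return count;
}
```

Under these assumptions, an 800-pixel horizontal line at the top of the screen dirties a single page, while a 600-pixel vertical line in the first column dirties 117 pages, which is nearly the whole framebuffer, for fewer changed pixels.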

The architectural weakness of deferred IO is that it depends on the system having an MMU. It may be possible to implement a similar approach using the lower level memory protection capabilities that are available on some no-MMU systems. For example, the Blackfin architecture has a Data Cacheability Protection Lookaside Buffer (DCPLB) that has notions of read/write permissions on its entries. This will be an interesting area for future exploration.

The current implementation only works with framebuffers allocated from virtual memory. Support needs to be implemented to achieve the same functionality with memory obtained from kmalloc() or the DMA layer.

There have been suggestions that this technique may be useful in other areas. One scenario that has been mentioned is optimizing display bandwidth consumption by switching between DMA and plain memory copies based on the number of written pages. Another scenario is USB-to-VGA adapters. It may also be the case that any device connected via a relatively slow bus where the data flow is primarily output could benefit from a similar approach.

Acknowledgments: the author is grateful to E-Ink engineers for their extensive support and hardware help, Peter Zijlstra, Antonino Daplas, Paul Mundt, Geert Uytterhoeven, Hugh Dickins, James Simmons and others for mm, fbdev, and general help.


PID namespaces in the 2.6.24 kernel

November 19, 2007

This article was contributed by Pavel Emelyanov and Kir Kolyshkin

One of the new features in the upcoming 2.6.24 kernel will be the PID namespaces support developed by the OpenVZ team with the help of IBM. The PID namespace allows for creating sets of tasks, with each such set looking like a standalone machine with respect to process IDs. In other words, tasks in different namespaces can have the same IDs.

This feature is the major prerequisite for the migration of containers between hosts; having a namespace, one may move it to another host while keeping the PID values -- and this is a requirement since a task is not expected to change its PID. Without this feature, the migration will very likely fail, as the processes with the same IDs can exist on the destination node, which will cause conflicts when addressing tasks by their IDs.

PID namespaces are hierarchical; once a new PID namespace is created, all the tasks in the current PID namespace will see the tasks (i.e. will be able to address them with their PIDs) in this new namespace. However, tasks from the new namespace will not see the ones from the current one. This means that each task now has more than one PID -- one for each namespace in which it is visible.

User-space API

To create a new namespace, one should just call the clone(2) system call with the CLONE_NEWPID flag set. After this, it is useful to change the root directory and mount a new procfs instance in the /proc to make the common utilities like ps work. Note that since the parent knows the PID of its child, it may wait() in the usual way for it to exit.

The first task in a new namespace will have a PID of 1. Thus, it will be this namespace's init and child reaper, so all the orphaned tasks will be re-parented to it. Unlike the standalone machine, this "init" can die, and in this case, the whole namespace will be terminated.
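
A minimal sketch of creating a namespace follows. It uses the glibc clone() wrapper with CLONE_NEWPID and checks only that the child sees itself as PID 1; it does not change the root directory or mount procfs. The function names are invented for the example, and the call is expected to fail without sufficient privilege or on kernels lacking PID namespace support.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

static char child_stack[64 * 1024] __attribute__((aligned(16)));

/* Runs as the first task of the new namespace. */
static int child_main(void *arg)
{
    (void)arg;
    /* Exit status 0 means "I see myself as PID 1". */
    return syscall(SYS_getpid) == 1 ? 0 : 1;
}

/*
 * Returns 1 if the child saw itself as PID 1, 0 if it did not, and -1 if
 * the namespace could not be created (no privilege, no kernel support).
 */
int child_is_init_in_new_pidns(void)
{
    int status;
    /* The stack grows down on most architectures, so pass its top. */
    pid_t child = clone(child_main, child_stack + sizeof(child_stack),
                        CLONE_NEWPID | SIGCHLD, NULL);

    if (child < 0)
        return -1;
    /* The parent knows the child by a PID from its own namespace. */
    if (waitpid(child, &status, 0) != child)
        return -1;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

The value returned by clone() in the parent is the child's PID in the parent's namespace, so the same task is known by two different numbers.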

Since now we will have isolated sets of tasks, we should make proc show only the set of PIDs which is visible for a particular task. To achieve this goal, procfs should be mounted multiple times -- once for each namespace. After this the PIDs that are shown in the mounted instance will be from the namespace which created that mount.

For example, a user may create some new proc_2 directory, spawn a PID namespace and mount a procfs to it. After this, the user will be able to see the PIDs as they appear inside this new namespace. There will be the PID number 1, which is the namespace's init, and all the other PIDs may coincide with some PIDs from the current namespace, but refer to some other task.

No other changes in the user API are necessary. Tasks still have the ability to get their PIDs, PGIDs, etc. with the known system calls. They can also work with sessions and groups. Tasks may create threads and work with futexes.

Internal API

All the PIDs that a task may have are described in the struct pid. This structure contains the ID value, the list of tasks having this ID, the reference counter and the hashed list node to be stored in the hash table for a faster search.

A few more words about the lists of tasks. Basically a task has three PIDs: the process ID (PID), the process group ID (PGID), and the session ID (SID). The PGID and the SID may be shared between the tasks, for example, when two or more tasks belong to the same group, so each group ID addresses more than one task.

With the PID namespaces this structure becomes elastic. Now, each PID may have several values, with each one being valid in one namespace. That is, a task may have PID of 1024 in one namespace, and 256 in another. So, the former struct pid changes.

Here is what struct pid looked like before the introduction of PID namespaces:

    struct pid {
        atomic_t count;                         /* reference counter */
        int nr;                                 /* the pid value */
        struct hlist_node pid_chain;            /* hash chain */
        struct hlist_head tasks[PIDTYPE_MAX];   /* lists of tasks */
        struct rcu_head rcu;                    /* RCU helper */
    };

And this is how it looks now:

    struct upid {
        int nr;                                 /* moved from struct pid */
        struct pid_namespace *ns;               /* the namespace this value
                                                 * is visible in
                                                 */
        struct hlist_node pid_chain;            /* moved from struct pid */
    };

    struct pid {
        atomic_t count;
        struct hlist_head tasks[PIDTYPE_MAX];
        struct rcu_head rcu;
        int level;                              /* the number of upids */
        struct upid numbers[0];
    };

As you can see, the struct upid now represents the PID value -- it is stored in the hash and has the PID value. To convert the struct pid to the PID or vice versa one may use a set of helpers like task_pid_nr(), pid_nr_ns(), find_task_by_vpid(), etc.

All these calls have some information in their names:

..._nr()
These operate on the so-called "global" PIDs. Global PIDs are the numbers that are unique in the whole system, just like the old PIDs were. E.g. pid_nr(pid) will tell you the global PID of the given struct pid. These are only useful when the PID value is not going to leave the kernel. For example, some code needs to save the PID and then find the task by it. However, in this case saving a direct pointer to the struct pid is preferable, as global PIDs are going to be used in kernel logs only.

..._vnr()
These helpers work with the "virtual" PID, i.e. with the ID as seen by a process. For example, task_pid_vnr(tsk) will tell you the PID of a task, as this task sees it (with sys_getpid()). Note that this value will most likely be useless if you're working in another namespace, so these are always used when working with the current task, since all tasks always see their virtual PIDs.

..._nr_ns()
These work with the PIDs as seen from the specified namespace. If you want to get some task's PID (for example, to report it to the userspace and find this task later), you may call task_pid_nr_ns(tsk, current->nsproxy->pid_ns) to get the number, and then find the task using find_task_by_pid_ns(pid, current->nsproxy->pid_ns). These are used in system calls, when the PID comes from the user space. In this case one task may address another which exists in another namespace.
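
The three naming conventions can be illustrated with a toy user-space model of the structures shown earlier. This is a simplified sketch and not the kernel code: the toy_* names are invented, the hash chains and reference counts are left out, the numbers[] array has a fixed size, and level is taken here to be the index of the deepest namespace in which the PID is visible.

```c
#include <stddef.h>

#define TOY_MAX_LEVEL 4  /* arbitrary nesting limit for the model */

struct toy_pid_namespace {
    int level;                         /* 0 for the initial namespace */
    struct toy_pid_namespace *parent;
};

struct toy_upid {
    int nr;                            /* the value seen in ns */
    struct toy_pid_namespace *ns;
};

struct toy_pid {
    int level;                         /* index of the deepest namespace */
    struct toy_upid numbers[TOY_MAX_LEVEL];
};

/* "Global" PID: the value in the initial namespace. */
int toy_pid_nr(const struct toy_pid *pid)
{
    return pid->numbers[0].nr;
}

/* "Virtual" PID: the value in the task's own, deepest namespace. */
int toy_pid_vnr(const struct toy_pid *pid)
{
    return pid->numbers[pid->level].nr;
}

/* PID as seen from ns, or 0 if the task is not visible from there. */
int toy_pid_nr_ns(const struct toy_pid *pid, const struct toy_pid_namespace *ns)
{
    if (ns->level <= pid->level && pid->numbers[ns->level].ns == ns)
        return pid->numbers[ns->level].nr;
    return 0;
}
```

In this model, a task living in a child namespace with values 1024 and 256 reports 1024 as its global PID and 256 as its virtual PID, while a task living only in the initial namespace has no number at all when viewed from the child, which matches the one-way visibility of the hierarchy.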

Conclusion

The interface as described here has been merged for the 2.6.24 kernel release. It has, however, been marked as "experimental" to prevent its wide deployment by distributors while some remaining issues are worked out. Few, if any, changes to this API are expected between now and when the "experimental" tag is removed in a later kernel release.



Page editor: Jonathan Corbet

Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds