Kernel release status
The current 2.6 prepatch is 2.6.24-rc3,
released by Linus on
November 16. Along with a lot of fixes it contains support for newer
I/OAT devices and a patch marking the PID namespace feature as
"experimental." See
the short-form changelog for a list of patches, or the full changelog for
the details.
As of this writing, a very small number of post-rc3 fixes has been merged
into the mainline git repository.
The current stable 2.6 release is 2.6.23.8, released on
November 16. A couple of days earlier,
Greg Kroah-Hartman had started a
new stable update review with this note:
Ok, I've been slacking
on the -stable front for a bit here, and didn't realize how far behind I've
gotten. Everyone has been sending patches in, which is great, but now we
are facing a HUGE 114 patch release.
As a way of making life
easier for reviewers, he split those patches into several distinct chunks,
each of which has now come out as a stable release. So we have 2.6.23.2 (core kernel changes),
2.6.23.3 (architecture-specific fixes),
2.6.23.4 (networking),
2.6.23.5 (network drivers),
2.6.23.6 (other drivers),
2.6.23.7 (filesystems), and
2.6.23.8 (security fixes - but note that
there are security-related fixes in the other updates too). The 2.6.23.9
update, featuring 29 patches, is in the review
process currently.
For older kernels: 2.6.22.13 was
released on November 16 with (only) security fixes. The 2.6.22.14 release,
with a couple dozen fixes, is in the review
process as of this writing.
For ancient kernels: 2.4.35.4 was released on
November 17 with a handful of fixes. 2.4.36-pre2 was also released
with many of the same fixes.
Kernel development news
Quotes of the week
And along the 802.11n front, I'm _this_ close to getting a major
chipset's specs, so hopefully we might have some more work for you
to do soon. Now of only the lawyers would hurry up...
--
Greg Kroah-Hartman builds the suspense
I always considered HIGHMEM to just be unusable. It's ok for
extending to 2-4GB (ie HIGHMEM4G, not 64G), and it's probably
borderline usable for 4-8G if you are careful.
But quite frankly, I refuse to even care about anything past
that. If you have 12G (or heaven forbid, even more) in your
machine, and you can't be bothered to just upgrade to a 64-bit CPU,
then quite frankly, *I* personally can't be bothered to care.
--
Linus Torvalds
sys_indirect()
By Jonathan Corbet
November 19, 2007
Creating user-space APIs is a hard task. Even if an interface seems
complete and well designed when it is created, the future can often add new
requirements which the old API is hard-put to satisfy. So, for example,
Unix started with the
wait() system call. As applications got
more complicated, it became necessary to wait for a specific process, to
get more information about exiting processes, to wait in a non-blocking
manner, and so on. So now, in addition to
wait(), we have
waitid(),
waitpid(), and
wait4(). Since old
versions of system calls can (almost) never go away, changing needs over
time tend to cause a proliferation of new calls.
Most recently, Ulrich Drepper has been asking for the ability to add flags
to system calls which create file descriptors, but which have no flags
argument. Examples of these include socket() and
accept(). It is possible to adjust the behavior of file
descriptors created with these system calls after the fact (with
fcntl()), but there will always be a period during which the file
descriptors exist, but the desired behavior has not been set. When that
behavior is "close on exec," and a multi-threaded program is running, one
thread might run a new program with exec() before another one has
managed to set the "close on exec" flag. The result of this race is a
leaked file descriptor which can, in turn, be a security
problem. The only efficient way to close this particular race is for the
kernel to create file descriptors with the desired flags set from the
outset.
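The racy two-step sequence described above can be sketched in C as follows. This is a minimal illustration using only the standard socket() and fcntl() calls; the helper name socket_then_cloexec() is hypothetical and does not come from Ulrich's patch.

```c
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a socket, then set close-on-exec afterwards.
 * Returns the new descriptor, or -1 on failure.
 */
int socket_then_cloexec(int domain, int type, int protocol)
{
    int fd = socket(domain, type, protocol);
    int fdflags;

    if (fd < 0)
        return -1;

    /*
     * Race window: the descriptor exists here without FD_CLOEXEC. If
     * another thread calls exec() at this point, the new program
     * inherits the descriptor.
     */
    fdflags = fcntl(fd, F_GETFD);
    if (fdflags < 0 || fcntl(fd, F_SETFD, fdflags | FD_CLOEXEC) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

No amount of care inside this helper closes the window between the two calls; only a descriptor that is created with the flag already set avoids it.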
Traditionally, this sort of problem would be solved through the creation of
a new system call; one could, for example, add a four-argument
socket4() which has the requisite flags parameter. This
solution is unsatisfying, though; as has been seen, it leads to an
ever-growing list of system calls. So it would be nice to find a different
solution. Ulrich thinks he has done so by adding a single system call
(indirect()), which works by passing additional information to
existing system calls.
It should be noted that the first
sys_indirect() implementation was created by Davide Libenzi
back in July. Ulrich wasn't entirely happy with that code, though:
Davide's previous implementation is IMO far more complex than
warranted. This code here is trivial, as you can see. I've
discussed this approach with Linus last week and for a brief moment
we actually agreed on something.
The prototype for the new system call looks something like this:
    int indirect(struct indirect_registers *regs,
                 void *userparams,
                 size_t paramslen,
                 int flags);
The regs structure holds the process registers normally used in
system calls; the system call number and its (normal) arguments, in other
words. The extra parameters to be passed to the system call live in
userparams, with a length of paramslen. The
flags argument is currently unused; it's there for any sort of
future expansion, since extending indirect() with itself is not
allowed.
The task_struct structure has been extended with a new field:
    union indirect_params indirect_params;
This union is meant to contain fields for each sort of parameter which can
be added to a system call; in Ulrich's patch it looks like this:
    union indirect_params {
        struct {
            int flags;
        } file_flags;
    };
It can, thus, be used to pass a flags argument to system calls
which deal in file descriptors.
When indirect() is called, it checks the requested system call
number against an internal whitelist. If the specific system call has not
been marked as being extensible in this way, the call fails with
EINVAL. Otherwise the application-supplied parameters are copied
into the current process's task_struct structure and the system
call is invoked in the usual way. Once that system call completes, the
indirect_params area in the task structure is zeroed.
The kernel provides no indication to the system call that it has been
invoked via indirect(); the only difference in that case is that
there might be non-zero values in indirect_params. So, in a
sense, this mechanism can be seen as a way to add parameters to system
calls with a default value of zero. So it is not possible, without some
additional work, to add a parameter to a system call where passing a value
of zero has a different meaning than omitting the parameter altogether.
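The sequence just described (whitelist check, copy, call, clear) can be modeled in user space. The sketch below is only a toy under stated assumptions: the model_ names, the system call numbers, the flag value and the length check are all hypothetical, and only the union indirect_params layout is taken from the patch as quoted above.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Layout as quoted from the patch. */
union indirect_params {
    struct {
        int flags;
    } file_flags;
};

/* Hypothetical stand-in for the per-task state kept in task_struct. */
struct model_task {
    union indirect_params indirect_params;
};

static struct model_task model_current;

/* Hypothetical system call numbers for the model. */
enum { MODEL_NR_SOCKET, MODEL_NR_GETPID, MODEL_NR_MAX };

/* Hypothetical flag value; not a real kernel constant. */
#define MODEL_FLAG_CLOEXEC 1

/* The whitelist: only the toy socket call may be extended. */
static int model_is_extensible(int nr)
{
    return nr == MODEL_NR_SOCKET;
}

/*
 * Toy system calls. The socket model simply reports the flags it would
 * apply; it cannot tell whether it was reached through model_indirect().
 */
long model_dispatch(int nr)
{
    switch (nr) {
    case MODEL_NR_SOCKET:
        return model_current.indirect_params.file_flags.flags;
    case MODEL_NR_GETPID:
        return 1234;
    default:
        return -ENOSYS;
    }
}

long model_indirect(int nr, const void *userparams, size_t paramslen)
{
    long ret;

    if (nr < 0 || nr >= MODEL_NR_MAX || !model_is_extensible(nr))
        return -EINVAL;
    /* Assumption: oversized parameter blocks are rejected. */
    if (paramslen > sizeof(model_current.indirect_params))
        return -EINVAL;

    /* Unsupplied trailing bytes keep their default value of zero. */
    memset(&model_current.indirect_params, 0,
           sizeof(model_current.indirect_params));
    memcpy(&model_current.indirect_params, userparams, paramslen);

    ret = model_dispatch(nr);

    /* Clear the area so that later direct calls see only zeroes. */
    memset(&model_current.indirect_params, 0,
           sizeof(model_current.indirect_params));
    return ret;
}
```

In this model, calling model_dispatch(MODEL_NR_SOCKET) directly behaves exactly like an indirect call that passes a flags value of zero, which is the "default value of zero" property discussed above.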
Should a need for yet another parameter materialize in the future, the size
of the indirect_params union can be increased as needed. As
long as the kernel retains the old behavior when the new parameter has a
value of zero, older applications and libraries will continue to operate as
they did before. The extended system call need not (and cannot) know
whether the larger indirect_params union is being used or not.
There is a possible use for this mechanism beyond extending system calls:
the syslet developers see it as a possible way of specifying asynchronous
behavior. The current syslet patches are essentially an indirect wrapper
layer around system calls which specifies that the call is asynchronous
(and what to do with the results). Adding two separate indirect layers for
system calls seems like a suboptimal solution, so there is interest in
adding syslet information to indirect() instead. That is one of
the intended purposes for the currently-unused flags argument.
Naturally, it would be surprising to see applications ever making calls to
indirect(), well, directly. A much more likely scenario is for
uses of indirect() to be buried inside the C library, which would
then export a more straightforward
interface to the application.
While some developers (including Linus, evidently) like this patch set,
others are less enthusiastic. David Miller was
blunt in his review, saying: "I think this indirect syscall stuff
is the most ugly interface I've ever seen proposed for the kernel."
H. Peter Anvin is also unimpressed:
I think it is a horrible kluge. It's yet another multiplexer,
which we are trying desperately to avoid in the kernel. Just to
make things more painful, it is a multiplexer which creates yet
another ad hoc calling convention, whereas we should strive to make
the kernel calling convention as uniform as possible.
So it would not be surprising if this new system call were to evolve
somewhat before making its way into the mainline - it's a new and
somewhat tricky API which could certainly benefit from discussion. But
there are some real needs driving this work. So
chances are that indirect() will eventually show up, in some form,
in mainline kernels.
Supporting electronic paper
November 19, 2007
This article was contributed by Jaya Kumar
The familiar CRT monitors or backlit LCD screens on our desks continuously
consume power in order to hold an image. Electronic paper (e-paper) is
different: power is only needed to change the image. Just like paper,
e-paper is able to hold the image permanently without consuming any power.
Displays using CRT, backlit LCD, plasma and OLED technologies are all
emissive, meaning that they have to produce the photons that reach the eye.
This implies that they have to compete in brightness with ambient lighting,
which can result in eye strain. E-paper is the opposite: it is reflective,
which makes it possible to read the display using ambient light even in the
brightness of a hot sunny day.
E-paper is referred to as a bistable or non-volatile technology because of
its ability to hold a specific pixel state without power. There are several
variations of e-paper; they differ in terms of which physical mechanism
is used to achieve the non-volatility of the display. These mechanisms
include interferometric
modulation, bi-stable twisted nematic liquid
crystal [PDF], cholesteric
liquid crystal, and electrophoretic
phenomena.
Interferometric modulation uses the same principle of light wave
interference that results in the rainbow of colors seen with oil floating on
water. Control of wave interference through bi-stable or multi-stable
micro-electro-mechanical systems (MEMS) is what enables electronic control
of the color of a pixel.
In standard twisted nematic liquid crystal displays (TNLCD), the liquid
crystal is sandwiched between two rubbed polymer orthogonal alignment
layers. Bi-stable twisted nematic implementations such as Zenithal liquid
crystal replace one or both alignment layers with a sub-micron
relief profile that weakens anchoring to the surface and makes it possible
to latch various stable orientations of a liquid crystal pixel using
electrical pulses.
Cholesteric liquid crystal provides the ability to selectively reflect
various ranges of wavelengths of visible light based on the pitch of the
liquid crystal. The pitch can then be electronically controlled to set
various pixel states.
Electrophoresis describes the fact that particles within a fluid can be
kinetically affected by an electrical field. Basically, applying a voltage
pulse causes pigment particles within a solvent solution to move. This
concept is what is used to control whether a pixel appears black, white or a
shade of gray. This article will focus on electrophoretic displays
since they are relatively easy to obtain.
Controllers
Traditional display controllers are interfaced to the host using a bus such
as PCI Express or AGP. These controllers have local framebuffer memory or
sufficient internal line buffering to utilize shared host memory; they
expose their framebuffers through memory mappable regions.
Display servers like Xorg or Xfbdev that utilize the kernel's fbdev
interface expect to be able to mmap() the device framebuffer. The
implication is that a driver that implemented only write()/seek() access to
the framebuffer would have limited usage.
Electrophoretic displays require specialized controllers that are capable of
driving suitable waveforms in order to control the display media. This is
because of subtle issues around the behavior of pigment particles within a
solvent. The controller must drive waveforms that result in fast,
reproducible and optimal movement of pigment particles. These waveforms are
a key factor in minimizing pixel update latency, achieving good contrast and
reducing ghosting effects in the output image. Currently, electrophoretic
display updates are significantly slower than CRT or LCD display updates.
For example, a grayscale update of E-Ink's most recent Vizplex display
material can take up to 740ms. This latency has an effect on how hardware is
interfaced with electrophoretic display controllers and how software should
then interact with the display.
One of the electrophoretic display controllers for which Linux support has
been posted (tarball) is
a controller from E-Ink called Apollo. This controller is
interfaced to the host through 8-bit data and 6-bit control over General
Purpose IO (GPIO) interfaces. The implication of the use of GPIO is that it
is not a memory mappable interface. Each pixel of the framebuffer has to be
wiggled to the controller by turning individual GPIO lines on and off. Display
updates on the Apollo with an E-Ink 6" panel with a resolution of 800x600
and 2 bits of grayscale require between 500ms and 1200ms. Given this set of
circumstances, it would have been an option to implement a userspace library
or support code that performed the GPIO wiggling. However, such an
implementation would forfeit support from Xfbdev and other common
fbdev compatible applications.
An early driver implementation has also been posted for an E-Ink controller
named Metronome.
This controller interfaces to the host using the Active
Matrix LCD (AMLCD) bus. AMLCD is a 16-bit data bus used to interface LCDs
with CPUs. Normally, the AMLCD bus is used to transfer video display
data only, but, in the case of the Metronome controller, the host transfers a whole
slew of things including waveform, command and image data. The Metronome
becomes a secondary display controller feeding on the output of the primary
display controller on the host. Since AMLCD is an output-only data path, two
GPIO pins are used to retrieve status from the controller.
Many embedded processors provide a built-in LCD controller (LCDC) that is
compatible with the AMLCD interface. For example, the Xscale pxa255 CPU has
an LCDC that has DMA support and is able to pull data directly from host
memory at specified intervals. This type of capability allows drivers to
remap host memory to form an mmap-able framebuffer. However, the Metronome
controller imposes an additional requirement beyond delivering image data
for each display update. This is the need for a specific display update
command that has to be formed and set each time the display is to be
updated. This means that the framebuffer driver needs to know when the
framebuffer has been updated. That is not a trivial task because the nature
of a memory-mapped framebuffer is that the driver is not involved in
changes to the buffer; it is
therefore unaware of when the framebuffer has been written to by a userspace
application.
The three problems described so far can therefore be summarized as follows:
1. How to memory map a "non memory mappable" IO interface like GPIO.
2. How to mitigate the latency associated with display updates.
3. How to cheaply detect when userspace has written to a memory mapped
   address.
One early solution to problem 3 was to use a timer and perform framebuffer
differencing to detect the changed pixels. The negative aspects of this
solution are that it requires a large amount of redundant memory and
significant CPU and memory bandwidth consumption every time that framebuffer
differencing is done. Both of these resources are scarce on embedded systems
and, therefore, that solution was not satisfactory.
Deferred IO
Deferred IO is an alternative method of solving these problems. The key concept
behind deferred IO is that one can periodically mark an active page of host
memory as read-only in order to catch writes to it. The way it works is quite
straightforward: page table entries for framebuffer pages in host memory are
initially marked as read-only. When the application first writes to any memory
address that maps to any of those pages, a deferred IO specific page fault
handler is reached. This handler schedules a delayed workqueue job, and the
page becomes writable once the fault has been handled. In the interval
before this workqueue job is executed, the application can therefore continue
to write to that page with no additional cost.
When the workqueue task executes, it marks
the page table entry as read-only again and then processes the framebuffer data
stored in that page. At that point, the processed data can be delivered to the
device through its native IO interface, which could be GPIO, AMLCD, USB, or
anything else. Since the page was re-marked to read-only, the sequence would
repeat if the application ever rewrote that page. This is somewhat similar
to a writeback cache. Host memory is used as a cache for device memory or
any output destined for the device. The page fault is then used as a
trigger to determine when to actually "writeback" this memory to the device.
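The same trick can be tried out in user space with mprotect() and a SIGSEGV handler. The sketch below is only an analogy, not the kernel implementation: the defio_ names are hypothetical, the "framebuffer" is a small anonymous mapping, and defio_flush() stands in for the delayed workqueue job.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS */
#endif
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 4

static unsigned char *fb;                    /* stand-in framebuffer in host memory */
static size_t page_size;
static volatile sig_atomic_t dirty[NPAGES];  /* pages written since the last flush */
static volatile sig_atomic_t faults;         /* number of write faults taken */

/* Write fault: note which page was hit, then let the write proceed. */
static void on_write_fault(int sig, siginfo_t *info, void *ctx)
{
    uintptr_t base = (uintptr_t)fb;
    uintptr_t addr = (uintptr_t)info->si_addr;
    size_t idx;

    (void)sig;
    (void)ctx;
    if (addr < base || addr >= base + NPAGES * page_size)
        _exit(1);                            /* a genuine crash, not our buffer */
    idx = (addr - base) / page_size;
    dirty[idx] = 1;
    faults++;
    /* Not formally async-signal-safe, but a common idiom on Linux. */
    mprotect(fb + idx * page_size, page_size, PROT_READ | PROT_WRITE);
}

/* Map the buffer read-only and install the fault handler. */
int defio_init(void)
{
    struct sigaction sa;

    page_size = (size_t)sysconf(_SC_PAGESIZE);
    fb = (unsigned char *)mmap(NULL, NPAGES * page_size, PROT_READ,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (fb == MAP_FAILED)
        return -1;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_write_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}

/* Application-side write; volatile keeps the store in program order. */
void defio_write(size_t offset, unsigned char value)
{
    ((volatile unsigned char *)fb)[offset] = value;
}

/*
 * Stand-in for the delayed work: re-protect every dirty page and return
 * how many pages would have been delivered to the device.
 */
int defio_flush(void)
{
    int delivered = 0;
    size_t i;

    for (i = 0; i < NPAGES; i++) {
        if (!dirty[i])
            continue;
        mprotect(fb + i * page_size, page_size, PROT_READ);
        dirty[i] = 0;
        delivered++;
    }
    return delivered;
}
```

Only the first write to a page after a flush takes a fault; further writes to the same page run at full speed until the next flush re-protects it.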
This technique solves problem 1 because host memory is used and can therefore be
memory mapped. The output from the application intended for the device is
written to host memory and, unlike hardware-supported memory-mapped IO,
this output is not transferred to the device for each memory write. It is
only after the driver-specified delay has expired that this collected data
is transferred to the device. The fact that the transfer would be through
GPIO or any other mechanism is transparent to the application and requires
no intervention.
The delay between the page fault and the IO is what addresses problem 2. The
application sees only a framebuffer which happens to be in host memory.
Writes to the framebuffer are therefore as fast as writes to any other part
of memory. The display update latency is therefore transparent to the
application. The driver specified interval should be selected to be
appropriate for the latency of the device. For example, if the device has a
one-second display update latency, then a one-second delay would be reasonable.
A longer delay would result in the display being less interactive than what
it was really capable of. A shorter delay would result in host updates
building up since the device would not be able to keep up. Applications that
require display synchronization primitives could use fsync() or the
FBIO_WAITFORVSYNC ioctl depending on their needs.
Problem 3 is solved because the address that is the cause of the page fault
is known. Internally, deferred IO uses the memory management subsystem's
page_mkwrite() callback and page_mkclean() to implement the core of its
functionality. The current deferred IO implementation passes a list of
page structures to the framebuffer driver's deferred IO callback. The driver
can then use page->index to identify which part of the framebuffer was
written to. This provides PAGE_SIZE granularity in identifying the updated
pixels.
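As a rough illustration of what PAGE_SIZE granularity means for a driver, the helper below converts a page index into the range of scanlines that the page overlaps. It is a sketch only: the page_to_lines() name and the struct are hypothetical, and the layout (a linear framebuffer with a fixed number of bytes per scanline) is an assumption rather than a description of any particular driver.

```c
#include <stddef.h>

/* Inclusive range of scanlines overlapped by one page. */
struct line_span {
    size_t first;
    size_t last;
};

/*
 * page_index: index of the written page within the framebuffer
 * page_size:  size of a page in bytes
 * stride:     bytes per scanline (assumed constant)
 * height:     number of scanlines in the framebuffer
 */
struct line_span page_to_lines(size_t page_index, size_t page_size,
                               size_t stride, size_t height)
{
    struct line_span span;
    size_t start = page_index * page_size;   /* first byte of the page */
    size_t end = start + page_size - 1;      /* last byte of the page */

    span.first = start / stride;
    span.last = end / stride;
    if (span.last > height - 1)
        span.last = height - 1;              /* the last page may overhang */
    return span;
}
```

With 4096-byte pages and an assumed 800-byte scanline, page 0 overlaps lines 0 through 5 and page 1 overlaps lines 5 through 10, so a single written page forces several full scanlines to be treated as changed.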
Status
This method works fine in common use cases. For example, if one were to run
xpdf and use page-up to flip through pages, then xpdf would render to the
framebuffer in host memory on each page-up. Then, at the end of each write
induced interval, the driver would deliver the current image to the display.
This would give the effect where one would be seeing the most recent page on
the display rather than every single page that had been flipped through.
This enables the system to be reasonably interactive. Applications like
xclock (an analog clock ticking every second) as well as playback
applications (displaying a slider showing playback position) behave in a
similar fashion.
Deferred IO support was merged into the Linux kernel in 2.6.22;
Documentation/fb/deferred_io.txt contains additional information. The driver
for the Apollo controller was also merged in 2.6.22 and is in
drivers/video/hecubafb.c. The driver for the Metronome controller
is posted
but not yet complete; it also
includes necessary bugfixes for deferred IO.
The current development focus is on the Metronome controller. It is being
tested with a Gumstix Connex board which has an Xscale pxa255 CPU. The
display media that is being used is an E-Ink Vizplex 6" 800x600 panel with 3
bits of grayscale. The metronomefb driver for this controller uses deferred
IO and is still a work-in-progress but it is capable of running Xfbdev. X
clients such as xclock, xeyes, xlogo and xloadimage have been run without
problems. It is not yet clear how to measure framebuffer performance on such
a system; the reason for this is that most display benchmarks use the time
for a drawing operation to complete as the basis for performance statistics.
On this system, such a benchmark would be merely measuring time to render to
host memory rather than time to deliver to the actual display. It may be
necessary to develop an alternate method of measuring display system
performance for e-paper displays.
All is not yet perfect.
Applications that render images that affect only a small number of pixels
but cross multiple pages because of the framebuffer layout (e.g., a thin
vertical image) result in reduced efficiency. This is because the ratio of
changed pixels to the number of written pages is low.
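The effect is easy to quantify with a little arithmetic. The function below counts how many distinct pages a rectangle touches; it is a sketch that assumes a linear framebuffer with one byte per pixel (so that x and w are also byte offsets), and the pages_touched() name is hypothetical.

```c
#include <stddef.h>

/*
 * Count the distinct pages touched by a w-by-h rectangle whose top-left
 * corner is at (x, y). Assumes one byte per pixel, w >= 1, and a stride
 * of at least x + w bytes.
 */
size_t pages_touched(size_t x, size_t y, size_t w, size_t h,
                     size_t stride, size_t page_size)
{
    size_t count = 0;
    size_t next_uncounted = 0;   /* lowest page index not yet counted */
    size_t row;

    for (row = y; row < y + h; row++) {
        size_t first = (row * stride + x) / page_size;
        size_t last = (row * stride + x + w - 1) / page_size;

        /* Rows are visited in address order, so skip pages already seen. */
        if (first < next_uncounted)
            first = next_uncounted;
        if (last >= first) {
            count += last - first + 1;
            next_uncounted = last + 1;
        }
    }
    return count;
}
```

Under these assumptions, with 4096-byte pages and an 800-byte stride, a one-pixel-wide vertical line running down all 600 rows touches 117 pages, while a 600-pixel horizontal line starting at the origin touches just one, even though both change the same number of pixels.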
The architectural weakness of deferred IO is that it depends on the system
having an MMU. It may be possible to implement a similar approach using the
lower level memory protection capabilities that are available on some no-MMU
systems. For example, the Blackfin architecture has a Data Cacheability
Protection Lookaside Buffer (DCPLB) that has notions of read/write
permissions on its entries. This will be an interesting area for future
exploration.
The current implementation only works with framebuffers allocated from
virtual memory. Support needs to be implemented to achieve the same functionality
with memory obtained from kmalloc() or the DMA layer.
There have been suggestions that this technique may be useful in other
areas. One scenario that has been mentioned is optimizing display bandwidth
consumption by switching between DMA and plain memory copies based on the
number of written pages. Another scenario is USB-to-VGA adapters. It may
also be the case that any device connected via a relatively slow bus where
the data flow is primarily output could benefit from a similar approach.
Acknowledgments:
the author is grateful to E-Ink engineers for their extensive support and
hardware help, Peter Zijlstra, Antonino Daplas, Paul Mundt, Geert Uytterhoeven,
Hugh Dickins, James Simmons and others for mm, fbdev, and general help.
PID namespaces in the 2.6.24 kernel
November 19, 2007
This article was contributed by Pavel Emelyanov and Kir Kolyshkin
One of the new features in the upcoming 2.6.24 kernel will be the PID
namespaces support developed by the OpenVZ team with the help of IBM.
The PID namespace allows for creating sets of tasks, with each such set looking
like a standalone machine with respect to process IDs. In other words,
tasks in different namespaces can have the same IDs.
This feature is the major prerequisite for the migration of containers between
hosts; having a namespace, one may move it to another host while keeping the PID
values -- and this is a requirement since a task is not expected to change
its PID. Without this feature, the migration will very likely fail, as
the processes with the same IDs can exist on the destination node, which
will cause conflicts when addressing tasks by their IDs.
PID namespaces are hierarchical; once a new PID namespace is created,
all the tasks in the current PID namespace will see the tasks (i.e. will
be able to address them with their PIDs) in this new namespace. However,
tasks from the new namespace will not see those in the current namespace.
This means that now each task has more than one PID -- one for each namespace.
User-space API
To create a new namespace, one should just call the clone(2)
system call with the CLONE_NEWPID flag set.
After this, it is useful to change the root directory and mount
a new procfs instance on /proc to make common utilities
like ps work.
Note that since the parent knows the PID of its child, it may
wait() in the usual way for it to exit.
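A minimal sketch of this sequence is shown below. It uses the glibc clone() wrapper with CLONE_NEWPID and lets the parent collect the child with waitpid(); the function names are hypothetical, the chroot and procfs mount steps are left out, and the call is expected to fail (returning -1 here) when the caller lacks the privileges needed to create a namespace.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* for clone() and CLONE_NEWPID */
#endif
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILD_STACK_SIZE (256 * 1024)

/* Runs as the first task of the new namespace; reports the PID it sees. */
static int child_main(void *arg)
{
    (void)arg;
    return (int)getpid() & 0xff;   /* becomes the child's exit status */
}

/*
 * Create a child in a new PID namespace and return the PID that the child
 * sees for itself, or -1 if the namespace could not be created.
 */
int pid_seen_in_new_namespace(void)
{
    char *stack = (char *)malloc(CHILD_STACK_SIZE);
    pid_t child;
    int status;

    if (stack == NULL)
        return -1;

    /* The stack grows downward on most architectures, so pass its top. */
    child = clone(child_main, stack + CHILD_STACK_SIZE,
                  CLONE_NEWPID | SIGCHLD, NULL);
    if (child < 0) {
        free(stack);
        return -1;
    }

    /* The parent addresses the child by the PID from its own namespace. */
    if (waitpid(child, &status, 0) < 0) {
        free(stack);
        return -1;
    }
    free(stack);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

When the namespace is created successfully, the value returned by clone() in the parent is an ordinary PID in the parent's namespace, while the child itself sees a different number for the same task.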
The first task in a new namespace will have a PID of 1. Thus, it
will be this namespace's init and child reaper, so all the orphaned
tasks will be re-parented to it. Unlike the standalone machine, this "init"
can die, and in this case, the whole namespace will be terminated.
Since now we will have isolated sets of tasks, we should make proc
show only the set of PIDs which is visible for a particular task. To achieve
this goal, procfs should be mounted multiple times -- once
for each namespace. After this the PIDs that are shown in the mounted instance
will be from the namespace which created that mount.
For example, a user may create some new proc_2 directory,
spawn a PID namespace and mount a procfs to it. After this, the
user will be able to see the PIDs as they appear inside this new namespace.
There will be the PID number 1, which is the namespace's init,
and all the other PIDs may coincide with some PIDs from the current namespace,
but refer to some other task.
No other changes in the user API are necessary. Tasks still have the ability to
get their PIDs, PGIDs, etc. with the known system calls. They can also
work with sessions and groups. Tasks may create threads and work with futexes.
Internal API
All the PIDs that a task may have are described in the struct pid.
This structure contains the ID value, the list of tasks having this ID,
the reference counter and the hashed list node to be stored in the
hash table for a faster search.
A few more words about the lists of tasks. Basically a task has three PIDs:
the process ID (PID), the process group ID (PGID), and the
session ID (SID). The PGID and the SID may be shared between the tasks,
for example, when two or more tasks belong to the same group, so each
group ID addresses more than one task.
With the PID namespaces this structure becomes elastic. Now, each PID
may have several values, with each one being valid in one namespace. That is,
a task may have PID of 1024 in one namespace, and 256 in another. So, the
former struct pid changes.
Here is what struct pid looked like before the introduction of
the PID namespaces:
    struct pid {
        atomic_t count;                         /* reference counter */
        int nr;                                 /* the pid value */
        struct hlist_node pid_chain;            /* hash chain */
        struct hlist_head tasks[PIDTYPE_MAX];   /* lists of tasks */
        struct rcu_head rcu;                    /* RCU helper */
    };
And this is how it looks now:
    struct upid {
        int nr;                         /* moved from struct pid */
        struct pid_namespace *ns;       /* the namespace this value
                                         * is visible in
                                         */
        struct hlist_node pid_chain;    /* moved from struct pid */
    };

    struct pid {
        atomic_t count;
        struct hlist_head tasks[PIDTYPE_MAX];
        struct rcu_head rcu;
        int level;                      /* the number of upids */
        struct upid numbers[0];
    };
As you can see, the struct upid now represents the PID
value -- it is stored in the hash and has the PID value.
To convert the struct pid to the PID or vice versa one may
use a set of helpers like task_pid_nr(), pid_nr_ns(),
find_task_by_vpid(), etc.
All these calls have some information in their names:
..._nr()
- These operate with the so-called "global" PIDs.
Global PIDs are the numbers that are unique in the whole system, just
like the old PIDs were. E.g.
pid_nr(pid) will tell you the
global PID of the given struct pid. These are only useful
when the PID value is not going to leave the kernel. For example, some code
needs to save the PID and then find the task by it. However, in this
case saving a direct pointer to the struct pid is
preferable, as global PIDs are going to be used in kernel logs only.
..._vnr()
- These helpers work with the "virtual" PID, i.e.
with the ID as seen by a process. For example,
task_pid_vnr(tsk)
will tell you the PID of a task, as this task sees it (with
sys_getpid()). Note that this value will most likely be
useless if you're working in another namespace, so these helpers should only be used when working
with the current task, since every task always sees its own virtual PID.
..._nr_ns()
- These work with the PIDs as seen from the specified
namespace. If you want to get some task's PID (for example, to report it to
the userspace and find this task later), you may call
task_pid_nr_ns(tsk, current->nsproxy->pid_ns) to get
the number, and then find the task using
find_task_by_pid_ns(pid, current->nsproxy->pid_ns).
These are used in system calls, when the PID comes from the user
space. In this case one task may address another which exists in
another namespace.
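The three families of helpers can be illustrated with a small user-space model. Everything prefixed with model_ below is hypothetical: the structures are simplified stand-ins for struct pid and struct upid, the namespace's level field and the fixed-size numbers array are assumptions made for the sketch, and the array is indexed by namespace depth with entry 0 belonging to the initial namespace.

```c
#include <stddef.h>

#define MODEL_MAX_DEPTH 4

struct model_pid_namespace {
    int level;                              /* 0 for the initial namespace */
    struct model_pid_namespace *parent;
};

struct model_upid {
    int nr;                                 /* the number seen in ns */
    struct model_pid_namespace *ns;
};

struct model_pid {
    int level;                              /* depth of the task's own namespace */
    struct model_upid numbers[MODEL_MAX_DEPTH];
};

/* ..._nr_ns(): the PID as seen from ns, or 0 if it is not visible there. */
int model_pid_nr_ns(const struct model_pid *pid,
                    const struct model_pid_namespace *ns)
{
    if (pid != NULL && ns->level <= pid->level &&
        pid->numbers[ns->level].ns == ns)
        return pid->numbers[ns->level].nr;
    return 0;
}

/* ..._nr(): the "global" PID, i.e. the number in the initial namespace. */
int model_pid_nr(const struct model_pid *pid)
{
    return pid != NULL ? pid->numbers[0].nr : 0;
}

/* ..._vnr(): the "virtual" PID, i.e. the number in the task's own namespace. */
int model_pid_vnr(const struct model_pid *pid)
{
    return pid != NULL ? pid->numbers[pid->level].nr : 0;
}
```

Using the numbers from the example earlier in the article, a task with PID 1024 in the initial namespace and 256 in a child namespace yields 1024 from model_pid_nr() and 256 from model_pid_vnr(), while a lookup from an unrelated sibling namespace returns 0 because the task is not visible there.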
Conclusion
The interface as described here has been merged for the 2.6.24 kernel
release. It has, however, been marked as "experimental" to prevent its
wide deployment by distributors while some remaining issues are worked
out. Few, if any, changes to this API are expected between now and when
the "experimental" tag is removed in a later kernel release.
Page editor: Jonathan Corbet