The current 2.6 prepatch is 2.6.24-rc2, released, somewhat belatedly, on
November 6. "There was nothing in particular holding this thing
up, I just basically forgot to cut a -rc2 release last week."
Patches merged since -rc1 are mostly fixes, but there are also some DCCP
improvements, a flag to silence warnings about use of deprecated
interfaces, an asynchronous event notification API for SCSI/SATA (it tells
the system when a removable disk has been inserted), a Japanese
translation of the SubmittingPatches document, an ATA link power management
API, and a bit more x86 unification work. See the short-form changelog
for a list of patches, or the long-format changelog for the details.
As of this writing, no patches have found their way into the mainline git
repository since the -rc2 release.
For older kernels: two 2.6.23 stable updates came out on November 2
and 5, respectively. These releases contain a number of patches,
including one security-related fix for people running Minix
filesystems. Greg Kroah-Hartman has
recently said that, contrary to previous
indications, the 2.6.22.x series will continue for a while yet.
Two 2.6.22 stable updates were released on
November 1 and 5, respectively. They contain quite a few fixes,
several of which have vulnerability numbers associated with them.
Comments (none posted)
Kernel development news
We've already got way too many incomplete concepts and APIs in the
kernel. Maybe i'm over-worrying, but i fear we end up like with
capabilities or sendfile - code merged too soon and never completed
for many years - perhaps never completed at all. VMS and WNT did
those things a bit better i think - their API frameworks were/are
pervasive and complete, even in the corner cases.
-- Ingo Molnar
[W]hat concerns me is that stringbuf was good, but not great. Yet I
always think of the kernel as a bastion of really good C code and
practice, carefully honed by thoughtful coders. Here even the
unmeasured optimization attempts show a lack of effort on the part
of experienced kernel coders.
What makes the code in the kernel so great is not that it goes in
perfect, it's that we whittle all code in the tree down over and
over again until it reaches it's perfect form. It is this
whittling system that allows us to thrust 25MB of changes into a
tree over a week and a half and it all works out.
Comments (none posted)
The management of video hardware has long been an area of weakness in the
Linux system (and free operating systems in general). The X Window System
tends to get a lot of the blame for problems in this area, but the truth of
the matter is that the problems are more widespread and the kernel has
never made it easy for X to do this job properly. Graphics processors (GPUs) have
gotten steadily more powerful, to the point that, by some measures, they
are the fastest processor on most systems, but kernel support for the
programming of GPUs has lagged behind. A lot of work is being done to remedy
this situation, though, and an important component of that work has just
been put forward for inclusion into the mainline kernel.
Once upon a time, video memory comprised a simple frame buffer from which
pixels were sent to the display; it was up to the system's CPU to put
useful data into that frame buffer. With contemporary GPUs, the memory
situation has gotten more complex; a typical GPU can work with a few
different types of memory:
- Video RAM (VRAM) is high-speed memory installed directly on the video
card. It is usually visible on the system's PCI bus, but that need
not be the case. There is likely to be a frame buffer in this memory,
but many other kinds of data live there as well.
- In many lower-end systems, the "video RAM" is actually a dedicated
section of general-purpose system memory. That RAM is set aside for
the use of the GPU and is not available for other purposes. Even
adapters with their own VRAM may have a dedicated RAM region as well.
- Video adapters contain a simple memory management unit (the graphics
address remapping table or GART) which can be used to map various
pages of system memory into the GPU's address space. The result is
that, at any time, an arbitrary (scattered) subset of the system's RAM
pages are accessible to the GPU.
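The GART's job can be illustrated with a small Python sketch; this is purely a conceptual model (the class and method names are invented for illustration, not kernel code) showing how scattered system pages are presented to the GPU as a contiguous range:

```python
# Conceptual model of a GART: a table mapping contiguous GPU page
# indices to scattered system RAM pages.  Illustrative only.

class Gart:
    def __init__(self, entries):
        self.table = [None] * entries   # one slot per GPU-visible page

    def map_pages(self, system_pages):
        """Bind an arbitrary set of system pages to a contiguous GPU range."""
        if len(system_pages) > len(self.table):
            raise MemoryError("not enough GART entries")
        for gpu_index, sys_page in enumerate(system_pages):
            self.table[gpu_index] = sys_page
        return 0  # GPU-space index of the first mapped page

    def gpu_to_system(self, gpu_page):
        """Translate a GPU page index back to the system page it maps."""
        return self.table[gpu_page]

gart = Gart(entries=8)
offset = gart.map_pages([1042, 77, 65531])   # scattered system pages
print(gart.gpu_to_system(0))   # 1042
print(gart.gpu_to_system(2))   # 65531
```

The point of the model is that the pages need not be contiguous (or even ordered) in system RAM; the GPU sees a linear range regardless.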
Each type of video memory has different characteristics and constraints.
Some are faster to work with (for the CPU or the GPU) than others. Some types
of VRAM might not be directly addressable by the CPU. Memory may or may not
be cache coherent - a distinction which requires careful programming to
avoid data corruption and performance problems. And graphical applications
may want to work with much larger amounts of video memory than can be made
visible to the GPU at any given time.
All of this presents a memory management problem which, while being similar
to the management needs of system RAM, has its own special constraints. So
the graphics developers have been saying for years that Linux needs a
proper manager for GPU-accessible memory. But, for years, we have done
without that memory manager, with the result that this management task has
been performed by an ugly combination of code in the X server, the kernel,
and, often, in proprietary drivers.
Happily, it would appear that those days are coming to an end, thanks to
the creation of the translation-table maps (TTM) module written primarily
by Thomas Hellstrom, Eric Anholt, and Dave Airlie. The TTM code provides a
general-purpose memory manager aimed at the needs of GPUs and graphical
applications.
The core object managed by TTM, from the point of view of user space, is
the "buffer object." A buffer object is a chunk of memory allocated by an
application, and possibly shared among a number of different applications.
It contains a region of memory which, at some point, may be operated on by
the GPU. A buffer object is guaranteed not to vanish as long as some
application maintains a reference to it, but the location of that buffer is
subject to change.
Once an application creates a buffer object, it can map that object into
its address space. Depending on where the buffer is currently located,
this mapping may require relocating the buffer into a type of memory which
is addressable by the CPU (more accurately, a page fault when the
application tries to access the mapped buffer would force that move).
Cache coherency issues must be handled as well, of course.
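A buffer object's lifecycle can be modeled in a few lines of Python. This is a toy sketch, not the real API: the names and the placement strings are assumptions, and the real code relocates buffers lazily via page faults rather than in a method call:

```python
# Toy model of a TTM-style buffer object: reference-counted memory
# whose current placement can change behind the application's back.

class BufferObject:
    def __init__(self, size):
        self.size = size
        self.refcount = 1               # object persists while referenced
        self.placement = "system"       # "system", "vram", "vram-unmappable"

    def map_for_cpu(self):
        # A CPU mapping may force a move into CPU-addressable memory;
        # in the real code this happens on the first page fault.
        if self.placement == "vram-unmappable":
            self.placement = "system"
        return self.placement

bo = BufferObject(size=4096)
bo.placement = "vram-unmappable"        # GPU put it somewhere the CPU can't see
print(bo.map_for_cpu())                 # relocated to CPU-visible memory
```

The key property being modeled: the application holds a stable handle, while the actual storage backing it is free to move.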
There will come a time when this buffer must be made available to the GPU
for some sort of operation. The TTM layer provides a special "validate"
ioctl() to prepare buffers for processing; validating a buffer
could, again, involve moving it or setting up a GART mapping for it. The
address by which the GPU will access the buffer will not be known until it
is validated; after validation, the buffer will not be moved out of the
GPU's address space until it is no longer being operated on.
That means that the kernel has to know when processing on a given buffer
has completed; applications, too, need to know that.
To this end, the TTM layer provides "fence" objects. A fence is
a special operation which is placed into the GPU's command FIFO. When the
fence is executed, it raises a signal to indicate that all instructions
enqueued before the fence have now been executed, and that the GPU will no
longer be accessing any associated buffers. How the signaling works is
very much dependent on the GPU; it could raise an interrupt or simply write
a value to a special memory location. When a fence signals, any associated
buffers are marked as no longer being referenced by the GPU, and any
interested user-space processes are notified.
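The fence mechanism can be sketched as a command FIFO shared by commands and fences; this is a conceptual simulation (the data structures are invented, and a real GPU signals via interrupt or memory write rather than a method call):

```python
# Conceptual model of fences: when the GPU retires a fence, every
# buffer referenced by commands queued before it is known to be idle.

from collections import deque

class CommandFifo:
    def __init__(self):
        self.fifo = deque()
        self.idle_buffers = set()

    def submit(self, command, buffers):
        self.fifo.append(("cmd", command, buffers))

    def emit_fence(self):
        # The fence covers every buffer referenced so far.
        fence = {"signaled": False, "buffers": set()}
        for kind, _, bufs in self.fifo:
            if kind == "cmd":
                fence["buffers"].update(bufs)
        self.fifo.append(("fence", fence, None))
        return fence

    def gpu_execute_all(self):
        # The GPU drains the FIFO in order; executing a fence signals it.
        while self.fifo:
            kind, item, bufs = self.fifo.popleft()
            if kind == "fence":
                item["signaled"] = True
                self.idle_buffers.update(item["buffers"])

fifo = CommandFifo()
fifo.submit("draw", {"bo1", "bo2"})
fence = fifo.emit_fence()
fifo.gpu_execute_all()
print(fence["signaled"], sorted(fifo.idle_buffers))  # True ['bo1', 'bo2']
```

Because the FIFO is executed in order, a signaled fence is a guarantee about everything queued before it, which is what lets both the kernel and user space learn when a buffer is safe to move or reuse.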
A busy system might feature a number of graphical applications, each of
which is trying to feed buffers to the GPU at any given time. It is not at
all unlikely that the demands for GPU-addressable buffers will exceed the
amount of memory which the GPU can actually reach. So the TTM layer will
have to move buffers around in response to incoming requests. For
GART-mapped buffers, it may be a simple matter of unmapping pages from
buffers which are not currently validated for GPU operations. In other
cases, the contents of the buffers may have to be explicitly copied to
another type of memory, possibly using the GPU's hardware to do so. In such cases,
a buffer must first be invalidated in the page tables of any user-space
process which has mapped it, to ensure that the buffer will not be written
to during the move. In other words, the TTM really does become an
extension of the system's memory management code.
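The ordering constraint described above (tear down CPU mappings first, then move the data) can be shown in a minimal sketch; the structures here are invented for illustration:

```python
# Sketch of eviction ordering: before a buffer's contents are copied
# to another memory type, all user-space mappings are torn down so no
# process can write to it mid-move.  Conceptual model only.

def evict(buffer, mappings, destination):
    # Step 1: invalidate every user-space mapping of this buffer.
    for process in list(mappings.get(buffer["name"], [])):
        mappings[buffer["name"]].remove(process)
        # a real kernel would flush TLBs here as well
    # Step 2: with no mappings left, the copy cannot race with writes.
    buffer["placement"] = destination
    return buffer

mappings = {"bo1": {"Xorg", "compositor"}}   # hypothetical mappers
bo = {"name": "bo1", "placement": "gart"}
evict(bo, mappings, "system")
print(bo["placement"], mappings["bo1"])      # system set()
```

Any process that touches the buffer after this point takes a fresh page fault, which re-maps the buffer in its new location.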
The next question which is bound to come up is: what happens when graphical
applications want to use more video memory than the system as a whole can
provide? Normal system RAM pages which are used as video memory are locked
in place (and unavailable for other uses), so there must be a clear limit
on the number of such pages which can be created. The current solution to
this problem is to cap the number of such pages at a fraction of the available low memory
- up to 1GB on a 4GB, 32-bit system. It would be nice to be able to extend
this memory by writing unused pages to swap, but the Linux swap implementation is
not meant to work with pages owned by the kernel. The long-term plan would
appear to be to let the X server create a large virtual range with
mmap(), which would then be swappable. That functionality has not
yet been implemented, though.
There is a lot more to the TTM code than has been described here; some more
information can be found in this TTM overview [PDF].
For the time being, this code works with a patched version of the Intel
i915 driver, with other drivers to be added in the future. TTM has been
proposed for inclusion into -mm now and merger into the mainline for
2.6.25. The main issue between now and then will be the evaluation of the
user-space API, which will be hard to change once this code is merged.
Unfortunately, documentation for this API appears to be scarce at the moment.
Comments (7 posted)
Last week's article
discussed process ID namespaces. The purpose of these
namespaces is to manage which processes are
visible to a process inside a container. The heavy use of PIDs to identify
processes has caused this particular patch to go through a long period of
development before being merged for 2.6.24. It appears that there are some
remaining issues, though, which could prevent this feature from being
available in the next kernel release. As is often the case, the biggest
problems come down to user-space API issues.
On November 1, Ingo Molnar pointed out that
questions raised by Ulrich Drepper back in early 2006 remained
unanswered. These questions all have to do with what happens when the use
of a PID escapes the namespace that it belongs to. There are a number of
kernel APIs related to interprocess communication and synchronization where
this could happen. Realtime signals carry process ID information, as do
SYSV message queues. At best, making these interfaces work properly across
PID namespaces will require that the kernel perform magic PID translations
whenever a PID crosses a namespace boundary.
The biggest sticking point, though, would appear to be the robust futex
mechanism, which uses PIDs to track which process owns a specific futex at
any given time. One of the key points behind futexes is that the fast
acquisition path (when there is no contention for the futex) does not
require the kernel's involvement at all. But that acquisition path is also
where the PID
field is set. So there is no way to let the kernel perform magic PID
translation without destroying the performance feature that was the
motivation for futexes in the first place.
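The problem is easiest to see in a simulation of the fast path. In a robust futex, the lock word holds the owner's thread ID, and an uncontended acquisition is a single compare-and-swap performed entirely in user space; the model below is illustrative Python, not the real implementation:

```python
# Simulation of the robust-futex fast path: the lock word stores the
# owner's TID, written by a user-space compare-and-swap.  The kernel
# never sees this store, so it has no chance to translate the PID
# between namespaces.

FUTEX_UNLOCKED = 0

def cmpxchg(mem, addr, expected, new):
    """Atomic compare-and-swap (modeled; real code is a CPU instruction)."""
    if mem[addr] == expected:
        mem[addr] = new
        return True
    return False

def futex_lock_fast(mem, addr, tid):
    # Fast path: no system call when there is no contention.
    return cmpxchg(mem, addr, FUTEX_UNLOCKED, tid)

memory = {0x1000: FUTEX_UNLOCKED}
assert futex_lock_fast(memory, 0x1000, tid=1234)      # acquired, no kernel
assert not futex_lock_fast(memory, 0x1000, tid=5678)  # contended: slow path
print(memory[0x1000])   # 1234 -- a raw PID, opaque to the kernel
```

Forcing the store through the kernel so that the PID could be translated would turn every uncontended lock acquisition into a system call, which is exactly the cost futexes were designed to avoid.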
Ingo, Ulrich, and others who are concerned about this problem would like to
see the PID namespace feature completely disabled in the 2.6.24 release so
that there will be time to come up with a proper solution. But it is not
clear what form that solution would take, or if it is even necessary.
The approach seemingly favored by Ulrich is
to eliminate some of the fine-grained control that the kernel currently
provides over the sharing of namespaces. With the 2.6.24-rc1 interface, a
process which calls clone() can request that the child be placed
into a new PID namespace, but that other namespaces (filesystems, for
example, or networking) be shared. That, says Ulrich, is asking for trouble:
This whole approach to allow switching on and off each of the
namespaces is just wrong. Do it all or nothing, at least for the
problematic ones like NEWPID. Having access to the same filesystem
but using separate PID namespaces is simply not going to work.
Coalescing a number of the namespace options into a single "new container" bit
would help with the current shortage of clone bits. But it might well not
succeed in solving the API issues. Even processes with different
filesystem namespaces might be able to find the same futex via a file
visible in both namespaces. The passing of credentials over Unix-domain
sockets could throw in an interesting twist. And it would seem that there
are other places where PIDs are used that nobody has really thought of yet.
Another possible approach, one which hasn't really featured in the current
debate, would be to create globally-unique PIDs which would work across
namespaces. The current 32-bit PID value could be split into two fields,
with the most significant bits indicating which namespace the PID
(contained in the least significant bits) is defined in. Most of the time,
only the low-order part of the PID would be needed; it would be interpreted
relative to the current PID namespace. But, in places where it makes
sense, the full, unique PID could be used. That would enable features like
futexes to work across PID namespaces.
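The arithmetic behind such a split is simple; this sketch uses arbitrary field widths (the 10/22 split is an assumption for illustration, not a proposed ABI):

```python
# Sketch of the globally-unique PID idea: namespace ID in the high
# bits, namespace-local PID in the low bits of one 32-bit value.

NS_BITS = 10          # high bits: which namespace (assumed width)
LOCAL_BITS = 22       # low bits: PID within that namespace

def global_pid(ns_id, local_pid):
    assert ns_id < (1 << NS_BITS) and local_pid < (1 << LOCAL_BITS)
    return (ns_id << LOCAL_BITS) | local_pid

def split_pid(gpid):
    return gpid >> LOCAL_BITS, gpid & ((1 << LOCAL_BITS) - 1)

gpid = global_pid(ns_id=3, local_pid=4242)
print(split_pid(gpid))    # (3, 4242)
# Inside namespace 3, only the low-order 4242 would normally be visible;
# the full value would be used where cross-namespace identity matters.
```

The obvious cost is visible in the field widths: bits spent on the namespace ID come out of the space available for PIDs within each namespace.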
There are still problems, of course. The whole point of PID namespaces is
to completely hide processes which are outside of the current namespace;
the creation and use of globally-unique PIDs pokes holes in that
isolation. And there's sure to be some complications in the user-space API
which prove to be hard to paper over.
Then, there is the question of whether this problem is truly important or
not. Linus thinks not, pointing out that
the sharing of PIDs across namespaces is analogous to the use of
PIDs in lock files shared across a network. PID-based locking does not work on
NFS-mounted files, and PID-based interfaces will not work between PID
namespaces. Linus concludes:
I don't understand how you can call this a "PID namespace design
bug", when it clearly has nothing what-so-ever to do with pid
namespaces, and everything to do with the *futexes* that blithely
assume that pid's are unique and that made it part of the...
One could argue that the conflict with PID namespaces was known when the
robust futex feature was merged and that something could have been done at
that time. But that does not really help anybody now. And, in any case,
there are issues beyond futexes.
PID namespaces are a
significant complication of the user-space API; they redefine a basic
value which has had a well-understood meaning since the early days of
Unix. So it is not surprising that some interesting questions have come to
light. Getting solid answers to nagging API questions has not always been
the strongest point of the Linux development process, but things could
always change. With luck and some effort, these questions can be worked
through so that PID namespaces, when they become available, will have
well-thought-out and well-defined semantics in all cases and will support
the functionality that users need.
Comments (19 posted)
As the amount of RAM installed in systems grows, it would seem that
memory pressure should reduce, but, much like salaries or hard disk
space, usage grows to fill (or overflow) the available capacity. Operating
systems have dealt with this problem for decades by using virtual memory
and swapping, but techniques that work well with 4 gigabyte address spaces
may not scale well to systems with 1 terabyte. That scalability problem is
at the root of several different ideas for changing the kernel, from
supporting larger page
sizes to avoiding memory fragmentation.
Another approach to scaling up the memory management subsystem was
recently posted to linux-kernel by Rik van Riel. His
patch is meant to reduce the
amount of time the kernel spends looking for a memory page to evict when it
needs to load a new page. He lists two main deficiencies of the current
page replacement algorithm. The first is that it sometimes evicts the wrong
page; this cannot be eliminated, but its frequency might be reduced. The second is the heart of what he is trying to accomplish:
The kernel scans over pages that should not be evicted. On systems with a
few GB of RAM, this can result in the VM using an annoying amount of CPU.
On systems with >128GB of RAM, this can knock the system out for hours
since excess CPU use is compounded with lock contention and other issues.
A system with 1TB of 4K pages has 256 million pages to deal with.
Searching through the pages stored on lists in the kernel can take an
enormous amount of time. According to van Riel, most of that time is spent
searching pages that won't be evicted anyway, so in order to deal with
systems of that size, the search needs to focus in on likely candidates.
Linux tries to optimize its use of physical memory by keeping it full,
using any memory not needed by processes for caching file data in the page
cache. Determining which pages are not being used by processes and
striking a balance between the page cache and process memory is the job of
the page replacement algorithm. It is that algorithm that van Riel would
eventually like to see replaced.
The current set of patches, though, takes a smaller step. In today's
kernel, there are two lists of pages, active and inactive, for each memory
zone. Pages move
between them based on how recently they were used. When it is time
to find a page to evict, the kernel searches the inactive list for
candidates. In many cases, it is looking for page cache pages,
particularly those that are unmodified and can simply be dropped, but has
to wade through an enormous number of process-memory pages to find them.
The solution proposed is to break both lists apart, based on the type of
page. Page cache pages (aka file pages) and process-memory pages (aka
anonymous pages) will each live on their own active and inactive lists.
When the kernel is looking for a specific type of page, it can choose the
proper list, considerably reducing the amount of time spent searching.
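A toy simulation makes the benefit concrete; this models the proposed split with plain queues (a deliberate simplification: real LRU lists are per-zone, ordered by activity, and protected by locks):

```python
# Toy model of split inactive lists: file-backed and anonymous pages
# live on separate lists, so a scan for droppable page-cache pages
# never wades through anonymous memory.  Conceptual only.

from collections import deque

inactive = {"file": deque(), "anon": deque()}

def add_page(kind, page):
    inactive[kind].append(page)

def reclaim_file_pages(count):
    """Evict clean page-cache pages without touching anonymous pages."""
    evicted = []
    while inactive["file"] and len(evicted) < count:
        evicted.append(inactive["file"].popleft())
    return evicted

for n in range(1000):
    add_page("anon", ("anon", n))      # lots of process memory
add_page("file", ("file", 0))
add_page("file", ("file", 1))

# With one combined list, finding the two file pages could mean
# scanning past 1000 anonymous pages; here it is two dequeues.
print(reclaim_file_pages(2))
```

On a machine with hundreds of millions of pages, the difference between "scan everything" and "scan only the candidates" is what turns an hours-long stall into ordinary reclaim work.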
This patch is an update to an earlier proposal by van Riel, covered here last March. The
patch is now broken into ten parts, allowing for easier reviewing. It has
also been updated to the latest kernel, modified to work with various
features (like lumpy reclaim)
that have been added in the interim.
Additional features are planned to be added down the road, as outlined on
van Riel's page
replacement design web page. Adding a non-reclaimable list for pages
that are locked to physical memory with mlock(), or are part of a
RAM filesystem and cannot be evicted, is one of the first changes listed.
It makes little sense to scan past these pages.
Another feature that van Riel lists is to track recently evicted pages so
that, if they get loaded again, the system can reduce the likelihood of
another eviction. This should help keep pages in the page cache that get
accessed somewhat infrequently, but are not completely unused. There are
also various ideas about limiting the sizes of the active and inactive
lists to put a bound on worst-case scenarios. van Riel's plans also
include making better decisions about when to run the out-of-memory (OOM)
killer as well as making it faster to choose its victim.
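The "remember recently evicted pages" idea resembles the ghost lists used by algorithms like 2Q and ARC; the sketch below is a loose conceptual model under that assumption, not van Riel's actual design:

```python
# Sketch of refault detection: if a page comes back shortly after
# being evicted, start it on the active list so it is less likely to
# be evicted again.  Loosely modeled on 2Q/ARC-style ghost lists.

from collections import OrderedDict

recently_evicted = OrderedDict()   # bounded "ghost" list: page -> True
GHOST_CAPACITY = 4                 # arbitrary illustrative bound

def record_eviction(page):
    recently_evicted[page] = True
    if len(recently_evicted) > GHOST_CAPACITY:
        recently_evicted.popitem(last=False)   # forget the oldest entry

def fault_in(page):
    """Return the list a freshly loaded page should start on."""
    if recently_evicted.pop(page, None):
        return "active"      # refault: promote immediately
    return "inactive"

record_eviction("pageA")
print(fault_in("pageA"))   # active -- seen recently, protect it
print(fault_in("pageB"))   # inactive -- no history, prove itself first
```

Only the page identity is remembered, not its contents, so the ghost list costs a few bytes per entry while letting the kernel distinguish "evicted and never missed" from "evicted too eagerly".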
Overall, it is a big change
to how the page replacement code works today, which is why it will be
broken up into smaller chunks. By making changes that add incremental
improvements, and getting them into the hands of
developers and testers, the hope is that the bugs can be shaken out more
easily. Before that can happen, though, this set of patches must
pass muster with the kernel hackers and be merged. The user-visible
impact of these patches should be small, but they are fairly intrusive,
touching a fair amount of code. In addition, memory management patches
tend to have a tough path into the kernel.
Comments (2 posted)
Patches and updates
Core kernel code
Filesystems and block I/O
Virtualization and containers
Page editor: Jonathan Corbet
Next page: Distributions>>