Brief items
The current 2.6 prepatch is 2.6.19-rc1,
released on October 4,
several milliseconds after last week's Kernel Page was published. For a
summary of changes, see this article and this one from
the last two weeks. Highlights include
the parallel ATA driver set,
labeled networking for IPsec
and CIPSO security, a few new architectures, lots of new drivers, the
GFS2 cluster filesystem,
eCryptfs, and large numbers of
internal changes.
The long-format
changelog has the details - but, since we're talking about almost 5000
patches from over 600 contributors, it's best to have a lot of time on
one's hands. The short-form changelog is
somewhat more compact, but still lengthy.
At this point in the process, patches going into the mainline repository
are supposed to be confined to fixes. Many of them are, but Linus has
merged a few other significant changes, including, as predicted, the interrupt handler prototype
change, which has caused changes throughout the tree. There is a new
epoll_pwait() system call which takes an additional signal mask
parameter, and the venerable (but long-unused)
<linux/config.h> include file has been removed at last.
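As a rough user-space illustration of the extra signal mask parameter, the sketch below waits on an epoll descriptor while SIGINT is blocked for the duration of the wait only. The helper name wait_with_sigint_blocked() is made up for this example, and the code assumes a C library which provides an epoll_pwait() wrapper.

```c
#include <signal.h>
#include <sys/epoll.h>

/*
 * Hypothetical helper: wait for events on epfd for up to timeout_ms
 * milliseconds.  The mask passed to epoll_pwait() replaces the caller's
 * signal mask while the call waits and is restored on return, so SIGINT
 * is blocked only during the wait itself.
 */
int wait_with_sigint_blocked(int epfd, struct epoll_event *events,
                             int maxevents, int timeout_ms)
{
    sigset_t mask;

    sigemptyset(&mask);
    sigaddset(&mask, SIGINT);
    return epoll_pwait(epfd, events, maxevents, timeout_ms, &mask);
}
```

The return value is the same as for epoll_wait(): the number of ready descriptors, zero on timeout, or -1 on error.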
Also merged is the developmental ext4 filesystem, which brings a number of enhancements,
among them support for extents and 48-bit block numbers.
See the ext4 documentation file if you are
interested in playing with ext4 (and have good backups).
The current -mm tree is 2.6.19-rc1-mm1. Recent changes
to -mm include the addition of ext4 (which promptly moved on into the
mainline), continued work on the swap token mechanism, a generic
log2() implementation, and the dynamic tick patch.
Kernel development news
Maintaining drivers out of tree is shameless autoflagellation at
the best of times. We really don't care -- if we didn't make life
hard for them in this way they'd only go and stick pins under their
fingernails to make up for the lack of pain. If you think about it
like that, we're probably doing them a favour -- at least this way
they're _safe_.
-- David Woodhouse
The nopfn() VMA operation was added for 2.6.19-rc1; see
this article from last month for
information on this method. It turns out, though, that nopfn()
might just be one of the shortest-lived kernel API extensions in some time;
Nick Piggin has posted a series
of patches which will bring significant changes to how page faults are
handled at the lowest levels.
The 2.6.19-rc1 vm_operations_struct structure defines three
methods which handle low-level paging:
    struct page *(*nopage)(struct vm_area_struct *area,
                           unsigned long address, int *type);
    unsigned long (*nopfn)(struct vm_area_struct *area,
                           unsigned long address);
    int (*populate)(struct vm_area_struct *area, unsigned long address,
                    unsigned long len, pgprot_t prot,
                    unsigned long pgoff, int nonblock);
Ordinarily, page faults are handled by nopfn() (if it exists) or
nopage(). Those functions are supposed to take the given
address and associate it with a page in physical memory. For
virtual memory areas (VMAs) which are backed by files, the virtual
filesystem layer reacts to a nopage() call by allocating a page of
memory, reading the appropriate contents from backing store, then passing
the page back to the kernel for insertion into the page tables. Device
drivers which implement nopage() typically just translate the
address into an appropriate pointer for an in-memory buffer being
mapped into user space.
Both nopfn() and nopage() assume that the mapping between
virtual memory addresses and the offset within the VMA is linear - that is
why only the address is provided as a parameter. The kernel, however, also
supports nonlinear mappings,
where an application can turn a VMA into a complex window into different
parts of the backing file. The nopfn() and nopage()
methods cannot handle these mappings, since they do not have the required
information. Instead, any backing store which supports nonlinear mappings
must provide a populate() method, which has parameters for both
the virtual memory address and the associated offset
(pgoff) into the backing store device.
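To make the "linear" assumption concrete, here is a small user-space sketch of the arithmetic involved: with a linear mapping, the page offset into the backing object follows directly from the faulting address. The structure and names (sketch_vma, sketch_linear_pgoff(), SKETCH_PAGE_SHIFT) are invented for illustration and only imitate the relevant vm_area_struct fields; 4096-byte pages are assumed.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12    /* assumption: 4096-byte pages */

/* Hypothetical stand-in for the fields of interest in a VMA. */
struct sketch_vma {
    unsigned long vm_start;     /* first virtual address of the area */
    unsigned long vm_end;       /* first address past the end of the area */
    unsigned long vm_pgoff;     /* offset of vm_start in the backing object, in pages */
};

/*
 * For a linear mapping, the offset is the page distance from the start
 * of the area plus the offset at which the area itself begins.
 */
unsigned long sketch_linear_pgoff(const struct sketch_vma *vma,
                                  unsigned long address)
{
    assert(address >= vma->vm_start && address < vma->vm_end);
    return ((address - vma->vm_start) >> SKETCH_PAGE_SHIFT) + vma->vm_pgoff;
}
```

In a nonlinear mapping this calculation no longer holds, since any page in the VMA may correspond to any offset in the file; that is why populate() must be told the offset explicitly.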
Enter Nick, who was working on a tricky race condition found within one of
the most notoriously tricky parts of the kernel: the code which handles
file truncation. In some conditions, a page which was being removed as a
result of a truncate() call could be simultaneously faulted in via
nopage(), leading to memory management confusion. While
rethinking the locking rules for these operations, Nick decided that there
should be a better way. The result was a new VMA operation called
fault():
    struct fault_data {
        struct vm_area_struct *vma;
        unsigned long address;
        pgoff_t pgoff;
        unsigned int flags;
        int type;
    };
    struct page *(*fault)(struct vm_area_struct *vma,
                          struct fault_data *fdata);
This method is intended to replace all of nopfn(),
nopage(), and populate(). When a page fault happens, the
kernel fills in the fault_data structure with the needed
information: the user-space address associated with the fault, the
corresponding offset pgoff, and a couple of flags which indicate
whether the fault happened on a write access and whether a nonlinear
mapping is involved.
The fault() function should locate a page which can satisfy a
request for the offset pgoff; it won't normally need
address at all. The function can then either return the
associated struct page, or set the page table entry directly (with
something like vm_insert_page()) and return NULL. Either
way, the type field should be set to the type of fault (major or
minor). If the fault cannot be handled, the appropriate error code should
be put into type instead.
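The shape of such a handler can be sketched in user space with mock types. Everything below is an assumption made for illustration: the sketch_ structures merely imitate the layout shown above, the SKETCH_FAULT_ values are placeholders rather than the kernel's constants, and the "driver" simply hands out pages from an in-memory buffer. It is not code from Nick's patch.

```c
#include <stddef.h>

/* Placeholder fault types; the real kernel values may differ. */
enum { SKETCH_FAULT_MINOR = 1, SKETCH_FAULT_SIGBUS = 2 };

/* Mock page and a mock VMA backed by an array of pages. */
struct sketch_page { unsigned char data[4096]; };

struct sketch_buf_vma {
    struct sketch_page *pages;  /* buffer being mapped */
    unsigned long npages;       /* number of pages in the buffer */
};

/* Mock of the fault_data structure described in the article. */
struct sketch_fault_data {
    struct sketch_buf_vma *vma;
    unsigned long address;
    unsigned long pgoff;
    unsigned int flags;
    int type;
};

/*
 * Locate the page for fdata->pgoff.  The faulting address is not needed;
 * the offset alone identifies the page.  On failure, an error code goes
 * into type and NULL is returned.
 */
struct sketch_page *sketch_fault(struct sketch_buf_vma *vma,
                                 struct sketch_fault_data *fdata)
{
    if (fdata->pgoff >= vma->npages) {
        fdata->type = SKETCH_FAULT_SIGBUS;
        return NULL;
    }
    fdata->type = SKETCH_FAULT_MINOR;
    return &vma->pages[fdata->pgoff];
}
```

Because the handler works purely from pgoff, the same function would serve linear and nonlinear mappings alike, which is the point of the new interface.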
Nick's patch gets rid of the nopfn() and populate()
methods immediately. There is currently only one user of nopfn(),
and the older populate() API has never been widely used outside of
the mainline kernel. The install_page() function is also destined
for a near-term demise. The nopage() method, by contrast, is widely
used by device drivers, inside and outside of the mainline. So it has been
marked as deprecated and scheduled for removal one year from now, in
October, 2007. There have been suggestions that nopage() should
go sooner (after six months, say), but no definitive decision.
Details like that aside, there appears to be broad support for this
change. These patches would probably be a bit too new for 2.6.19, even if
the merge window were still open, so 2.6.20 is the earliest likely date for
them to appear in the mainline. But, at that point, driver and out-of-tree
filesystem maintainers will have some updating to do.
October 4, 2006
This article was contributed by Paul McKenney
Classic RCU requires that read-side critical sections obey the same rules
obeyed by the critical sections of pure spinlocks: blocking or sleeping
of any sort is strictly prohibited.
This has frequently been an obstacle to the use of RCU, and
I have received numerous requests for a "sleepable RCU" (SRCU) that
permits arbitrary sleeping (or blocking) within RCU read-side critical
sections.
I had previously rejected all such requests as unworkable, since arbitrary
sleeping in RCU read-side could indefinitely extend grace periods, which
in turn could result in arbitrarily large amounts of memory awaiting the
end of a grace period, which finally would result in system hangs due
to memory exhaustion.
After all, any concurrency-control primitive that could result in
system hangs -- even when used correctly -- does not deserve to exist.
However, the realtime kernels that require spinlock critical sections
be preemptible [3] also require that RCU read-side critical
sections be preemptible [2].
Preemptible critical sections in turn require that lock-acquisition
primitives block in order to avoid deadlock,
which in turn means that both RCU's and spinlocks'
critical sections must be able to block awaiting a lock.
However, these two forms of sleeping have the special property that
priority boosting and priority inheritance may be used to awaken
the sleeping tasks in short order.
Nevertheless,
use of RCU in realtime kernels was the first crack in the tablets
of stone on which were inscribed "RCU read-side critical sections can never
sleep".
That said, indefinite sleeping, such as blocking waiting for an
incoming TCP connection, is strictly verboten even in realtime kernels.
Quick Quiz 1: Why is sleeping prohibited within Classic RCU read-side
critical sections?
Quick Quiz 2:
Why not permit sleeping in Classic RCU read-side critical sections
by eliminating context switch as a quiescent state, leaving user-mode
execution and idle loop as the remaining quiescent states?
(Click below for the rest of this lengthy, technical article - and the
answers to the quick quiz questions).
Full Story (comments: 10)
Your editor has recently had the opportunity to write a Linux driver for a
camera device - the camera which will be packaged with the One Laptop Per
Child system, in particular. This driver works with the internal kernel
API designed for such purposes: the Video4Linux2 API. In the process of
writing this code, your editor made the shocking discovery that, in fact,
this API is not particularly well documented - though the user-space side,
by contrast, is quite well documented indeed. In an attempt to remedy the
situation somewhat, LWN will, over the coming months, publish a series of
articles describing how to write drivers for the V4L2 interface.
V4L2 has a long history - the first gleam came into Bill Dirks's eye back
around August of 1998. Development proceeded for years, and the V4L2 API
was finally merged into the mainline in November, 2002, when 2.5.46 was released. To this
day, however, quite a few Linux drivers do not support the newer API; the
conversion process is an ongoing task. Meanwhile, the V4L2 API continues
to evolve, with some major changes being made in 2.6.18. Applications
which work with V4L2 remain relatively scarce.
V4L2 is designed to support a wide variety of devices, only some of which
are truly "video" in nature:
- The video capture interface grabs video data from a tuner or
camera device. For many, video capture will be the primary
application for V4L2. Since your editor's experience is strongest in
this area, this series will tend to emphasize the capture API, but
there is more to V4L2 than that.
- The video output interface allows applications to drive
peripherals which can provide video images - perhaps in the form of a
television signal - outside of the computer.
- A variant of the capture interface can be found in the video
overlay interface, whose job is to facilitate the direct display
of video data from a capture device. Video data moves directly from
the capture device to the display, without passing through the
system's CPU.
- The VBI interfaces provide access to data transmitted during
the video blanking interval. There are two of them, the "raw" and
"sliced" interfaces, which differ in the amount of processing of the
VBI data performed in hardware.
- The radio interface provides access to audio streams from AM
and FM tuner devices.
Other types of devices are possible. The V4L2 API has some stubs for
"codec" and "effect" devices, both of which perform transformations on
video data streams. Those areas have not yet been completely specified,
however, much less implemented. There are also the "teletext" and "radio
data system" interfaces currently implemented in the older V4L1 API; those
have not been moved to V4L2 and there do not appear to be any immediate
plans to do so.
Video devices differ from many others in the vast number of ways in which
they can be configured. As a result, much of a V4L2 driver implements code
which enables applications to discover a given device's capabilities and to
configure that device to operate in the desired manner. The V4L2 API
defines several dozen callbacks for the configuration of parameters like
tuner frequencies, windowing and cropping, frame rates, video compression,
image parameters (brightness, contrast, ...), video standards, video
formats, etc. Much of this series will be devoted to looking at how this
configuration process happens.
Then, there is the small task of actually performing I/O at video rates in
an efficient manner. The V4L2 API defines three different ways of moving
video data between user space and the peripheral, some of which can be on
the complex side. Separate articles will look at video I/O and the
video-buf layer which has been provided to handle common tasks.
Subsequent articles will appear every few weeks, and will be added to the
list below:
Page editor: Jonathan Corbet