Brief items
The current stable 2.6 kernel is 2.6.11.6, which was released (with a
handful of security patches) on March 25.
The current 2.6.12 prepatch remains 2.6.12-rc1; no 2.6.12 prepatches have
been released in the last week.
Linus's BitKeeper repository contains a number of architecture updates, an
XFS update, some netpoll improvements, a new __nocast annotation
which allows "sparse" to catch certain type mismatches, a change from
io_remap_page_range() to io_remap_pfn_range(), and lots
of fixes.
The current -mm tree is 2.6.12-rc1-mm3.
Recent changes to -mm include the addition of David Miller's networking
tree and Herbert Xu's crypto tree, some core page table handling cleanups,
a big DVB update, a number of cleanups to the (ugly and insecure) ISO9660
filesystem code, and lots of fixes.
The current 2.4 prepatch is 2.4.30-rc4, released by Marcelo on March 30 with a
couple of regression fixes. Previously, 2.4.30-rc3 was released on March 26.
The -rc3 patch contained a single fix for a serious problem introduced in
2.4.30-rc2, which had been released (with several fixes) the day before.
Kernel development news
In NFSv4 we often want to serialize asynchronous RPC calls with
ordinary RPC calls (OPEN and CLOSE for instance). On paper,
semaphores would appear to fit the bill, however there is no
support for asynchronous I/O with semaphores. <rant>What's
more, trying to add that type of support is an exercise in
futility: there are currently 23 slightly different arch-dependent
and over-optimized versions of semaphores (not counting the
different versions of read/write semaphores).</rant>
--Trond Myklebust
Ingo Molnar's massive realtime preemption patch is an attempt to bring
near-realtime response to the stock Linux kernel. It works by making almost
everything in the kernel preemptible. Spinlocks turn into preemptible
mutexes; interrupt handlers get moved into preemptible kernel threads,
etc. The result is a major change in how kernel code is scheduled, along
with quicker response to external events.
This work has been quieter in recent times, but it has not stalled by
any means.
When LWN last looked at the realtime preemption
patch, one of the remaining rough spots was its interaction with the
read-copy-update (RCU) mechanism. RCU, remember, encapsulates a
conceptually simple (though a bit more gnarly in the implementation)
technique. A resource of interest (a routing table entry, say) is
referenced by a pointer. When that resource must be changed, a copy is
made and the changes are done there; the pointer is then directed at the
new copy. At some future, safe time, the old version can be freed. Linux
RCU works by requiring that all accesses to RCU-protected data structures
be atomic; with that constraint, a "safe time" can be defined as "after
every processor on the system has scheduled." Since scheduling while
holding a reference to an RCU-protected structure is against the rules, any
such structure which was made inaccessible before all processors schedule
cannot be referenced by any processor afterward.
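The copy-and-redirect pattern described above can be sketched in user space. This is only an illustration, not kernel code: the route structure, the function names, and the use of C11 atomics in place of the kernel's own primitives are all assumptions made for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical protected resource; the fields are illustrative only. */
struct route {
    int gateway;
    int metric;
};

/* The single pointer through which readers find the current version. */
static _Atomic(struct route *) current_route;

/* Reader side: pick up whichever version is published right now. */
static struct route *route_get(void)
{
    return atomic_load_explicit(&current_route, memory_order_acquire);
}

/* Publish the first version. Returns 0 on success, -1 on allocation failure. */
static int route_init(int gateway, int metric)
{
    struct route *first = malloc(sizeof(*first));

    if (!first)
        return -1;
    first->gateway = gateway;
    first->metric = metric;
    atomic_store_explicit(&current_route, first, memory_order_release);
    return 0;
}

/*
 * Writer side: copy the current version, change the copy, then redirect
 * the pointer. The superseded version is handed back in *stale; it may
 * only be freed at a "safe time", once no reader can still hold it.
 */
static int route_set_metric(int metric, struct route **stale)
{
    struct route *old = route_get();
    struct route *fresh = malloc(sizeof(*fresh));

    assert(old != NULL);        /* route_init() must have run first */
    if (!fresh)
        return -1;
    *fresh = *old;              /* the copy */
    fresh->metric = metric;     /* the update, made on the copy */
    *stale = atomic_exchange_explicit(&current_route, fresh,
                                      memory_order_acq_rel);
    return 0;
}
```

A reader which called route_get() before the update keeps seeing a consistent, if outdated, structure. Deciding when the stale copy can be freed is the hard part, and it is the part the rest of this article is concerned with.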
Since accesses to RCU-protected structures must be atomic, the RCU locking
function (rcu_read_lock()) disables preemption. But disabling
preemption is exactly what the realtime preemption patch is trying to get
away from, so something had to give. Ingo had solved this problem by
requiring that all RCU users identify an explicit lock which protects the
structures in question, and modifying the RCU locking functions to take
that lock as a parameter. This approach was never optimal. It required the
creation of a whole new family of RCU functions to cope with every type of
lock that might
be used, and, simultaneously, decreased the flexibility of the RCU read
locking mechanism. And, to a great extent, it simply replaced RCU with
more traditional locking which, while it works, does not have the
scalability advantages which were the motivation for RCU in the first
place.
The RCU issue was clearly on Ingo's mind:
If PREEMPT_RT is merged into the upstream kernel then it will (at
least initially) be at a status similar to NOMMU: it will be
tolerated as long as it causes no 'drag' on the main code. The RCU
API variants i introduced clearly violated this requirement, and
were my #1 worry wrt. upstream mergability.
So Ingo was pleased when RCU creator Paul McKenney proposed some approaches for making RCU and
realtime preemption work together. Paul's message goes through a series of
increasingly complex solutions, and is worth reading in its own right. The
core idea, however, is that, in a fully preemptible world, RCU cannot
depend on atomic access to data structures, and thus cannot use the "all
processors have scheduled" heuristic to know that the time has come to
execute a given set of RCU cleanup functions. So the tracking of code
executing within RCU critical sections must be made more explicit. Paul's
solutions used a reader/writer lock for that purpose, but the approach
taken in Ingo's latest realtime preemption
patch is a little different.
The code executed to go into an RCU-protected section now looks like this
(when configured for realtime preemption):
    void rcu_read_lock(void)
    {
        if (current->rcu_read_lock_nesting++ == 0) {
            current->rcu_data = &get_cpu_var(rcu_data);
            atomic_inc(&current->rcu_data->active_readers);
            smp_mb__after_atomic_inc();
            put_cpu_var(rcu_data);
        }
    }
The idea is simple: a per-CPU count of processes in RCU critical sections
is kept. When a process goes into a critical section, a pointer to the
current CPU's counter is stored with the task information, so
that the right counter will be decremented later on. There is also a
per-process variable which keeps track of RCU section nesting. No further
work needs to be done before the process can access the protected
structure; in particular, no locks are acquired.
When the process exits the critical section, the procedure is reversed: the
nesting count is decremented. When that count goes to zero, the per-CPU
count is decremented as well. If the per-CPU count drops to zero, then
that processor is deemed to have "quiesced," with no processes running
within RCU critical sections. Once all CPUs have quiesced in this way (as
tracked by a bitmask of processors in the system), all RCU cleanup
functions queued before their respective processors quiesced can be
called.
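The exit path is not quoted from the patch here, so what follows is a user-space model built from the description above rather than Ingo's code. The model_ names, the fixed-size array standing in for per-CPU variables, and the explicit cpu argument are all assumptions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_CPUS 2     /* hypothetical machine size */

/* Stand-in for the per-CPU rcu_data structure. */
struct model_rcu_data {
    atomic_int active_readers;
};

/* Stand-in for the RCU-related fields of the task structure. */
struct model_task {
    int rcu_read_lock_nesting;
    struct model_rcu_data *rcu_data;
};

static struct model_rcu_data model_cpu_rcu[MODEL_NR_CPUS];

/* Entry: mirrors the rcu_read_lock() shown above, minus memory barriers. */
static void model_rcu_read_lock(struct model_task *t, int cpu)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    if (t->rcu_read_lock_nesting++ == 0) {
        t->rcu_data = &model_cpu_rcu[cpu];
        atomic_fetch_add(&t->rcu_data->active_readers, 1);
    }
}

/*
 * Exit: decrement the nesting count and, at the outermost level, the
 * counter remembered at entry. The saved pointer is used rather than the
 * CPU the task is running on now, since a preempted task may have moved.
 * Returns true if that counter's CPU has just quiesced.
 */
static bool model_rcu_read_unlock(struct model_task *t)
{
    bool quiesced = false;

    assert(t->rcu_read_lock_nesting > 0);
    if (--t->rcu_read_lock_nesting == 0) {
        /* atomic_fetch_sub() returns the value before the decrement */
        quiesced = atomic_fetch_sub(&t->rcu_data->active_readers, 1) == 1;
        t->rcu_data = NULL;
    }
    return quiesced;
}
```

In this model, two tasks entering critical sections on the same CPU leave its counter at two; the CPU is reported as quiesced only when the second of them leaves its outermost section.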
This scheme restores the core RCU functionality, allowing lock-free access
to fast-path data structures. It also retains the current RCU API, with
the result that the realtime preemption patch becomes significantly less
intrusive. It is not a perfect implementation, however. It requires that
each CPU regularly find itself with no processes executing within RCU
critical sections. Since these sections are now preemptible, the "quiet"
times could be quite far apart on heavily-loaded systems. While the system
is waiting for a processor to quiesce, the RCU callback structures for the
cleanup functions will continue to accumulate, to the point that quite a
bit of memory could be used before the cleanup actually happens. For the
realtime case, this tradeoff is acceptable: latency, not memory use, is the
most important factor. Since the existing RCU algorithm is used when
realtime preemption is not configured in, everybody should be happy. In
practice, further work may be required; in particular, it may be necessary
to find a way to force RCU cleanup when the system gets low on memory.
Meanwhile, however, the realtime
preemption patch appears to have gotten past one more major hurdle on its
way toward possible inclusion into the mainline.
Attentive readers of patches being merged for 2.6.12-rc2 will have noticed
the use of a new attribute: __nocast. For example, the prototype of
kmalloc() has changed to:
    void *kmalloc(size_t size, unsigned int __nocast flags);
For normal compilation, this attribute expands to an empty string; it has
no effect. When the sparse tool is being
used, however, the __nocast attribute disables many of the
implicit type conversions performed by the compiler. In the
kmalloc() case, sparse will complain
whenever a signed integer value is passed as the flags argument.
Since the GFP flags passed to kmalloc() are explicitly defined as
unsigned values, they will not cause a warning to be issued. Any normal
integer variable or constant, however, will be flagged. Similarly, the use
of an integer value where an enumerated type is expected will be caught.
Thus, this little tweak should help with the automated detection of another
class of errors that the compiler will not find.
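A minimal sketch of how such an annotation can vanish under a normal compiler while remaining visible to sparse might look like the following. The conditional on __CHECKER__ (the macro sparse defines when it runs) follows the kernel's usual approach to sparse annotations; demo_alloc() and DEMO_FLAG_WAIT are hypothetical stand-ins for kmalloc() and the GFP flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#ifdef __CHECKER__
# define __nocast __attribute__((nocast))   /* seen only by sparse */
#else
# define __nocast                           /* empty for the compiler */
#endif

/* Hypothetical flag, defined as an explicitly unsigned value. */
#define DEMO_FLAG_WAIT 0x10u

/* Hypothetical allocator with an annotated flags argument. */
static void *demo_alloc(size_t size, unsigned int __nocast flags)
{
    assert(size > 0);   /* a zero-size request is a caller bug here */
    (void)flags;        /* the flags are ignored in this sketch */
    return malloc(size);
}
```

A call such as demo_alloc(16, DEMO_FLAG_WAIT) passes an unsigned value and should not draw a complaint, while passing a plain int variable is the sort of usage the annotation is meant to flag. The compiler itself accepts both.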
io_remap_page_range() has always been a strange function. Its
stated purpose is to portably map I/O memory into a process's address
space. Its prototype has always differed from one system to the next,
however, making portable use difficult. On most architectures it looks
like this:
    int io_remap_page_range(struct vm_area_struct *vma, unsigned long virt_addr,
                            unsigned long phys_addr, unsigned long size,
                            pgprot_t prot);
The sparc64 architecture, however, defines it this way:
    int io_remap_page_range(struct vm_area_struct *vma, unsigned long virt_addr,
                            unsigned long phys_addr, unsigned long size,
                            pgprot_t prot, int space);
The extra argument (space) was necessary to deal with the
inconvenient fact that I/O addresses on the sparc64 architecture would not
fit into an unsigned long variable.
The change from remap_page_range()
to remap_pfn_range() was done, in part, to address (so to speak)
this issue. Since remapping must be done on a page-aligned basis anyway,
there is no real point in using a regular physical address, which contains
the offset within the page. Said offset, after all, must be zero. By using a page frame
number instead, the range of the phys_addr argument is extended
far enough to reach into I/O memory on all architectures. The
remap_pfn_range() work stopped short of actually fixing the
io_remap_page_range() problem, however.
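The arithmetic behind that extension is easy to check. The sketch below assumes 4KB pages and models a 32-bit unsigned long; both are illustrative choices rather than a description of any particular architecture, and the names are invented for the example. Under those assumptions a byte address tops out at 4GB, while a page frame number reaches 2^44 bytes (16TB).

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12                      /* assumes 4KB pages */
#define DEMO_PAGE_MASK  ((1ull << DEMO_PAGE_SHIFT) - 1)

/* Models an unsigned long on a hypothetical 32-bit system. */
typedef uint32_t demo_ulong;

/*
 * Convert a physical address into a page frame number. Returns 0 on
 * success, or -1 if the address is not page-aligned or the frame number
 * does not fit in the modeled unsigned long.
 */
static int demo_phys_to_pfn(uint64_t phys_addr, demo_ulong *pfn)
{
    uint64_t frame = phys_addr >> DEMO_PAGE_SHIFT;

    if (phys_addr & DEMO_PAGE_MASK)
        return -1;              /* offset within the page must be zero */
    if (frame > UINT32_MAX)
        return -1;              /* out of reach even as a frame number */
    *pfn = (demo_ulong)frame;
    return 0;
}
```

An I/O address such as 0x1FF00000000 needs 41 bits and cannot be stored in the 32-bit type, but its page frame number, 0x1FF00000, fits with room to spare.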
Randy Dunlap has now finished the task with a set of patches adding
io_remap_pfn_range():
    int io_remap_pfn_range(struct vm_area_struct *vma, unsigned long from,
                           unsigned long pfn, unsigned long size,
                           pgprot_t prot);
This function has the same prototype on all architectures. In-tree callers
have been modified, and the feature removal schedule has been updated:
io_remap_page_range() will go away in September, 2005.
iSCSI is, for all practical purposes, a way of attaching storage devices to
a fast network interconnect and making them look like local SCSI drives.
There is a great deal of interest in iSCSI for high-end "storage area
network" applications, and a few competing iSCSI implementations exist for
Linux. Top-quality Linux iSCSI support would be a good thing to
have; it turns out, however, that iSCSI raises an interesting issue with
how the block subsystem works, especially when it must interact with the
networking layer.
When the system gets short of memory, one of the things it must do is to
force dirty pages to be written to their backing store, so that those pages
may be freed. This activity becomes doubly urgent when the system runs
completely out of memory. What happens, however, if the act of writing
those pages to disk also requires a memory allocation? In the iSCSI case,
those pages must be written via a TCP socket, so the networking layer must
be able to allocate enough memory to handle the TCP protocol's needs. If
the system is completely out of memory, where will this additional
allocation come from?
This particular problem was solved for the block layer some time ago with
the mempool mechanism. A mempool sets aside
a certain amount of memory for emergencies. When all else fails, the block
layer can allocate needed memory from the mempool; in that way, it is
guaranteed of being able to make at least some progress and free memory for
the system.
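The idea can be modeled in a few lines of user-space C. This is a sketch of the concept only; it does not reproduce the kernel's mempool API, all of the demo_ names are invented, and the simulate_oom argument exists purely so that the failure path can be exercised. Where this model returns NULL from an exhausted reserve, a kernel caller would typically wait for an element to be returned instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_POOL_MIN 4     /* hypothetical reserve size */

struct demo_pool {
    void  *reserve[DEMO_POOL_MIN];  /* elements set aside in advance */
    int    nr_free;                 /* how many are still in the reserve */
    size_t elem_size;
};

/* Fill the reserve while memory is still plentiful. */
static int demo_pool_init(struct demo_pool *p, size_t elem_size)
{
    p->elem_size = elem_size;
    p->nr_free = 0;
    while (p->nr_free < DEMO_POOL_MIN) {
        void *e = malloc(elem_size);

        if (!e) {
            while (p->nr_free > 0)
                free(p->reserve[--p->nr_free]);
            return -1;
        }
        p->reserve[p->nr_free++] = e;
    }
    return 0;
}

/* Try the normal allocator first; dip into the reserve only on failure. */
static void *demo_pool_alloc(struct demo_pool *p, int simulate_oom)
{
    void *e = simulate_oom ? NULL : malloc(p->elem_size);

    if (e)
        return e;
    if (p->nr_free > 0)
        return p->reserve[--p->nr_free];
    return NULL;
}

/* Returned elements refill the reserve before going back to the system. */
static void demo_pool_free(struct demo_pool *p, void *e)
{
    assert(e != NULL);
    if (p->nr_free < DEMO_POOL_MIN)
        p->reserve[p->nr_free++] = e;
    else
        free(e);
}
```

The forward-progress guarantee comes from the free path: as long as operations in flight eventually complete and return their elements, the reserve refills and the next allocation can succeed.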
A similar mechanism could be put in place for network-based devices,
probably through a special socket option which would cause a mempool to be
set up for a specific connection. Attaching a mempool to a socket would
guarantee that the system could send data through that connection.
Unfortunately, in this case, using a mempool in this way does not solve the
entire problem.
When a block driver writes data to a local device, it can easily tell when
the operation has completed (and the relevant memory can be freed). In
many cases, it is simply a matter of
waiting for an interrupt and querying ports on the host controller. Newer,
more complex protocols can be handled by setting aside a small amount of
memory for replies from the controller. The controller is unlikely to
overwhelm the system with spurious messages; about the only thing that will
come back is responses to operations initiated by the system.
In the iSCSI case, a write to the device cannot be deemed to have succeeded
until the device sends back an acknowledgment, which will arrive as one of
possibly many TCP packets. If the system does not have memory available to
receive those packets and process the ACKs, it will be unable to complete
the write operations and free up more memory. So everything stalls, or, in
the worst case, deadlocks completely.
Just creating another mempool for incoming packets is not a solution,
however. The number of packets arriving on a network interface can be
huge, and the bulk of them are likely to be entirely unrelated to the
crucial outstanding iSCSI operations. A system which is in an
out-of-memory state simply cannot attempt to keep up with the full flood of
packets arriving on its network interfaces. But, if it is unable to deal
with the specific packets it is looking for, it may never get out of its
memory crunch.
Various possible solutions have been floated. Many network interfaces can
be programmed, in great detail, to drop uninteresting packets. So, when
the system hits a memory crunch, it could instruct its network drivers to
restrict the incoming packet stream to acknowledgments on high-priority
connections. This approach would work, but it would require complicated
communications between network drivers and the higher layers of the
system. Network adaptors are also limited in the amount of programming
they can handle; this limitation would restrict the number of iSCSI devices
which could be reliably supported by the system.
Another possible solution was posted by
Andrea Arcangeli. When an attempt to allocate memory for an incoming
packet fails, the system would perform the allocation from one of the
mempools (chosen at random) associated with sockets routed through the
relevant interface. Once the packet was fed into the networking layer, a
quick check would be made to see if the packet is, in fact, associated with
one of the high-priority sockets; if not, it would be quickly dropped and
the memory returned to the mempool. Packets belonging to high-priority
sockets would be processed normally, resulting, hopefully, in the
completion of write operations and the freeing of memory.
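The borrow, classify, and return sequence in that proposal can be modeled roughly as follows. Nothing here comes from Andrea's posting: the structures and names are invented, reserves are reduced to simple counters, and the lender is passed in by the caller where the proposal would pick one at random.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_rx_result { DEMO_RX_DELIVERED, DEMO_RX_DROPPED, DEMO_RX_NO_MEMORY };

/* Hypothetical per-socket state. */
struct demo_socket {
    int  port;
    bool high_priority;     /* e.g. a connection to an iSCSI target */
    int  reserve_free;      /* buffers left in this socket's reserve */
};

/*
 * Handle a packet which arrives after the normal allocation has failed.
 * A buffer is borrowed from the lender's reserve; the packet is then
 * matched against the sockets. Only a packet for a high-priority socket
 * keeps the buffer. Anything else is dropped and the buffer goes back.
 */
static enum demo_rx_result demo_receive_oom(struct demo_socket *socks,
                                            size_t nr_socks, size_t lender,
                                            int dst_port)
{
    struct demo_socket *from;
    size_t i;

    assert(lender < nr_socks);
    from = &socks[lender];
    if (from->reserve_free == 0)
        return DEMO_RX_NO_MEMORY;
    from->reserve_free--;                   /* borrow a buffer */

    for (i = 0; i < nr_socks; i++)
        if (socks[i].port == dst_port && socks[i].high_priority)
            return DEMO_RX_DELIVERED;       /* buffer stays in use */

    from->reserve_free++;                   /* drop and return the buffer */
    return DEMO_RX_DROPPED;
}
```

The sketch makes one property of the scheme visible: unrelated traffic costs the reserve nothing beyond the time needed for the check, so the buffers remain available for the acknowledgments that matter.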
This discussion has not reached any sort of consensus, and has made it
clear that a number of issues arise when the block and networking layers
interact. The attempt to find a solution, in this case, is likely to be
deferred to the Kernel Summit, to be held in Ottawa this July. It should
be an interesting session.
Dave Airlie has launched
KernelPlanet.org, which is an
aggregation of weblog entries from several kernel hackers.
Patches and updates
Kernel trees
- Andrew Morton: 2.6.12-rc1-mm2. Now includes davem's networking tree.
(March 24, 2005)
Page editor: Jonathan Corbet