Brief items
The current stable 2.6 kernel is 2.6.17.7,
released on July 24. This
one adds a relatively long list of fixes for problems with networking,
sound, and several other areas.
The current 2.6 prepatch remains 2.6.18-rc2. Fixes continue to
accumulate in the mainline git repository, and the -rc3 release can be
expected sometime soon.
There have been no -mm releases since 2.6.18-rc1-mm2 on July 14.
Kernel development news
Linux has a problem, which is that with success it is attracting
people with more skill than what it started with, and it is not
doing a very good job of handling that. In fact, it downright
stinks at it, behaving in the worst way it could choose for
handling that. We have lost quite a number of FS developers who
just don't want to deal with people who know less than they do but
are obnoxious and disrespectful to submissions because they enjoy
powertripping.
-- Hans Reiser
When Van Jacobson
presented his
network channels idea at linux.conf.au last January, he set a bit of a
fire in the Linux networking community. By making some significant changes
to the processing path for incoming packets, and by pushing most of the
work as close as possible to the destination application, Van was able to achieve
significant performance improvements - eliminating as much as 80% of the
processing overhead on multiprocessor systems. With numbers like that, it
seemed like the question of whether Linux would incorporate channels need
not even be asked.
Since then, however, reality has begun to make itself felt - something
which reality is wont to do, sooner or later. Which is why David Miller's
latest pronouncement on network channels
reads like this:
Don't get too excited about VJ netchannels, more and more
roadblocks to their practicality are being found every day.... All
the original costs of route, netfilter, TCP socket lookup all
reappear as we make VJ netchannels fit all the rules of real
practical systems, eliminating their gains entirely.
The issue at hand had to do with the integration of channels and
netfilter. The hope had been that packets could be identified and sorted
into their respective channels before the netfilter (firewall) processing
was done. Then said processing could be performed close to the
application, on the same processor. It turns out, however, that netfilter
can change the real destination of the packet. So packets must be filtered
before entering a channel, and much of the performance benefit of using a
channel is lost.
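
The ordering problem can be illustrated with a toy model. The sketch below is not kernel code: the packet structure, the single port-rewriting rule, and the port-keyed channel table are all assumptions made for illustration. It only shows that classifying a packet before a destination-rewriting rule runs can select a different channel than classifying it afterward.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified packet: only the destination port matters here. */
struct toy_packet {
    uint16_t dst_port;
};

/* Hypothetical channel table: one channel per listening port. */
struct toy_channel {
    uint16_t port;
    int id;
};

static const struct toy_channel channels[] = {
    { 80, 1 },    /* application listening on port 80 */
    { 8080, 2 },  /* a different application on port 8080 */
};

/* Stand-in for a filter rule that redirects port 8080 to port 80. */
static void toy_filter(struct toy_packet *pkt)
{
    if (pkt->dst_port == 8080)
        pkt->dst_port = 80;
}

/* Return the channel id for the packet's current destination, or -1. */
static int toy_classify(const struct toy_packet *pkt)
{
    for (size_t i = 0; i < sizeof(channels) / sizeof(channels[0]); i++)
        if (channels[i].port == pkt->dst_port)
            return channels[i].id;
    return -1;
}

/* Classify first, filter later: the channel is chosen from the stale port. */
int classify_then_filter(struct toy_packet pkt)
{
    int id = toy_classify(&pkt);
    toy_filter(&pkt);
    return id;
}

/* Filter first: the channel matches the packet's real destination. */
int filter_then_classify(struct toy_packet pkt)
{
    toy_filter(&pkt);
    return toy_classify(&pkt);
}
```

In this model a packet addressed to port 8080 lands in channel 2 when classified early, even though the rewriting rule means it really belongs to channel 1.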
Alexey Kuznetsov has posted a detailed
criticism of channels, asserting that most of the claimed benefits are
illusory. Says Alexey:
It is an amazing toy. But I see nothing, which could promote its
status to practical. Exokernels used to do this thing for ages, and
all the performance gains are compensated by overcomplicated
classification engine, which has to remain in kernel and
essentially to do the same work which routing/firewalling/socket
hash tables do.
Finally, it seems that many of the benefits of channels can be had by
carefully taking advantage of the capabilities of modern hardware. In
particular, an increasing number of devices can perform simple packet
classification and (via targeted interrupts) direct packets to the CPU
where the destination application is running. That technique will get rid
of the cache misses caused by performing interrupt processing on one
processor and protocol processing on another.
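
As a rough illustration of that kind of classification, the sketch below models a small steering table in software. The rule format (destination port to CPU), the table size, and the function names are assumptions for this example; real devices use their own matching rules and configuration interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_RULES 16

/* Hypothetical steering rule: packets for dst_port go to `cpu`. */
struct toy_steer_rule {
    uint16_t dst_port;
    int cpu;
};

static struct toy_steer_rule rules[TOY_MAX_RULES];
static size_t nrules;

/* Record that the application owning dst_port runs on `cpu`. */
int toy_steer_add(uint16_t dst_port, int cpu)
{
    assert(cpu >= 0);
    if (nrules == TOY_MAX_RULES)
        return -1;
    rules[nrules].dst_port = dst_port;
    rules[nrules].cpu = cpu;
    nrules++;
    return 0;
}

/* Pick the CPU to interrupt for a packet; unknown flows use default_cpu. */
int toy_steer_cpu(uint16_t dst_port, int default_cpu)
{
    for (size_t i = 0; i < nrules; i++)
        if (rules[i].dst_port == dst_port)
            return rules[i].cpu;
    return default_cpu;
}
```

With a rule in place for a given port, both interrupt handling and protocol processing for that flow can happen on the CPU where the receiving application runs.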
In the end, it appears that yet another seemingly bright scheme may not
make the transition into real-world deployments. Some of its core
concepts, such as using cache-friendly data structures and trying (even
harder) to improve cache locality, will likely influence the future
direction of the network stack, however. So, while there may not be a
revolutionary new mechanism in the network stack's future, some of the
promised performance improvements should eventually be realized anyway.
And, as David says, "At least, there
is less code to write."
A system call found in some Unix variants is
revoke():
int revoke(const char *path);
This call exists to disconnect processes from files; when called with a
given path, it will shut down all open file descriptors which
refer to the file found at the end of that path. Its initial purpose was
to defeat people writing programs that would sit on a serial port and
pretend to be login. As soon as revoke() was called with
the device file corresponding to the serial port, any login spoofer would
find itself disconnected from the port and unable to fool anybody. Other
potential uses exist as well; consider, for example, disconnecting a
process from a file which is preventing the unmounting of a filesystem.
Linux has never had this system call, but this situation could change
before too long; Pekka Enberg has posted an implementation of
revoke() for review. Pekka has also added a second version:
int frevoke(int fd);
This version, of course, takes an open file descriptor as its argument. In
either case, the calling process must either own the file, or it must be
able to override file permissions. So revoke() gives a process
the ability to yank an open file out from underneath processes owned by
other users, as long as the calling process owns the file in question.
Getting this operation right can be a little tricky, with the result that
the current implementation makes some compromises which may not sit well
with other developers. The process, simplified, is this:
- The code loops through every process on the system; for each process,
it iterates through the open file table looking for file descriptors
corresponding to the file being revoked. Every time it finds one, it
zeroes out the file descriptor entry (making that descriptor
unavailable to its erstwhile owner). The file is not actually closed,
however; instead, a list of files to be closed is created for later
action.
All of this will be rather slow, but that should not be a
huge problem: revoke() is not a performance-critical
operation. The memory allocation (to add an entry to the list of
files to close) is a bit more problematic; if it fails,
revoke() will abort partway through, having done an unknown
amount of damage without having accomplished its goal.
- Once all open file descriptors have been shut down, the files
themselves can be closed. So revoke() steps through the list
it created, closing each open file.
- There is one sticky little problem remaining: some processes may have
used mmap() to map the file into their address spaces. The
revoke() call clearly has to do something about those memory
areas, or it will not have completed the job. So a pass through all
of the virtual memory areas associated with the file is required; for
each one, the nopage() method is set to a special version
which returns an error status.
That change will keep a process from faulting in new pages from the
revoked file, but does nothing about the pages which are already part
of the process's address space. To fix those, it is necessary to
wander through the page tables of each process having mapped the file,
clearing out any page table entries referring to pages from that file.
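
The first two steps can be modeled in user space. The sketch below is a toy, not Pekka's patch: the toy_file and toy_process structures, the fixed-size descriptor table, and the reference count are assumptions for illustration, and the memory-mapping step is not modeled at all. It does reproduce the two-pass shape described above, including the awkward case where an allocation fails partway through.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_MAX_FDS 8

/* Hypothetical stand-in for an open file; `inode` identifies the file. */
struct toy_file {
    int inode;
    int refcount;   /* number of descriptor slots pointing here */
};

/* Hypothetical stand-in for a process and its descriptor table. */
struct toy_process {
    struct toy_file *fds[TOY_MAX_FDS];
};

/* Entry in the deferred "to be closed" list. */
struct close_entry {
    struct toy_file *file;
    struct close_entry *next;
};

/*
 * Revoke every descriptor referring to `inode`.
 * Returns the number of descriptors revoked, or -1 if an allocation failed
 * partway through (in which case some descriptors are already gone).
 */
int toy_revoke(struct toy_process *procs, size_t nprocs, int inode)
{
    struct close_entry *to_close = NULL;
    int revoked = 0;
    int status = 0;

    /* Pass 1: detach matching descriptors, deferring the actual close. */
    for (size_t p = 0; p < nprocs && status == 0; p++) {
        for (size_t fd = 0; fd < TOY_MAX_FDS; fd++) {
            struct toy_file *f = procs[p].fds[fd];
            if (f == NULL || f->inode != inode)
                continue;
            struct close_entry *e = malloc(sizeof(*e));
            if (e == NULL) {
                status = -1;   /* the partial-failure case */
                break;
            }
            e->file = f;
            e->next = to_close;
            to_close = e;
            procs[p].fds[fd] = NULL;   /* the descriptor is now unusable */
            revoked++;
        }
    }

    /* Pass 2: drop the references collected in pass 1. */
    while (to_close != NULL) {
        struct close_entry *e = to_close;
        to_close = e->next;
        assert(e->file->refcount > 0);
        e->file->refcount--;           /* zero means the file is closed */
        free(e);
    }
    return status == 0 ? revoked : -1;
}
```

One way to avoid the partial-failure case in a model like this would be to count the matching descriptors and allocate the whole list before detaching anything; whether that is practical in the kernel is a separate question.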
An alternative approach can be seen in the forced
unmount patch by Tigran Aivazian, which has been touched by a number of
other developers over its fairly long history
(its comments include a credit for the
port to the 2.6 kernel). This patch has a different final goal - being
able to unmount a filesystem regardless of any current activity - but it
must solve the same problem of revoking access to all files on the target
filesystem. Rather than clearing out file descriptors, this patch replaces
the underlying file structure with a new one from the "badfs"
filesystem. After this change, any attempted operations on the file will
return EIO. Memory mappings are cleared with a direct call to
munmap().
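
The substitution idea can be sketched in miniature. The code below is a user-space toy, not the forced unmount patch: it swaps only an operations table rather than a whole file structure, and the toy_file and toy_file_ops types are assumptions for illustration. It shows how a descriptor can stay valid while every operation on it starts failing with EIO.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_file;

/* Hypothetical, much-reduced operations table. */
struct toy_file_ops {
    long (*read)(struct toy_file *f, char *buf, size_t len);
};

struct toy_file {
    const struct toy_file_ops *ops;
};

/* A normal read: pretend to fill the buffer with zero bytes. */
static long normal_read(struct toy_file *f, char *buf, size_t len)
{
    (void)f;
    for (size_t i = 0; i < len; i++)
        buf[i] = 0;
    return (long)len;
}

/* The "bad" read: every call fails with EIO (as a negative errno value). */
static long bad_read(struct toy_file *f, char *buf, size_t len)
{
    (void)f;
    (void)buf;
    (void)len;
    return -EIO;
}

const struct toy_file_ops normal_ops = { normal_read };
const struct toy_file_ops bad_ops = { bad_read };

/* Revoke access by swapping the operations; the descriptor stays valid. */
void toy_make_bad(struct toy_file *f)
{
    assert(f != NULL);
    f->ops = &bad_ops;
}
```

Compared with zeroing descriptor table entries, this approach leaves the descriptor number in place, so the owning process sees I/O errors rather than an unexpectedly invalid descriptor.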
The final form of the patch may well be a combination of the two, providing
both forced unmount and revoke() functionality. In the process,
some of the remaining issues (such as how to perform safe locking without
slowing down the highly-optimized read() and write()
paths) will need to be worked out. But there is clearly demand for these
features, so this work will probably proceed to eventual inclusion in the
mainline.
Ulrich Drepper has been the maintainer of the core glibc library since
1995; he also represents the community to the POSIX standardization
effort. So, when Ulrich proposes a new user-space API, more than the
usual number of people are likely to listen. Ulrich has been putting his
mind to the problems of high-performance network I/O; the results were
presented at his Ottawa Linux Symposium talk.
The current POSIX APIs are, increasingly, not up to the task. The socket
abstraction has served us for a long time, but it is a synchronous
interface which is not well suited to zero-copy I/O. POSIX does provide an
asynchronous I/O interface, but it was never intended for use with
networking, and does not provide the requisite functionality. So it has
been clear for a while that something better is needed; the developers
working on network channels
have also been talking about the need for a new networking API.
There are three components to a new networking API, all of which will lead
to a more complex - but much more efficient - interface for
high-performance situations. The first of those is to address the need for
zero-copy I/O. As the data bandwidth through the system increases, the
cost of copying data (in CPU utilization and cache pressure) increases.
Much of this cost can be avoided by transferring data directly between the
network interface and buffers in user space. Direct user-space I/O
requires cooperation from both the kernel and the application, however.
Ulrich proposes the creation of an interface for the explicit management
of user-space DMA areas. Such an area would be created with a call that
looks something like:
int dma_alloc(dma_mem_t *handle, size_t size, int flags);
If all goes well, the result would be a memory area of the given
size, suitable for DMA purposes. Note that user space gets an
opaque handle type in return - there is, at this point, no virtual address
which is directly accessible to the application.
To use a DMA area for network I/O, the application must associate it with a
socket. The call for this operation would look like:
int dma_assoc(int socket, dma_mem_t handle, size_t size, int flags);
There is still the issue of actually managing memory within this DMA area.
An application which is generating data to send over the net would request
a buffer from the kernel with a call like:
int sio_reserve(dma_mem_t handle, void **buffer, size_t size);
If all goes well, the result will be a pointer (stored in *buffer)
to an area where the outgoing data can be constructed. For incoming data,
the application will receive a pointer to the buffer from the kernel (just
how is something we'll get to shortly); the application will own the given
buffer until it returns it to the kernel with:
int sio_release(dma_mem_t handle, size_t size);
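
To make the intended call sequence concrete, here is a user-space mock of the four calls. None of these functions exist in any kernel or library; the mock_ prefix, the layout of the handle, the bump-allocator behavior, and the extra mock_dma_free() helper are all assumptions for illustration. Ordinary malloc() memory stands in for a real DMA area.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Mock of the opaque handle; in the proposal its contents are hidden. */
struct mock_dma_area {
    unsigned char *base;
    size_t size;
    size_t used;      /* bytes currently reserved */
    int socket;       /* associated socket, or -1 */
};
typedef struct mock_dma_area *dma_mem_t;

/* Create an area of `size` bytes; flags are ignored in this mock. */
int mock_dma_alloc(dma_mem_t *handle, size_t size, int flags)
{
    (void)flags;
    assert(handle != NULL);
    struct mock_dma_area *a = malloc(sizeof(*a));
    if (a == NULL)
        return -1;
    a->base = malloc(size);
    if (a->base == NULL) {
        free(a);
        return -1;
    }
    a->size = size;
    a->used = 0;
    a->socket = -1;
    *handle = a;
    return 0;
}

/* Tie the area to a socket; only the descriptor number is recorded. */
int mock_dma_assoc(int socket, dma_mem_t handle, size_t size, int flags)
{
    (void)flags;
    if (size > handle->size)
        return -1;
    handle->socket = socket;
    return 0;
}

/* Hand out the next `size` bytes of the area (a simple bump allocator). */
int mock_sio_reserve(dma_mem_t handle, void **buffer, size_t size)
{
    if (size > handle->size - handle->used)
        return -1;
    *buffer = handle->base + handle->used;
    handle->used += size;
    return 0;
}

/* Return `size` bytes to the area, mirroring the size-only prototype. */
int mock_sio_release(dma_mem_t handle, size_t size)
{
    if (size > handle->used)
        return -1;
    handle->used -= size;
    return 0;
}

/* Not part of the proposal: tear down the mock area. */
void mock_dma_free(dma_mem_t handle)
{
    free(handle->base);
    free(handle);
}
```

An application would allocate an area once, associate it with a socket, and then reserve and release buffers within it for each message.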
Before an application can start to use asynchronous network I/O, however,
it must have a way to learn about the results of its operations. To that
end, Ulrich proposes the addition of an event reporting API to the
kernel. This mechanism, which he calls "event channels," would have an
interface like:
ec_t ec_create(int flags); /* Create a channel */
ec_next_event(); /* Get the next event */
ec_to_fd(); /* Send events to a file descriptor */
ec_delay(); /* Wait for an event directly */
The exact form of this interface (like all of those discussed here) is
subject to change. But the core idea is that it is a quick way for the
kernel to return notifications of events (such as I/O completions) to user
space. Most applications would be likely to use the file descriptor
interface, which would allow events to enter an application's main loop via
poll() or epoll_wait().
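
The file descriptor style of delivery can be approximated today with standard POSIX calls. In the sketch below a pipe stands in for the event channel's descriptor, and the toy_event layout and function names are assumptions for illustration; only pipe(), poll(), read(), and write() are real interfaces.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical event record; the proposal does not define its layout. */
struct toy_event {
    int kind;      /* e.g. 1 = "I/O completed" */
    int result;
};

/* Stand-in for the kernel posting an event to the channel's descriptor. */
int toy_post_event(int write_fd, const struct toy_event *ev)
{
    return write(write_fd, ev, sizeof(*ev)) == (ssize_t)sizeof(*ev) ? 0 : -1;
}

/*
 * One iteration of a main loop: wait up to timeout_ms for the descriptor
 * to become readable, then fetch a single event.
 * Returns 1 if an event was read, 0 on timeout, -1 on error.
 */
int toy_wait_event(int read_fd, struct toy_event *ev, int timeout_ms)
{
    struct pollfd pfd = { .fd = read_fd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);

    if (n <= 0)
        return n;              /* timeout (0) or poll() failure (-1) */
    if (!(pfd.revents & POLLIN))
        return -1;
    if (read(read_fd, ev, sizeof(*ev)) != (ssize_t)sizeof(*ev))
        return -1;
    return 1;
}
```

Because the events arrive on an ordinary descriptor, the same loop can wait on sockets, timers, and completion events together.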
The final step is to make some extensions to the existing POSIX asynchronous
I/O interface. The aiocb structure would be extended to include
an event channel descriptor; that channel would be used to report the
results of asynchronous operations back to user space. Then, an
application could initiate data transmission with a call like:
int aio_send(int socket, void *buffer, size_t size, int flags);
(One presumes there would be an aiocb argument as well, but
Ulrich's slides did not show one.) This call would start the process of
transmitting data from the given buffer, with completion likely
happening sometime after the call returns. For data reception, the call
would look like:
int aio_recv(int socket, void **buffer, size_t size, int flags);
The relevant point here is that buffer is a double pointer; the
kernel would pick the actual destination for the data and tell the calling
application where to look.
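
The double-pointer convention is easy to demonstrate in isolation. The mock below is not the proposed aio_recv(); the static pool, the fixed payload, and the toy_recv() name are assumptions for illustration. It shows only the calling pattern, in which the callee chooses the buffer and reports its address back.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical receive pool standing in for a kernel-managed DMA area. */
static unsigned char toy_pool[256];

/*
 * Mock receive: the callee, not the caller, decides where the data lives.
 * On success *buffer points into toy_pool and the byte count is returned.
 */
int toy_recv(void **buffer, size_t size)
{
    static const char payload[] = "hello";
    size_t n = sizeof(payload) - 1;

    assert(buffer != NULL);
    if (n > size)
        n = size;
    memcpy(toy_pool, payload, n);
    *buffer = toy_pool;        /* tell the caller where to look */
    return (int)n;
}
```

The caller passes the address of an uninitialized pointer and reads the data from wherever that pointer ends up, which is what lets the receiving side avoid a copy into a caller-supplied buffer.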
The result of all these changes would be a complete programming interface
for high-performance, asynchronous network I/O. As an added bonus, the use
of an event channel interface would simplify the work of porting
applications from other operating systems.
All of these interfaces, says Ulrich, are simply a proposal and subject to
massive change. The core purpose is to allow applications to get their
work done while giving the kernel the greatest possible latitude to
optimize the data transfers. This proposal is not the only one out there;
Evgeniy Polyakov's kevent
proposal is similar in many ways, though it does not have the explicit
management of DMA areas. It may be some time before something is actually
adopted - a new API will stay around for many years and should not be added
in haste - but the discussion is getting started in earnest.
Page editor: Jonathan Corbet