
Kernel development

Brief items

Kernel release status

The current stable 2.6 kernel is 2.6.17.7, released on July 24. This one adds a relatively long list of fixes for problems with networking, sound, and several other areas.

The current 2.6 prepatch remains 2.6.18-rc2. Fixes continue to accumulate in the mainline git repository, and the -rc3 release can be expected sometime soon.

There have been no -mm releases since 2.6.18-rc1-mm2 on July 14.


Kernel development news

Quote of the week

Linux has a problem, which is that with success it is attracting people with more skill than what it started with, and it is not doing a very good job of handling that. In fact, it downright stinks at it, behaving in the worst way it could choose for handling that. We have lost quite a number of FS developers who just don't want to deal with people who know less than they do but are obnoxious and disrespectful to submissions because they enjoy powertripping.

-- Hans Reiser


Reconsidering network channels

When Van Jacobson presented his network channels idea at linux.conf.au last January, he set a bit of a fire in the Linux networking community. By making some significant changes to the processing path for incoming packets, and by pushing most of the work as close as possible to the destination application, Van was able to achieve significant performance improvements - eliminating as much as 80% of the processing overhead on multiprocessor systems. With numbers like that, it seemed like the question of whether Linux would incorporate channels need not even be asked.

Since then, however, reality has begun to make itself felt - something which reality is wont to do, sooner or later. Which is why David Miller's latest pronouncement on network channels reads like this:

Don't get too excited about VJ netchannels, more and more roadblocks to their practicality are being found every day.... All the original costs of route, netfilter, TCP socket lookup all reappear as we make VJ netchannels fit all the rules of real practical systems, eliminating their gains entirely.

The issue at hand had to do with the integration of channels and netfilter. The hope had been that packets could be identified and sorted into their respective channels before the netfilter (firewall) processing was done. Then said processing could be performed close to the application, on the same processor. It turns out, however, that netfilter can change the real destination of the packet. So packets must be filtered before entering a channel, and much of the performance benefit of using a channel is lost.
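The ordering problem can be shown with a toy model. The sketch below is not kernel code; every name in it (struct packet, struct dnat_rule, classify_channel() and so on) is hypothetical, and "one channel per port, chosen by modulus" is an assumption made only to keep the example small. It shows that a packet classified before a destination-rewriting rule runs can land in a different channel than the same packet classified afterward.

```c
#include <stdint.h>

/* Toy model: every name here is hypothetical, none of it is kernel code. */
struct packet {
    uint16_t dst_port;   /* destination port as seen on the wire */
};

/* A single DNAT-style rule: rewrite one destination port to another. */
struct dnat_rule {
    uint16_t match_port;
    uint16_t new_port;
};

/* Pick a channel purely from the destination port. */
int classify_channel(const struct packet *p, int nchannels)
{
    return p->dst_port % nchannels;
}

/* Apply the rule in place, as a firewall rewriting the destination would. */
void apply_dnat(struct packet *p, const struct dnat_rule *r)
{
    if (p->dst_port == r->match_port)
        p->dst_port = r->new_port;
}

/* Channel chosen when classification happens before filtering. */
int channel_before_filter(struct packet p, int nchannels)
{
    return classify_channel(&p, nchannels);
}

/* Channel chosen when filtering happens first. */
int channel_after_filter(struct packet p, const struct dnat_rule *r,
                         int nchannels)
{
    apply_dnat(&p, r);
    return classify_channel(&p, nchannels);
}
```

With a rule that rewrites port 80 to port 8081 and four channels, the two functions disagree about where the packet belongs, which is why the filtering step cannot simply be deferred until after the channel has been chosen.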

Alexey Kuznetsov has posted a detailed criticism of channels, asserting that most of the claimed benefits are illusory. Says Alexey:

It is an amazing toy. But I see nothing, which could promote its status to practical. Exokernels used to do this thing for ages, and all the performance gains are compensated by overcomplicated classification engine, which has to remain in kernel and essentially to do the same work which routing/firewalling/socket hash tables do.

Finally, it seems that many of the benefits of channels can be had by carefully taking advantage of the capabilities of modern hardware. In particular, an increasing number of devices can perform simple packet classification and (via targeted interrupts) direct packets to the CPU where the destination application is running. That technique will get rid of the cache misses caused by performing interrupt processing on one processor and protocol processing on another.

In the end, it appears that yet another seemingly bright scheme may not make the transition into real-world deployments. Some of its core concepts, such as using cache-friendly data structures and trying (even harder) to improve cache locality, will likely influence the future direction of the network stack, however. So, while there may not be a revolutionary new mechanism in the network stack's future, some of the promised performance improvements should eventually be realized anyway. And, as David says, "At least, there is less code to write."


revoke() and frevoke()

A system call found in some Unix variants is revoke():

    int revoke(const char *path);

This call exists to disconnect processes from files; when called with a given path, it will shut down all open file descriptors which refer to the file found at the end of that path. Its initial purpose was to defeat people writing programs that would sit on a serial port and pretend to be login. As soon as revoke() was called with the device file corresponding to the serial port, any login spoofer would find itself disconnected from the port and unable to fool anybody. Other potential uses exist as well; consider, for example, disconnecting a process from a file which is preventing the unmounting of a filesystem.

Linux has never had this system call, but this situation could change before too long; Pekka Enberg has posted an implementation of revoke() for review. Pekka has also added a second version:

    int frevoke(int fd);

This version, of course, takes an open file descriptor as its argument. In either case, the calling process must either own the file, or it must be able to override file permissions. So revoke() gives a process the ability to yank an open file out from underneath processes owned by other users, as long as the calling process owns the file in question.

Getting this operation right can be a little tricky, with the result that the current implementation makes some compromises which may not sit well with other developers. The process, simplified, is this:

  • The code loops through every process on the system; for each process, it iterates through the open file table looking for file descriptors corresponding to the file being revoked. Every time it finds one, it zeroes out the file descriptor entry (making that descriptor unavailable to its erstwhile owner). The file is not actually closed, however; instead, a list of files to be closed is created for later action.

    All of this will be rather slow, but that should not be a huge problem: revoke() is not a performance-critical operation. The memory allocation (to add an entry to the list of files to close) is a bit more problematic; if it fails, revoke() will abort partway through, having done an unknown amount of damage without having accomplished its goal.

  • Once all open file descriptors have been shut down, the files themselves can be closed. So revoke() steps through the list it created, closing each open file.

  • There is one sticky little problem remaining: some processes may have used mmap() to map the file into their address spaces. The revoke() call clearly has to do something about those memory areas, or it will not have completed the job. So a pass through all of the virtual memory areas associated with the file is required; for each one, the nopage() method is set to a special version which returns an error status.

    That change will keep a process from faulting in new pages from the revoked file, but does nothing about the pages which are already part of the process's address space. To fix those, it is necessary to wander through the page tables of each process having mapped the file, clearing out any page table entries referring to pages from that file.
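The first two steps can be sketched in user space. What follows is a toy model, not Pekka's patch: struct file, struct process, and revoke_inode() are hypothetical stand-ins, and files are identified by a plain integer. The sketch also sidesteps the allocation problem by using a fixed-size close list, which the real code cannot do, and it leaves out the memory-mapping pass entirely.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FDS   8
#define MAX_PROCS 4

/* Toy stand-ins for kernel objects; all names here are hypothetical. */
struct file {
    int inode;      /* identifies the underlying file */
    int open;       /* 1 while the file is open, 0 once closed */
};

struct process {
    struct file *fds[MAX_FDS];  /* NULL means the descriptor is unused */
};

/*
 * Revoke every descriptor that refers to the given inode.
 * Pass 1 clears the descriptor slots and remembers the files;
 * pass 2 closes the remembered files.
 * Returns the number of descriptors revoked.
 */
int revoke_inode(struct process *procs, size_t nprocs, int inode)
{
    struct file *to_close[MAX_PROCS * MAX_FDS];
    size_t nclose = 0;

    assert(nprocs <= MAX_PROCS);    /* keeps the close list in bounds */

    for (size_t p = 0; p < nprocs; p++) {
        for (size_t fd = 0; fd < MAX_FDS; fd++) {
            struct file *f = procs[p].fds[fd];

            if (f != NULL && f->inode == inode) {
                procs[p].fds[fd] = NULL;    /* descriptor is now unusable */
                to_close[nclose++] = f;     /* defer the actual close */
            }
        }
    }

    for (size_t i = 0; i < nclose; i++)
        to_close[i]->open = 0;

    return (int)nclose;
}
```

Descriptors for other files are left alone, and no file is closed until every matching descriptor slot has been cleared.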

An alternative approach can be seen in the forced unmount patch by Tigran Aivazian, which has been touched by a number of other developers over its fairly long history (its comments include a credit for the port to the 2.6 kernel). This patch has a different final goal - being able to unmount a filesystem regardless of any current activity - but it must solve the same problem of revoking access to all files on the target filesystem. Rather than clearing out file descriptors, this patch replaces the underlying file structure with a new one from the "badfs" filesystem. After this change, any attempted operations on the file will return EIO. Memory mappings are cleared with a direct call to munmap().
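The substitution idea can be illustrated with a small model. This is only a sketch under simplifying assumptions: the types and functions below (struct file_ops, bad_read(), revoke_by_swap()) are hypothetical, and the model swaps a table of operations rather than replacing the whole file structure as the patch is described as doing. The effect seen by the holder of the descriptor is the same in spirit: the descriptor remains valid, but every operation on it fails with EIO.

```c
#include <errno.h>
#include <stddef.h>

/* Toy model of a file with a table of operations; names are hypothetical. */
struct file;

struct file_ops {
    long (*read)(struct file *f, char *buf, size_t len);
};

struct file {
    const struct file_ops *ops;
};

/* A normal read that fills the buffer and reports len bytes. */
long normal_read(struct file *f, char *buf, size_t len)
{
    (void)f;
    for (size_t i = 0; i < len; i++)
        buf[i] = 'x';
    return (long)len;
}

/* The "bad" read: every call fails with a negative errno, kernel-style. */
long bad_read(struct file *f, char *buf, size_t len)
{
    (void)f;
    (void)buf;
    (void)len;
    return -EIO;
}

const struct file_ops normal_ops = { .read = normal_read };
const struct file_ops bad_ops    = { .read = bad_read };

/* Revoke access by swapping the operations; descriptors stay in place. */
void revoke_by_swap(struct file *f)
{
    f->ops = &bad_ops;
}
```

Compared with clearing descriptor slots, this approach needs no list of files to close and therefore no memory allocation in the middle of the operation.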

The final form of the patch may well be a combination of the two, providing both forced unmount and revoke() functionality. In the process, some of the remaining issues (such as how to perform safe locking without slowing down the highly-optimized read() and write() paths) will need to be worked out. But there is clearly demand for these features, so this work will probably proceed to eventual inclusion in the mainline.


OLS: A proposal for a new networking API

Ulrich Drepper has been the maintainer of the core glibc library since 1995; he also represents the community to the POSIX standardization effort. So, when Ulrich proposes a new user-space API, more than the usual number of people are likely to listen. Ulrich has been putting his mind to the problems of high-performance network I/O; the results were presented at his Ottawa Linux Symposium talk.

The current POSIX APIs are, increasingly, not up to the task. The socket abstraction has served us for a long time, but it is a synchronous interface which is not well suited to zero-copy I/O. POSIX does provide an asynchronous I/O interface, but it was never intended for use with networking, and does not provide the requisite functionality. So it has been clear for a while that something better is needed; the developers working on network channels have also been talking about the need for a new networking API.

There are three components to a new networking API, all of which will lead to a more complex - but much more efficient - interface for high-performance situations. The first of those is to address the need for zero-copy I/O. As the data bandwidth through the system increases, the cost of copying data (in CPU utilization and cache pressure) increases as well. Much of this cost can be avoided by transferring data directly between the network interface and buffers in user space. Direct user-space I/O requires cooperation from both the kernel and the application, however.

Ulrich proposes the creation of an interface for the explicit management of user-space DMA areas. Such an area would be created with a call that looks something like:

    int dma_alloc(dma_mem_t *handle, size_t size, int flags);

If all goes well, the result would be a memory area of the given size, suitable for DMA purposes. Note that user space gets an opaque handle type in return - there is, at this point, no virtual address which is directly accessible to the application.

To use a DMA area for network I/O, the application must associate it with a socket. The call for this operation would look like:

    int dma_assoc(int socket, dma_mem_t handle, size_t size, int flags);

There is still the issue of actually managing memory within this DMA area. An application which is generating data to send over the net would request a buffer from the kernel with a call like:

    int sio_reserve(dma_mem_t handle, void **buffer, size_t size);

If all goes well, the result will be a pointer (stored in *buffer) to an area where the outgoing data can be constructed. For incoming data, the application will receive a pointer to the buffer from the kernel (just how is something we'll get to shortly); the application will own the given buffer until it returns it to the kernel with:

    int sio_release(dma_mem_t handle, size_t size);

Before an application can start to use asynchronous network I/O, however, it must have a way to learn about the results of its operations. To that end, Ulrich proposes the addition of an event reporting API to the kernel. This mechanism, which he calls "event channels," would have an interface like:

    ec_t ec_create(int flags); /* Create a channel */
    ec_next_event();           /* Get the next event */
    ec_to_fd();                /* Send events to a file descriptor */
    ec_delay();                /* Wait for an event directly */

The exact form of this interface (like all of those discussed here) is subject to change. But the core idea is that it is a quick way for the kernel to return notifications of events (such as I/O completions) to user space. Most applications would be likely to use the file descriptor interface, which would allow events to enter an application's main loop via poll() or epoll_wait().
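As a rough illustration of the file descriptor path, the sketch below waits for one event with poll() and then reads it. The event channel calls themselves do not exist, so any readable descriptor (a pipe works for experimentation) stands in for whatever ec_to_fd() would provide. The struct toy_event layout and the wait_for_event() helper are assumptions made for this example; the proposal does not specify an event format.

```c
#include <poll.h>
#include <unistd.h>

/* Hypothetical event record; the proposal does not define a layout. */
struct toy_event {
    int op_id;     /* which operation completed */
    int result;    /* its result code */
};

/*
 * Wait up to timeout_ms for one event on fd and read it into *ev.
 * Returns 1 if an event was read, 0 on timeout, -1 on error.
 */
int wait_for_event(int fd, struct toy_event *ev, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);

    if (n < 0)
        return -1;
    if (n == 0)
        return 0;               /* nothing arrived before the timeout */
    if (!(pfd.revents & POLLIN))
        return -1;              /* hangup or error on the descriptor */
    if (read(fd, ev, sizeof *ev) != (ssize_t)sizeof *ev)
        return -1;              /* short read: treat as an error here */
    return 1;
}
```

An application with an existing poll()-based main loop would simply add the event descriptor to the set it already watches, which is the appeal of this interface over a dedicated wait call.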

The final step is to make some extensions to the existing POSIX asynchronous I/O interface. The aiocb structure would be extended to include an event channel descriptor; that channel would be used to report the results of asynchronous operations back to user space. Then, an application could initiate data transmission with a call like:

    int aio_send(int socket, void *buffer, size_t size, int flags);

(One presumes there would be an aiocb argument as well, but Ulrich's slides did not show one.) This call would start the process of transmitting data from the given buffer, with completion likely happening sometime after the call returns. For data reception, the call would look like:

    int aio_recv(int socket, void **buffer, size_t size, int flags);

The relevant point here is that buffer is a double pointer; the kernel would pick the actual destination for the data and tell the calling application where to look.
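The double-pointer convention can be mocked up in a few lines. Nothing below is part of the proposal: struct toy_pool and toy_recv() are hypothetical, the pool is an ordinary array rather than a DMA area, and the memcpy() merely stands in for the hardware depositing data. The point is only the calling convention, in which the callee chooses the location and reports it through *buffer.

```c
#include <stddef.h>
#include <string.h>

#define POOL_SIZE 256

/* Toy stand-in for a DMA area owned by the "kernel" side; hypothetical. */
struct toy_pool {
    char   data[POOL_SIZE];
    size_t used;
};

/*
 * Mock receive: the callee chooses where the payload lands inside the
 * pool and reports that location through *buffer. Returns the number
 * of bytes delivered, or -1 if the pool has no room.
 */
long toy_recv(struct toy_pool *pool, void **buffer,
              const char *payload, size_t len)
{
    if (len > POOL_SIZE - pool->used)
        return -1;

    char *dst = pool->data + pool->used;   /* callee picks the destination */
    memcpy(dst, payload, len);             /* stands in for the device */
    pool->used += len;

    *buffer = dst;                         /* caller learns where to look */
    return (long)len;
}
```

Because the caller never supplies a destination, the receiving side is free to place data wherever is cheapest, which is the latitude the proposed interface is meant to give the kernel.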

The result of all these changes would be a complete programming interface for high-performance, asynchronous network I/O. As an added bonus, the use of an event channel interface would simplify the work of porting applications from other operating systems.

All of these interfaces, says Ulrich, are simply a proposal and subject to massive change. The core purpose is to allow applications to get their work done while giving the kernel the greatest possible latitude to optimize the data transfers. This proposal is not the only one out there; Evgeniy Polyakov's kevent proposal is similar in many ways, though it does not have the explicit management of DMA areas. It may be some time before something is actually adopted - a new API will stay around for many years and should not be added in haste - but the discussion is getting started in earnest.


Patches and updates

Kernel trees

Greg KH: Linux 2.6.17.7

Core kernel code

Development tools

Junio C Hamano: GIT 1.4.1.1

Device drivers

Documentation

Rafael J. Wysocki: swsusp status report

Filesystems and block I/O

Janitorial

Stephen Hemminger: mark sk98lin driver for removal

Networking

Security-related

Kylene Jo Hall: SLIM main patch

Miscellaneous

ricknu-0@student.ltu.se: A generic boolean (version 4)
ricknu-0@student.ltu.se: A generic boolean (version 5)

Page editor: Jonathan Corbet


Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds