
Kernel development

Brief items

Kernel release status

The current stable 2.6 release is 2.6.16.5. 2.6.16.2 was released on April 7 with a fair number of fixes; 2.6.16.3 and 2.6.16.4 - each containing a single security fix - both came out on April 11, and 2.6.16.5, with a pair of x86-64 fixes, was released on April 12.

The current 2.6 prepatch remains 2.6.17-rc1; no prepatches have been released over the last week. 2.6.17-rc2 would appear to be imminent, however, and may be out by the time you read this.

The patches merged since 2.6.17-rc1 are mostly fixes, but there are a few more substantive changes, including a simplified form of the scheduler starvation avoidance patch, some tweaks to the memory overcommit algorithm, the removal of the obsolete blkmtd driver, the removal of the unmaintained Sangoma WAN drivers, and a new "kernel internal pipe" object (and other changes) for the splice() system call.

The current -mm tree is 2.6.17-rc1-mm2. Recent changes to -mm include a new set of patches for 64-bit resource tracking and some core dump code tweaks.


Kernel development news

Containers and lightweight virtualization

"Virtualization" is the act of making a set of processes believe that it has a dedicated system to itself. There are a number of approaches being taken to the virtualization problem, with Xen, VMWare, and User-mode Linux being some of the better-known options. Those are relatively heavy-weight solutions, however, with a separate kernel being run for each virtual machine. Often, that is exactly the right solution to the problem; running independent kernels gives strong separation between environments and enables the running of multiple operating systems on the same hardware.

Full virtualization and paravirtualization are not the only approaches being taken, however. An alternative is lightweight virtualization, generally based on some sort of container concept. With containers, a group of processes still appears to have its own dedicated system, but it is really running in a specially isolated environment. All containers run on top of the same kernel. With containers, the ability to run different operating systems is lost, as is the strong separation between virtual systems. Thus, one might not want to give root access to processes running within a container environment. On the other hand, containers can have considerable performance advantages, enabling large numbers of them to run on the same physical host.

There is no shortage of container-oriented projects. These range from relatively simple efforts like the BSD jail module to more thorough efforts like Linux-VServer, OpenVZ, and the proprietary Virtuozzo (based on OpenVZ) offering. Many of these projects would like to get at least some of their code into the kernel and shed the load of carrying out-of-tree patches. There is little interest, however, in merging code which only supports some of these projects. The container people are going to have to get together and work out some common solutions which they can all use.

It appears that this is exactly what the container developers are doing. A loose agreement has been put in place wherein developers from a few projects will discuss proposed changes and jointly work them into a form where they meet everybody's needs. Once a particular patch has reached a point where all of the developers are willing to sign off on it, it can be forwarded for eventual merging into the mainline.

The more complex and intrusive changes, such as PID virtualization, appear to be on hold for now. Instead, it looks like the first jointly-agreed patch might be the UTS namespace virtualization patch. The aim of the patch is relatively straightforward: it allows each container (as represented by a family tree of processes) to have its own version of the utsname structure, which holds the node name, domain name, operating system version, and a few other things. In essence, it replaces a single global structure with multiple structures attached at various places in the process tree. It still requires a five-part patch, with every reference to the global system_utsname structure replaced by a call to the new utsname() function.
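The idea can be modeled in a few lines of user-space C. This is only a sketch of the concept, not the code from the patch: the uts_model and task_model structures and every function name below are hypothetical, and the "unshare" step is an assumption about how a container might obtain its private copy.

```c
#include <assert.h>
#include <string.h>

#define UTS_LEN 65

/* Hypothetical, cut-down stand-in for the kernel's utsname structure. */
struct uts_model {
    char nodename[UTS_LEN];
    char domainname[UTS_LEN];
};

/* Hypothetical task: each one points at the namespace it lives in. */
struct task_model {
    struct uts_model *uts;
};

/* The accessor that replaces direct references to a single global. */
struct uts_model *utsname_model(const struct task_model *task)
{
    assert(task->uts != NULL);
    return task->uts;
}

/* A new child starts out sharing its parent's namespace. */
void fork_model(struct task_model *child, const struct task_model *parent)
{
    child->uts = parent->uts;
}

/* Starting a container: copy the current values into private storage. */
void unshare_uts_model(struct task_model *task, struct uts_model *storage)
{
    *storage = *utsname_model(task);
    task->uts = storage;
}

/* Set the node name seen by this task's namespace only. */
void set_nodename_model(struct task_model *task, const char *name)
{
    struct uts_model *uts = utsname_model(task);

    strncpy(uts->nodename, name, UTS_LEN - 1);
    uts->nodename[UTS_LEN - 1] = '\0';
}
```

Because every lookup goes through the accessor, a task which has been given its own copy can change its node name without the rest of the system noticing; tasks which never unshare continue to see the original structure.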

Longer-range plans call for the virtualization of every global namespace in the kernel, including SYSV IPC, process IDs, and even netfilter rules. There was an interesting discussion on the virtualization of security modules; some think that each container should be able to load its own security policy, while others argue in favor of a single system security policy which is aware of (and able to use) containers. Unsurprisingly, SELinux is already equipped with a type hierarchy mechanism which can be used with containers in the single-policy approach.

Containers might still prove to be a hard sell with some developers, who will see them as complicating access to many internal kernel data structures without adding a whole lot of value. It is clear, however, that there is a demand for this sort of lightweight virtualization - OpenVZ, alone, claims to be running over 300,000 virtual environments. So the pressure to standardize this code and move it into the mainline will only grow over time. Once they are clean enough to satisfy the development community, pieces of the container concept are likely to be merged.


tee() with your splice()?

The new splice() system call was covered here last week. As was predicted then, this new kernel API has continued to evolve; many of the non-fix patches going into the post-2.6.17-rc1 mainline involved changes to splice().

For starters, the prototype of the splice() system call has changed:

    long splice(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out,
                size_t len, unsigned int flags);

The two offset values (off_in and off_out) are new; they indicate where each file descriptor should be positioned prior to beginning the transfer of data. Note that these offsets are passed via pointers; user space can use a NULL pointer to indicate that the current offset should be used. Note also that these offsets do not work like the offsets in pread() or pwrite(): they will actually change the offset associated with the file descriptor. Providing an offset for a file descriptor associated with a pipe is an error.

Internally, the splice() code has seen a couple of interesting changes. One of them (in the regular pipe code, actually) is the creation of a new pipe_inode_info structure to represent the core machinery behind a pipe. This structure can stand apart from the normal inode structure. Many of the internal interfaces have been changed to use the new structure, including the new methods in the file_operations structure:

    ssize_t (*splice_write)(struct pipe_inode_info *pipe,
                            struct file *out, size_t len,
                            unsigned int flags);
    ssize_t (*splice_read)(struct file *in, struct pipe_inode_info *pipe,
                           size_t len, unsigned int flags);

Since there are still few implementations of these methods in the kernel, the changes are not particularly disruptive.

Next in the list is support for directly splicing two file descriptors where neither is a pipe. This functionality is not (yet) available to user space via splice(), but it is used internally to implement sendfile() with the splice() mechanism. The direct splicing is implemented using a hidden pipe_inode_info structure (i.e. a pipe); it is stored in the new splice_pipe field of the task structure, so each process can only have one such connection running at any given time.
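User space can get a similar effect today by supplying the intermediate pipe itself. What follows is a minimal sketch, not the kernel's implementation: relay_copy() is a hypothetical helper, and it assumes a C library wrapper for splice() matching the prototype above (glibc provides one in <fcntl.h> when _GNU_SOURCE is defined). NULL offsets are passed throughout, so each descriptor's current position is used.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Copy up to len bytes from in_fd to out_fd (neither of which is a pipe)
 * by relaying the data through a private pipe. Returns the number of
 * bytes copied, or -1 on error.
 */
ssize_t relay_copy(int in_fd, int out_fd, size_t len)
{
    int relay[2];
    ssize_t total = 0;

    assert(in_fd != out_fd);
    if (pipe(relay) < 0)
        return -1;

    while (len > 0) {
        /* Stage 1: file -> pipe; returns at most one pipe's worth. */
        ssize_t in = splice(in_fd, NULL, relay[1], NULL, len, SPLICE_F_MOVE);

        if (in < 0) {
            total = -1;
            break;
        }
        if (in == 0)
            break;              /* end of input */
        len -= (size_t)in;

        /* Stage 2: pipe -> file, until the pipe has been drained. */
        while (in > 0) {
            ssize_t out = splice(relay[0], NULL, out_fd, NULL,
                                 (size_t)in, SPLICE_F_MOVE);

            if (out <= 0) {
                total = -1;
                goto done;
            }
            in -= out;
            total += out;
        }
    }
done:
    close(relay[0]);
    close(relay[1]);
    return total;
}
```

The private pipe plays the role of the hidden pipe_inode_info structure described above; since this sketch creates and destroys it on every call, it makes no attempt to match the kernel's trick of caching one per process.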

One patch which has not been merged - and will likely wait until 2.6.18 at this point - is the tee() system call:

    long tee(int fdin, int fdout, size_t len, unsigned int flags);

This call requires that both file descriptors be pipes. It simply connects fdin to fdout, transferring up to len bytes between them. Unlike splice(), however, tee() does not consume the input, enabling the input data to be read normally later on by the calling process. Jens Axboe provides an example implementation of the user-space tee utility, which comes down to a couple of calls:

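    /* Duplicate up to INT_MAX bytes from stdin to stdout without consuming them. */
    len = tee(STDIN_FILENO, STDOUT_FILENO, INT_MAX, SPLICE_F_NONBLOCK);
    /* Consume the same bytes into out_file; NULL offsets use the current positions. */
    splice(STDIN_FILENO, NULL, out_file, NULL, len, 0);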

The input data will be written both to the standard output and the file represented by out_file without ever being copied to or from user space. To be sure of copying the entire input data stream, the application must perform the above calls within a loop, of course; see the full example at the end of the tee() patch.
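A loop of that sort might look like the following. This is a sketch under assumptions rather than the example from the patch: tee_to_file() is a hypothetical helper, the calls are made in blocking mode for simplicity, and C library wrappers named tee() and splice() with the prototypes given above are assumed (glibc exposes such wrappers in <fcntl.h> when _GNU_SOURCE is defined).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <unistd.h>

/*
 * Duplicate everything arriving on in_pipe to out_pipe and save a copy
 * in out_file. Both in_pipe and out_pipe must be pipes. Returns the
 * number of bytes handled, or -1 on error.
 */
ssize_t tee_to_file(int in_pipe, int out_pipe, int out_file)
{
    ssize_t total = 0;

    assert(in_pipe != out_pipe);
    for (;;) {
        /* Duplicate whatever is in the pipe without consuming it. */
        ssize_t len = tee(in_pipe, out_pipe, INT_MAX, 0);

        if (len < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (len == 0)
            break;              /* all writers have closed the pipe */

        /* Now consume those same bytes by splicing them into the file. */
        while (len > 0) {
            ssize_t n = splice(in_pipe, NULL, out_file, NULL,
                               (size_t)len, SPLICE_F_MOVE);

            if (n <= 0)
                return -1;
            len -= n;
            total += n;
        }
    }
    return total;
}
```

Calling tee_to_file(STDIN_FILENO, STDOUT_FILENO, out_file) gives the behavior of the two-line version, repeated until the input is exhausted. Note that the output file must not have been opened in append mode for the splice() step to succeed.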

This call is quite new, and may well change before it makes it into the mainline. Among other things, it might get a new name, since something as simple as tee() may already be in use in a number of applications.


The 2006 Wireless Networking Summit

Once upon a time, setting up a Linux system was a long and problematic process, with no assurance that a given system would work without great pain. Most of those problems have been overcome for years, and, to a great extent, a system can be expected to "just work" with Linux. A few problematic areas remain, however, and wireless networking is one of them. Even when the available hardware is supported (often not the case), making a wireless connection work in a fully-featured way can often be a challenge.

A lot is happening in the wireless area, however. To help things along, the folks at the Open Source Development Labs hosted a summit for wireless networking developers on April 6 and 7 at OSDL headquarters in Portland. This event brought together a diverse mix of developers from around the world, many of whom had never met before. Its purpose - to chart a path forward for the creation of a reasonable Linux wireless networking implementation - appeared to have been largely achieved.

Your editor was fortunate enough to be able to attend this meeting. The following report is an attempt to summarize the conclusions from the summit - it is not a set of detailed minutes, and your editor will engage in some chronological reordering along the way. Hopefully the result will provide a sense for where things stand, and where they are likely to go in the near future.

History

As John Linville (the recently-named wireless networking maintainer) noted in a conversation with your editor, early wireless adapters were marketed as "wireless Ethernet," and Linux kernel developers treated them as a sort of slow, unreliable, fiddly Ethernet adapter. But wireless is not Ethernet in any way - it is a completely different set of networking standards with its own quirks, special features, and distinct needs. Treating wireless as a form of Ethernet slowed support for those special features, and, more importantly, impeded the development of any sort of internal kernel support for wireless. Each developer who set out to write a driver for a wireless adapter ended up implementing everything from the beginning. So there was no general wireless API, no comprehensive support of wireless features, and a great deal of divergence and duplication of code between drivers.

In 1997, Jean Tourrilhes decided to do something about this situation. The result was WE-1 - the first version of the Linux wireless extensions. There was still no 802.11 standard at the time, but the WE API enabled the configuration and operation of wireless adapters with a single set of tools. Jean's wireless tools are still the core utilities for managing wireless adapters, though the graphical interfaces are replacing the wireless tools for most users.

Development of the wireless extensions continued, with WE-9 - supporting the new 802.11 standard - being released in 1999. WE-18, merged last year, added support for WPA ("WiFi Protected Access"). The current revision, WE-20, adds a new, netlink-based interface as a future replacement for the current ioctl() API.

Though development continues, there appears to be a general, shared feeling that the wireless extensions are heading toward the end of their useful life. A replacement API - which does not exist yet - would work with the entire wireless networking stack, rather than being an interface directly to the low-level drivers. Regardless of how that plays out, however, the wireless extensions are likely to be around for a long time to come.

The current status

The current effort to create a proper wireless stack for Linux started in 2004, when Jeff Garzik announced the creation of a special wireless tree, initially seeded with the HostAP code. The merging of HostAP enabled support of some relatively current networking cards and the use of a Linux system as a wireless access point. The creation of this tree did help to get things going, but HostAP has not turned out to be everything that had been hoped for. In particular, there is no support in HostAP for cards which need software MAC ("softmac") implementations. But many contemporary cards rely upon the host software for many low-level operations; these cards cannot be supported by the wireless stack found in the current (2.6.16) kernel.

The result is, as John Linville put it, a Linux wireless implementation which supports "anything which is obsolete." Some cards are supported, some better than others. Most noteworthy among current hardware is the set of Intel IPW drivers, which, thanks to Intel, have very good support in the kernel - but these adapters do not need softmac support.

What is lacking, at this point, is a small list of mildly desirable features, including support for much widely-used hardware. Ease of use is also lacking - despite improvements in the graphical tools, configuring a wireless connection can still be a painful procedure. Perhaps the best demonstration of these two problems was to be found at the summit itself, where about 25% of the participants ended up using an Ethernet cable to plug directly into the OSDL network.

Other problems include consistency (or the lack thereof) across hardware - there are still a number of adapter-specific APIs in the kernel and in the out-of-tree drivers. The documentation of APIs is, well, nonexistent; a complaint was heard that Linux Device Drivers does not describe how to write a wireless driver. There is no coordinated process for extending APIs. Quality of service support is not present - an issue we'll return to shortly. There are no driver test suites in general circulation. And the whole regulatory issue looms over the wireless networking arena, and is the largest single cause of out-of-tree (or nonexistent) wireless drivers. Many vendors simply do not feel that they can release programming information or free drivers and remain compliant with regulatory regimes worldwide.

Meanwhile, the upcoming 2.6.17 kernel will see some improvements in its wireless support. John merged one of the many softmac implementations out there, on the theory that it was one of the most active projects and that it would help to support driver development. The bcm43xx (Broadcom) driver, which uses softmac, was also merged, and there are a couple of other softmac-based drivers under development. Even so, the consensus appears to be that softmac is not the way forward; that, instead, the Devicescape stack is the real future of Linux wireless.

Devicescape

Devicescape is a company which offers a number of products and services around wireless networking. In developing its offerings, Devicescape created its own, Linux-based 802.11 stack with a number of nice features - including good softmac and WPA support. This stack was recently released under the GPL and has been fixed up for the kernel by Jiri Benc. It is regarded by many as being the best of the available free stacks.

When Jeff Garzik maintained the wireless tree, he took a firm position against moving to the Devicescape stack, stating instead that the in-kernel code should be evolved toward the needed capabilities. He appears to have found himself in the minority, however, and John Linville seems poised to merge this stack for a future kernel release. He maintains a separate development tree which includes Devicescape, and some drivers (notably bcm43xx) have been ported to this stack. Nobody at the summit was heard to argue against merging Devicescape.

Devicescape hacker Simon Barber talked about this code for a bit, and a separate breakout session addressed it as well. This stack is a large body of code. The freely-released code available now includes the 802.11 stack, the "openap" access point code, and a link-layer bridging module. Work which will be released soon includes improvements to the hostapd daemon (802.11g support, among other things; this code is being merged now), bridging and VLAN integration, and various improvements to Ethereal for wireless developers. There is also "a complete home gateway distribution" in the works. There is the inevitable web portal being put together to provide access to all this code.

Quite a bit of work is foreseen for the Devicescape stack. It is composed, internally, of a long list of handler functions which deal with frames (both data packets and 802.11 management frames) on their way to and from the adapter. Future plans call for enabling loadable modules to plug in their own handler functions. More of the management code may also eventually be moved out to user space. To that end, some additional management capabilities will be added to the hostapd daemon, which handles authentication and management tasks. Merging hostapd with wpa_supplicant, which handles the client side of the authentication process, is envisioned; evidently a number of things become easier when the two functions are merged into the same process.
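The handler-list design can be illustrated with a small, self-contained model. None of the names below come from the Devicescape code; the result codes, the frame_model structure, and the registration function are all hypothetical, chosen only to show how a chain of handlers might process a frame and how a module could add one of its own.

```c
#include <assert.h>
#include <stddef.h>

enum handler_result {
    HANDLER_CONTINUE,   /* pass the frame to the next handler */
    HANDLER_DROP,       /* discard the frame */
    HANDLER_CONSUMED    /* frame fully dealt with; stop here */
};

/* Hypothetical frame descriptor; a real stack would carry far more state. */
struct frame_model {
    int is_management;
    size_t len;
};

typedef enum handler_result (*frame_handler)(struct frame_model *frame);

#define MAX_HANDLERS 8

struct handler_chain {
    frame_handler handlers[MAX_HANDLERS];
    size_t count;
};

/* What a loadable module might call to plug in its own handler. */
int chain_register(struct handler_chain *chain, frame_handler fn)
{
    assert(fn != NULL);
    if (chain->count == MAX_HANDLERS)
        return -1;
    chain->handlers[chain->count++] = fn;
    return 0;
}

/* Run the handlers in order until one of them stops the walk. */
enum handler_result chain_run(const struct handler_chain *chain,
                              struct frame_model *frame)
{
    for (size_t i = 0; i < chain->count; i++) {
        enum handler_result res = chain->handlers[i](frame);

        if (res != HANDLER_CONTINUE)
            return res;
    }
    return HANDLER_CONTINUE;
}

/* Two example handlers. */
enum handler_result drop_short_frames(struct frame_model *frame)
{
    return frame->len < 10 ? HANDLER_DROP : HANDLER_CONTINUE;
}

enum handler_result consume_management(struct frame_model *frame)
{
    return frame->is_management ? HANDLER_CONSUMED : HANDLER_CONTINUE;
}
```

With both handlers registered in that order, a four-byte frame is dropped by the first handler, a full-sized management frame is consumed by the second, and a data frame falls through the whole chain.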

There is also a great deal of complexity coming with the long list of future 802.11 standards. These standards will require support as they are adopted.

One interesting area of development has to do with quality of service support. 802.11 defines four service levels: "voice," "video," "best effort," and "background." There is a priority range for each service level, and the ranges overlap. All voice packets will go out before any background packets, but the rest of the levels will share the available bandwidth. With proper QoS support, a wireless user can carry on a voice-over-IP conversation, stream video of the latest "breaking news" celebrity sighting from CNN, grab a new kernel by FTP, and distribute materials (best not to ask what) via Bittorrent. Each activity can operate at its own quality of service level, and all should get the best available performance.

Some wireless network adapters have quality of service support in the form of four separate transmit queues. If the host places each packet in the appropriate queue, the adapter will divide the available bandwidth between them in a way which respects each level's service quality. The problem is that the Linux networking stack only supports one transmit queue per device. This presents a problem when one of the four device-level queues fills up. There is no way to tell the kernel that no more background packets can be queued, but there is still space for voice packets, for example; the only thing the driver can do is to stop the queue for all packets.

The Devicescape hackers have worked around this problem using the traffic control mechanism built into the networking stack, which normally operates at a level not seen by driver code. By creating a separate internal queue for each service level, the Devicescape stack can, for all practical purposes, implement a separate transmit queue for each service level. Even better, it becomes possible to configure policy - which types of traffic get which service level - from user space using the normal traffic control tools. What would be nice, however, would be to generalize this use of the queueing discipline code, and to make it available for other sorts of hardware as well.
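The difference between the two approaches can be seen in a toy model. This sketch does not use the kernel's traffic control interfaces at all; the adapter_model structure, the queue depth, and the function names are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>

enum service_level { SL_VOICE, SL_VIDEO, SL_BEST_EFFORT, SL_BACKGROUND, SL_COUNT };

#define QUEUE_DEPTH 4   /* arbitrary depth for the illustration */

/* Hypothetical adapter state: packets waiting in each hardware queue. */
struct adapter_model {
    int queued[SL_COUNT];
};

static bool level_full(const struct adapter_model *hw, enum service_level sl)
{
    return hw->queued[sl] >= QUEUE_DEPTH;
}

/* One transmit queue per device: any full hardware queue stops everything. */
bool single_queue_can_send(const struct adapter_model *hw)
{
    for (int sl = 0; sl < SL_COUNT; sl++)
        if (level_full(hw, (enum service_level)sl))
            return false;
    return true;
}

/* One internal queue per service level: only the full level is held back. */
bool per_level_can_send(const struct adapter_model *hw, enum service_level sl)
{
    return !level_full(hw, sl);
}

/* Hand a packet to the adapter; returns false if that level has no room. */
bool enqueue(struct adapter_model *hw, enum service_level sl)
{
    assert(sl < SL_COUNT);
    if (level_full(hw, sl))
        return false;
    hw->queued[sl]++;
    return true;
}
```

Once the background queue has been filled, single_queue_can_send() reports that nothing more can be transmitted, while per_level_can_send() still lets voice packets through.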

Another area requiring work is user-space API definition. There is no well-understood API which, for example, can be used by a graphical wireless management utility to talk with the networking stack and with processes like hostapd. There isn't really even a discussion of how such an API should look at the moment.

Other open issues include the usual regulatory hassles, the lack of a user-space MAC-layer management environment, the need for better scanning, support for adapters which perform MAC management in hardware, power management support, and a rework of the configuration interface. Configuration is handled by way of ioctl() calls and a /proc interface. It was noted, in a pointed manner, that the Devicescape code will not make it into the mainline as long as it contains /proc files. It seems that the Devicescape stack also needs some work before it will operate properly on SMP systems.

Finally, adding proper wireless support to the kernel will involve the creation of a specific net_device type for 802.11 devices. An 802.11-specific sk_buff structure should also be defined. Current code still uses the Ethernet types and drags along the extra needed information on the side.

The biggest open issue, however, may be this: what happens to the just-merged softmac code when Devicescape is merged? There is much duplication of functionality there, and nobody is thrilled by the idea of having to maintain two separate 802.11 stacks indefinitely into the future. There is a clear parallel with the OSS and ALSA sound drivers; ALSA was supposed to replace OSS, but removing the OSS drivers has proved to be a difficult thing to do. It is not clear what can be done to make removing softmac any easier.

Tools

The summit was mostly attended by kernel-oriented developers, but there was also some discussion of user-space tools; NetworkManager hacker Daniel Williams was present. It is recognized by all that, while the quality of the available tools has improved significantly in the last couple of years, there is some ground to cover yet. In particular, configuring an interface can be relatively painless when things go well, but, as soon as something doesn't quite work, the whole experience falls apart.

Improving the situation will require support from the kernel side. When things go wrong, user space needs to know just what the problem is. But there is no consistent set of error codes returned by the kernel to indicate, for example, that the required adapter firmware is not present, or the provided WEP key is not valid. Some drivers support more of the current API than others, which does not help, and API documentation is generally not available. Better scanning support would also be useful.

Hardware support

While getting the networking stack and user-space tools into shape is necessary, improving hardware support is also a necessary step toward a Linux wireless implementation which truly "just works." Some hardware (Intel, others) is well supported now; other hardware (Broadcom) will be supported soon. Some, such as the Atheros chipset, may be a long time in coming. The existing Atheros driver (as found in OpenBSD) appears to be severely tainted by code of questionable origin, to the point that its chances of being merged into Linux are about zero. There is an effort to document the Atheros hardware from the currently-available code, enabling a clean-room driver implementation in the future, but there is quite a bit of work yet to be done.

The regulatory compliance issue came up again in this context. Some adapters (such as Atheros) are, for all practical purposes, general-purpose radios which can be programmed to operate far out of the 802.11 specification. When a free driver is developed for such hardware, it would be a Good Thing to be sure that it runs the hardware in a manner compliant with the applicable regulations, even if it cannot necessarily be certified as such. That sort of testing requires specialized equipment, however, and is evidently a multi-day process. The necessary equipment does exist at companies like Nokia and at some universities, but there is currently no process for obtaining access to that equipment for compliance testing.

Much of the current driver work is done outside of the mainline tree, and the kernel developers would like to see that changed. Once code gets into the mainline, it is easier for others to review and improve. Greg Kroah-Hartman encouraged driver developers to merge their code as early as possible, even if it doesn't work yet.

Communications regarding wireless drivers, it was agreed, would remain on the netdev mailing list for now. If, at some point, that conversation threatens to overwhelm other traffic on netdev, a new list can be created. There will also likely be a web site put together for wireless driver information in the near future.

Other issues

One purpose behind the summit was simply to try to pull more of the relevant developers into the wider kernel process. To that end, there was a talk on source control systems - git and quilt in particular. The "merge early" approach was advocated many times.

Stephen Hemminger gave a talk on the state of the bridging code. Bridging is of interest to wireless developers - it can be used for connection sharing and mesh networking applications. To that end, the bridging code is likely to be reworked and much of it moved to user space. Just like routing is mostly handled by user-space daemons now, bridging management - including the spanning tree maintenance - will move to user space in the future.

Some representatives of the Personal Telco Project were brave enough to compete with a delivery of pizza for the developers' attention at lunch time. These folks have put together a network of over 100 Linux-based free wireless hotspots around Portland. They had a number of requests of the kernel developers, including free Atheros drivers which don't crash the system and good, zero-configuration mesh networking. This is an interesting project which shows the power of what a few "unemployed geeks" can do.

Overall, the wireless summit was an optimistic event. While the shortcomings of Linux wireless support were well recognized and understood, there was also a clear sense that not only could the problems be solved, but that many of the solutions were already well advanced. If all goes according to plan, the day when Linux wireless "just works" is not that far off.


Patches and updates

Kernel trees

Andrew Morton 2.6.17-rc1-mm2
Alexey Dobriyan 2.6.17-rc1-kj
Greg KH Linux 2.6.16.2
Greg KH Linux 2.6.16.3
Greg KH Linux 2.6.16.4
Greg KH Linux 2.6.16.5
Con Kolivas 2.6.16-ck4

Architecture-specific

Gerd Hoffmann x86_64: SMP alternatives

Build system

Roman Zippel kconfig patches

Core kernel code

Development tools

Con Kolivas Kernbench v0.41
Junio C Hamano GIT 1.2.6
Petr Baudis Cogito-0.17.2
Marco Costalba qgit-1.2rc1
Bryan O'Sullivan Mercurial 0.8.1 released

Device drivers

Documentation

Michael Kerrisk man-pages-2.29 is released

Filesystems and block I/O

Memory management

Networking

Patrick McHardy : Netfilter Update

Security-related

Virtualization and containers

Serge E. Hallyn uts namespaces: Introduction

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds