Kernel development

Brief items

Kernel release status

The current 2.6 development kernel is 2.6.27-rc1, released by Linus on July 28. Some 8100 changesets were merged during the 2.6.27 merge window; see the article below for a summary. Highlights for 2.6.27 will include lots of new drivers (including the gspca webcam drivers), support for hardware data integrity checking in the block layer, support for checkpointing and restoring of virtual machines in Xen, the ftrace tracing framework, mmiotrace, the tracehook patches, delayed allocation in ext4, the UBIFS filesystem, multiqueue networking, kexec jump, the extension of a number of system calls for safer user-space programming, the lockless page cache (see below), and much more. See the short-form changelog for details, or the long-form changelog for lots of details.

As of this writing, no patches have been merged into the mainline repository since the 2.6.27-rc1 release.

The current stable 2.6 kernel remains 2.6.26; there have not yet been any updates to this kernel, though the word is that the pile of patches for such an update is growing.

2.6.25.13 was released on July 28 with a number of networking-related fixes, some of which appear to address severe problems. 2.6.25.12, with a long list of fixes, was released on July 24.

Kernel development news

Quotes of the week

Ok, so now that I've insulted you and your pets (they're ugly!), show me wrong, and then call me a d*ckhead. ("Linus - you're a d*ckhead, and you didn't understand the problem, so you're a _stupid_ d*ckhead. And my pet may be ugly, but yours _smells_ bad!").

Or say "Uh, yeah, we're morons, and here's the much better patch, and we won't do that again".

-- Linus Torvalds

Amazing! Your code, once plugged into the kernel proper, booted fine on 5 different x86 testsystems, it booted fine an allyesconfig kernel with MAXSMP and NR_CPUS=4096, it booted fine on allnoconfig as well (and allmodconfig and on a good number of randconfigs as well)....

[B]ecause v1 of your code was so frustratingly and mind-blowingly stable in testing (breaking a long track record of v1 patches in this area of kernel), and because the perfect patch does not exist by definition, i thought i'd mention that after a long search i found and fixed a serious showstopper bug in your code: you used "1ul" in your macros, instead of the more proper "1UL" style. The ratio between the use of 1ul versus 1UL is 1:30 in the tree, so your choice of integer literals type suffix capitalization was deemed un-Linuxish, and was fixed up for good.

-- Ingo Molnar

In anycase, it sounds like Tux3 is using many similar ideas. I think you are on the right track. I will add one big note of caution, drawing from my experience implementing HAMMER, because I think you are going to hit a lot of the same issues.

I spent 9 months designing HAMMER and 9 months implementing it. During the course of implementing it I wound up throwing away probably 80% of the original design outright.

-- Matthew Dillon. The whole thread is an interesting read in filesystem design.

The pure size of the -rc's _is_ making me a bit nervous, though. Sure, it means that we are good at merging it all, but I have to say that I sometimes wonder if we don't merge too much in one go, and even our current (fairly short) release cycle is actually too big.

Anyway, that's a discussion for some other event.

-- Linus Torvalds

I seem to be hearing a lot of silence over support for SSD devices. I have this vague worry that there will be a large rollout of SSD hardware and Linux will be found to have pants-around-ankles.

-- Andrew Morton

2.6.27 - the rest of the story

By Jonathan Corbet
July 29, 2008
The 2.6.27 merge window closed with the 2.6.27-rc1 release on July 28. Some 8100 changesets were merged this time around, making 2.6.27 another busy development cycle. A number of interesting things went in since last week's update; the most significant changes visible to Linux users include:

  • There are new drivers for ILI9320 LCD controller chips, Cobalt server LCD frame buffers, SH7760/SH7763 integrated LCD controllers, NXP pca9532 LED controllers, Philips PCA955x I2C LED controllers, WMI-based hotkeys on HP laptops, Maxim MAX73xx I2C port expanders, Micronas DRX3975D/DRX3977D DVB-T demodulators, DvbWorld 2102 DVB-S USB2.0 receivers, MaxLinear MxL5007T silicon tuners, Renesas SH7763 evaluation boards, Renesas Solutions AP-325RXA boards, Renesas R0P7785LC0011RL boards, and Atmel integrated touchscreens. Also added is "mISDN," a new, modular ISDN driver intended to replace older code for a number of ISDN cards. Support for using mISDN drivers remotely via an IP tunnel has been added.

  • The Palm T|X handheld computer is now supported.

  • The tmpfs filesystem has gained support for asynchronous I/O.

  • The hugetlbfs mechanism can now support multiple huge page sizes. There is a new directory (/sys/kernel/hugepages) with information on huge page allocations. The x86 (64-bit) architecture now supports 1GB pages; PowerPC can go to 16GB.

  • Most system calls which create file descriptors can now accept a set of flags; this change allows the race-free establishment of close-on-exec semantics, requesting non-blocking opens, and more. Developers wanting to use this capability will have to wait for a version of glibc which adds the requisite interfaces.

  • The unmaintained v850 architecture has been removed.

  • The kexec jump patch set, which uses the kexec mechanism as an alternative way of implementing suspend-to-disk, has been merged.

  • The omfs filesystem has been merged.

  • /proc now has a file (called syscall) for each process; when read, it displays the process's current system call and the supplied arguments.

  • Linux users hoping to upgrade their systems in the near future will be glad to know that a series of patches designed to make the kernel scale to 4096 processors has been merged.
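
As an illustration of the file-descriptor flags item above, here is a minimal user-space sketch. It assumes a C library that exposes pipe2(), one of the new flag-accepting calls; the helper names make_cloexec_pipe() and fd_is_cloexec() are invented for this example and are not part of any library.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* glibc declares pipe2() only with this set */
#endif
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a pipe whose two ends are close-on-exec
 * from the moment they exist. Returns 0 on success, -1 on failure
 * (errno is set by pipe2()).
 */
int make_cloexec_pipe(int fds[2])
{
    return pipe2(fds, O_CLOEXEC);
}

/* Hypothetical helper: 1 if fd has FD_CLOEXEC set, 0 if not, -1 on error. */
int fd_is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}
```

With plain pipe() followed by fcntl(F_SETFD), another thread could fork() and exec() between the two calls and leak the descriptors into the new program; setting the flag at creation time closes that window.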

Changes visible to kernel developers include:

  • The tracehook mechanism for defining static trace points (described in this article) has been merged, along with a number of trace points in the core kernel.

  • A new, lockless form of get_user_pages() has been added:

        int get_user_pages_fast(unsigned long start, int nr_pages, int write,
                                struct page **pages);
    

    Details of this interface can be found in this article, with the one note that early versions were called fast_gup() instead. (See also the related lockless page cache work, which was also merged).

  • The long-debated mmu-notifiers patch has been merged. The notifiers allow external memory management units (as may be seen in some graphics cards or in virtualized guests) to be told about decisions made by the core memory management code.

  • There is a new framework for debugging boot-time memory initialization; there are also "a few basic defensive measures" intended to prevent difficult-to-debug boot problems.

  • The new function:

        int object_is_on_stack(void *obj);
    

    returns a true value if the pointed-to object is on the current kernel stack.

  • There is a new macro for issuing warnings:

        WARN(condition, format, ...);
    

    It's much like WARN_ON() in that it will produce a full oops listing; the difference is the added printk()-style format string and arguments.

  • A new helper function:

        int flush_work(struct work_struct *work);
    

    waits for the specific workqueue job work to finish executing.

  • dma_mapping_error() and pci_dma_mapping_error() have new prototypes:

        int dma_mapping_error(struct device *dev, dma_addr_t dma_addr);
        int pci_dma_mapping_error(struct pci_dev *hwdev, dma_addr_t dma_addr);
    

    In each case, they have gained a new argument specifying which device the mapping is being done for.

  • There are a couple of new radix tree functions:

        unsigned int radix_tree_gang_lookup_slot(struct radix_tree_root *root,
                                                 void ***results,
                                                 unsigned long first_index,
                                                 unsigned int max_items);
        unsigned int radix_tree_gang_lookup_tag_slot(struct radix_tree_root *root,
                                                     void ***results,
                                                     unsigned long first_index,
                                                     unsigned int max_items,
                                                     unsigned int tag);
    

    They are useful for looking up multiple items in a single call.

  • Slab cache constructors no longer have a pointer to the cache itself as an argument; they now take a single void * pointer to the object itself.

  • The long list of Video4Linux2 ioctl() callbacks has been moved into its own structure (struct v4l2_ioctl_ops) which is pointed to by the ioctl_ops member of struct video_device.

Now begins the long task of finding and fixing all the bugs in all this new code. If the usual pattern holds, that process will take about two months, suggesting that we can expect 2.6.27 sometime in October.

The lockless page cache

By Jonathan Corbet
July 29, 2008
One of the biggest problems in kernel development is dealing with concurrency. In a system where more than one thing can be happening at once, one must always take care to keep multiple threads of control from interfering with each other and corrupting the system as a whole. In the same way that two roads become more dangerous when they intersect, connecting two or more processors to the same memory greatly increases their potential for the creation of mayhem.

Travelers to the US are often amused (or irritated) by the often-favored solution to roadway concurrency: putting in traffic lights. Such a light will indeed (if observed) eliminate the potential for a number of unpleasant race conditions within intersections, but at a performance cost: traffic going through the intersection must often stop and wait. This solution also scales poorly; as more roads (or lanes with different destinations) feed into the same intersection, each of them experiences more red-light time.

In kernel programming, the first tool for controlling concurrency - locks in various forms - is directly analogous to traffic lights. It is not coincidental that the name for a common locking primitive (semaphore) matches the name for a traffic light (semaforo) in a number of Latin-derived languages. Locks enforce exclusive access to a kernel resource in the same way that a traffic light enforces exclusive access to an intersection, and with many of the same costs. When too many processors end up waiting at the same lock, the performance of the system as a whole can suffer significantly.

There are two common approaches to mitigating scalability problems with locks. For many years after the 2.0 kernel came out, these problems were addressed through the creation of more locks, each controlling a smaller resource. Lock proliferation is effective, in that it reduces the chance that two processors will be trying to acquire the same lock at the same time. Since it works so well, this approach has led to the creation of thousands of locks in the Linux kernel.

Proliferation has its limits, though. Adding locks increases complexity; in particular, with more locks, the chances of creating occasional deadlock situations increase. Deadlocks can be avoided through the careful observation of rules on the acquisition of locks, and the order in which they are acquired in particular. But nobody will ever be able to sort out - and document - the proper relative locking order for thousands of locks. So kernel developers must make do with rules for some of the most important locks and the vigilance of the lockdep tool to find any remaining problems.
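
One simple ordering rule can be sketched in user space. The example below is not kernel code: it uses POSIX mutexes, the helper names lock_pair() and unlock_pair() are hypothetical, and the rule it assumes ("always take the lock at the lower address first") is only one of many possible conventions.

```c
#include <pthread.h>
#include <stdint.h>

/*
 * Acquire two mutexes in a fixed global order (lowest address first).
 * Two threads calling this with the same pair of locks, in either
 * argument order, will acquire them in the same sequence, so neither
 * can end up holding one lock while waiting forever for the other.
 * Returns 0 on success or a pthread error code.
 */
int lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    int err;

    if (a == b)
        return pthread_mutex_lock(a);

    if ((uintptr_t)a > (uintptr_t)b) {
        pthread_mutex_t *tmp = a;
        a = b;
        b = tmp;
    }

    err = pthread_mutex_lock(a);
    if (err)
        return err;
    err = pthread_mutex_lock(b);
    if (err)
        pthread_mutex_unlock(a);   /* back out so no lock stays held */
    return err;
}

/* Release both locks; the unlock order does not matter for deadlocks. */
void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_unlock(a);
    if (a != b)
        pthread_mutex_unlock(b);
}
```

A rule like this is easy to state for two locks of the same kind; the difficulty described above comes from having thousands of locks of different kinds, where no single comparison defines the order.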

The other problem with lock proliferation is harder to get around, though. The acquisition of a lock requires writing a value to a location in shared memory. As each processor acquires a lock, it must change that value, which causes that processor to acquire exclusive access to the cache line holding the lock variable. The cache lines for heavily-used locks will fly around the system in a way that badly hurts performance, even if no processor ever has to wait for another to release the lock. Adding more locks will not fix this problem; instead, it will just create more bouncing cache lines and make things worse.
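
To see why even an uncontended acquisition writes to shared memory, consider a toy test-and-set spinlock built on C11 atomics. This is an illustrative user-space sketch, not the kernel's spinlock implementation, and the toy_ names are made up for this example.

```c
#include <stdatomic.h>

struct toy_spinlock {
    atomic_flag locked;
};

/*
 * Test-and-set is a read-modify-write operation: it stores to the lock
 * word even when nobody else holds the lock, so the acquiring CPU must
 * pull the cache line holding 'locked' into exclusive state each time.
 */
void toy_spin_lock(struct toy_spinlock *lock)
{
    while (atomic_flag_test_and_set_explicit(&lock->locked,
                                             memory_order_acquire))
        ;   /* spin until the previous value was "clear" */
}

/* Returns 1 if the lock was taken, 0 if it was already held. */
int toy_spin_trylock(struct toy_spinlock *lock)
{
    return !atomic_flag_test_and_set_explicit(&lock->locked,
                                              memory_order_acquire);
}

/* Releasing the lock is another store to the same cache line. */
void toy_spin_unlock(struct toy_spinlock *lock)
{
    atomic_flag_clear_explicit(&lock->locked, memory_order_release);
}
```

Both the lock and unlock paths store to the lock word, which is why splitting one lock into many only spreads the cache-line traffic around rather than removing it.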

So, as the number of processors grows, the path to continued scalability must not include the wholesale creation of new locks; indeed, it requires the removal of locks in the most performance-critical paths. And that is what this whole long-winded introduction leads up to: the 2.6.27 kernel will include some changes by Nick Piggin which implement lockless operation in some important parts of the virtual memory subsystem. And those, in turn, will lead to faster operation on multiprocessor systems.

The first of these changes is a new function for obtaining direct access to user-space pages from the kernel:

	int get_user_pages_fast(unsigned long start, int nr_pages, int write,
			        struct page **pages);

This function works much like get_user_pages(), but, in exchange for some limits on its operation, it is able to do its job without acquiring the mmap semaphore; that, in turn, can lead to a 10% performance boost on "a threaded database workload." The details of how this function works were covered here last March (though the function was called fast_gup() back then), so we'll not repeat that discussion here.

The other big change is a set of patches which Nick has been carrying for quite some time: the lockless page cache. The page cache holds in-memory copies of pages from files on disk; its purpose is to improve performance by minimizing disk I/O. Looking up pages in the page cache is a common activity; it happens as a result of file I/O, page faults, and more. So it needs to be fast. In 2.6.26 kernels, each mapping (each connection between the page cache and a specific file in a filesystem somewhere) has its own lock. So processors will not normally contend for the locks unless they are operating on the same file. But locks for commonly-accessed files (shared libraries, for example) are likely to be frequently bounced between processors.

Most page cache operations are lookups - read-only operations which make no changes. In the lookup operation, the lock protects a few aspects of the task, including:

  1. A given page within the mapping must be looked up in the mapping's radix tree to find its location in memory (if any).

  2. If the page is resident in the page cache, it must have its reference count increased so that it will not be evicted before the code performing the lookup has done whatever it needs to do.

The radix tree, itself, is a complicated data structure; it must be protected from modification while the lookup is being performed. For certain, performance-critical parts of the radix-tree code, that protection is done through (1) some rules on what can be called when, and (2) the use of read-copy-update (RCU). As a result, the radix tree lookup can be done in a lockless manner.

There is still a problem, though: a given page may be evicted from the page cache (or simply moved) between steps (1) and (2) above. Should that happen, the second step will increment the reference count for a page which now belongs to a different mapping, and return an incorrect pointer. The kernel developers have, through lots of experience over many years, learned that system crashes resulting from data corruption are quite hard on throughput. So true scalability requires that this kind of scenario be avoided; thus the per-mapping lock, which prevents page cache changes from being made until the reference count has been properly updated.

Nick made an interesting observation here: it actually doesn't matter if the wrong reference count gets incremented as long as one ensures that the specific page mapping is still valid afterward. The result is a new, low-level page cache function:

    int page_cache_get_speculative(struct page *page);

If the given page has a reference count of zero, then the page has been removed from the page cache; in that case this function returns zero and the reference count will not be changed. If the reference count is non-zero, though, it will be increased and a non-zero value will be returned.

Incrementing a page's reference count will prevent that page from being evicted or moved until the count goes back to zero. So kernel code which has incremented a specific page's reference count will thereby ensure that the page stays in its current state. In the page cache case, the code can obtain a speculative reference to a page found in a mapping's radix tree. But it does not, yet, know whether it actually got a reference to the page it was looking for - something may have happened between the radix tree lookup and the obtaining of the reference. So it must check - after the reference has been acquired - to be sure that it has the right page. If not, it releases the reference and tries again. Eventually it will either pin down the right page or verify that the relevant part of the file is not resident in memory.
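
The lookup-and-recheck pattern described above can be modeled in user space with C11 atomics. This is a toy sketch under stated assumptions, not the kernel's code: a fixed array of slots stands in for the radix tree, the toy_ types and functions are hypothetical, and page structures are assumed never to be freed while a lookup is running (the job RCU does for the real radix tree).

```c
#include <stdatomic.h>
#include <stddef.h>

#define TOY_SLOTS 8

struct toy_page {
    atomic_int refcount;        /* 0 means "removed from the cache" */
};

struct toy_mapping {
    _Atomic(struct toy_page *) slot[TOY_SLOTS];   /* stand-in for the tree */
};

/* Take a reference only if the count is non-zero; 1 on success, else 0. */
int toy_get_speculative(struct toy_page *page)
{
    int old = atomic_load(&page->refcount);

    while (old != 0) {
        /* On failure, 'old' is refreshed with the current count. */
        if (atomic_compare_exchange_weak(&page->refcount, &old, old + 1))
            return 1;
    }
    return 0;
}

void toy_put(struct toy_page *page)
{
    atomic_fetch_sub(&page->refcount, 1);
}

/* Lockless lookup: returns a referenced page, or NULL if not resident. */
struct toy_page *toy_find_get_page(struct toy_mapping *m, unsigned long index)
{
    _Atomic(struct toy_page *) *slot = &m->slot[index % TOY_SLOTS];

    for (;;) {
        struct toy_page *page = atomic_load(slot);

        if (page == NULL)
            return NULL;            /* nothing cached at this index */
        if (!toy_get_speculative(page))
            continue;               /* page is being removed; look again */
        if (page == atomic_load(slot))
            return page;            /* still the right page: pinned */
        toy_put(page);              /* slot changed under us; retry */
    }
}
```

The key point is the order of operations: the reference is taken first, and only then is the slot re-read, so a page that passes the second check cannot be evicted or moved afterward.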

Lockless operation forces a bit more care on the part of the page reclaim code, which is trying to get a page's reference count down to zero so that it can remove the page. Since there is no locking around the reference count now, the reclaim code must set it to zero while checking, in an atomic manner, that nobody else has incremented it. That is the purpose of the atomic_cmpxchg() function, which will only perform the operation if it does not collide with another processor. Since page_cache_get_speculative() will not increment the reference count if it is zero, the reclaim code knows that, by getting that count to zero, it now has exclusive control of the page.
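
The reclaim side of that protocol can be sketched with a C11 compare-and-exchange, which plays the role that atomic_cmpxchg() plays in the kernel. The function name toy_freeze_refs() and its arguments are hypothetical; the sketch only assumes that the reclaim code knows how many references it expects to exist.

```c
#include <stdatomic.h>

/*
 * Try to move the reference count from exactly 'expected' to zero in a
 * single atomic step. Returns 1 if the count was 'expected' and is now
 * zero; returns 0 (leaving the count untouched) if some other CPU had
 * taken an extra reference in the meantime.
 */
int toy_freeze_refs(atomic_int *refcount, int expected)
{
    return atomic_compare_exchange_strong(refcount, &expected, 0);
}
```

If the call succeeds, any concurrent speculative lookup will see a zero count and back off; if it fails, the reclaim code simply leaves the page alone.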

The end result of all this is that a set of locking operations has been removed from the core of the page cache, improving the scalability of that code. There is, of course, a cost, in the form of trickier code with a more complex set of rules which must be followed. Chances are that we will see more of this kind of code, though, as the number of processors in our systems increases.

OLS: The state of Linux wireless networking

By Jake Edge
July 30, 2008

Kernel wireless maintainer John Linville outlined the past, present, and future of the Linux wireless stack on the first day of this year's Ottawa Linux Symposium. In his presentation, he ranged from early efforts, which were "a sore spot for Linux", to a future in which Linux is likely to support some features before "that other OS". Along the way, he looked at various issues that wireless support in Linux faces, including vendor participation, suspend and resume, and regulatory issues.

Linville has been the maintainer of Linux wireless for two and a half years, since being recruited into the job by David Miller and Jeff Garzik. When he took over, wireless support was in disarray, as there were competing stacks to support different hardware. Users were faced with lots of pain in getting things working when "they just want their hardware to work," said Linville. Since that time, things have greatly changed.

The original wireless hardware was what is called "Full MAC hardware", where the implementation of the wireless protocols was handled by the hardware, generally in firmware. The drivers made these devices appear to be regular wired ethernet devices, though they did require some special configuration for SSID and the like. Because the hardware would enforce various regulatory requirements, vendors would generally work with the community in order to support the hardware.

All of that changed with the advent of "Soft MAC hardware"—which Linville likened to winmodems—where the CPU implements most of the protocol. It is a cheaper solution for vendors, but it requires an 802.11 stack for the kernel. The ieee80211 drivers came along to support the Intel Centrino wireless hardware, but they only supported those few devices. Johannes Berg added the ieee80211softmac driver that added some additional hardware support, but it was a kludgy solution. Since then, Linville said, folks have realized that it was "sort of a mistake to go down that road".

Enter the Devicescape stack. It was a feature-rich 802.11 stack for Linux that was popular with developers. After some locking and SMP problems were resolved, it was merged into 2.6.22 as the mac80211 driver. Once that happened, wireless drivers started using it, to the point where Linville showed a chart of the current drivers, almost all of which use mac80211. "It's been a boon to us to pick up the mac80211 code."

One notable driver that does not support mac80211 is the libertas driver for the OLPC. Unlike most other current devices, it is a Full MAC device with special requirements. It has support for power saving modes that do not yet exist in mac80211. Because it is a mesh-networking device that still participates in forwarding network traffic when the system is powered down, it has needs that are not yet supported.

Drivers in progress were the next topic Linville addressed. Several of these are in need of developers to work on them, specifically those for the Airgo chipset and the Atmel USB chipset. The TI chipset drivers have had some questions raised about the reverse engineering process and may require a legal vetting similar to what the SFLC did for ath5k. Marvell is sponsoring development of a mac80211-based driver for its hardware. This driver may also support 802.11n, which allows for greater range and higher speeds than current-generation 802.11.

Using data from LWN, Linville looked at the activity level of wireless development in Linux. He was amazed to note "how much of the 2.6.26 kernel came through this laptop". Using his Signed-off-by as a proxy for wireless LAN commits, he noted that 4.3-5.6% of the kernel commits in the last three releases (.24 through .26) were for wireless. In each kernel, wireless ranked either fourth or fifth by number of commits.

The compat-wireless-2.6 project is aimed at supporting newer hardware in older kernels. Because folks are wary of running kernel.org kernels or their distribution supports an older kernel—but they want to run with the latest hardware—the project backports wireless drivers to kernels as old as 2.6.21. It is a set of scripts and patches that build against the user's kernel. Unfortunately, the project may not last much longer as the multiqueue changes that have been merged for 2.6.27 may change the drivers enough that they will be infeasible to backport.

At the top of the list for new features is removal of the wireless extensions in favor of the new cfg80211 mechanism. According to Linville, "nobody likes wireless extensions, and nobody likes the existing tools". The wireless extensions have vague semantics and can have problems with race conditions; because they are implemented by ioctl() calls, they also encourage duplication of code in multiple drivers. cfg80211 will bring a much cleaner API along with fixes for some existing bugs like the 31-character limit for SSIDs.

Access point (AP) mode is another feature that is coming. Typically, APs use similar or identical hardware to that in wireless MACs. For Soft MAC hardware, all that is needed is support on the CPU side for AP mode, which is coming for mac80211. Mesh networking, which has been popularized by the OLPC project, is also coming to mac80211. Cozybit has provided an implementation which will allow Linux to have a feature unavailable for Windows.

Areas that need attention, but are not yet being worked on, were next on Linville's agenda. Suspend and resume support is "flawed for mac80211 due to connection management issues". Because mac80211 is unaware of suspend and resume, drivers must work around it by de-registering and re-registering with it, which can be slow. Adding support for suspend and resume is on the list, as is supporting power saving modes.

Linville went on to discuss three big issues that are largely outside of the control of the wireless hackers: firmware licensing, vendor participation, and regulatory concerns. Because drivers for Windows come with the firmware in the driver, many hardware vendors do not license the firmware blob separately. This means that it is unclear what can be done with those blobs. Certain vendors—Intel and Ralink were specifically called out—provide liberal licenses for their firmware. Users are encouraged to "vote with your dollars" by purchasing devices that either do not require firmware or that have a clear, free software friendly license.

Another consideration when deciding which vendors to support is whether they are engaged with the community. For the most part, all vendors but Broadcom are working with the wireless hackers by providing documentation and/or source code. Some are even providing dedicated developers to work on Linux drivers—Intel was the first, but both Atheros (which just released a driver for its ath9k hardware) and Marvell have also begun doing that.

Government regulations about what can and cannot be done in the unlicensed frequencies used by wireless are a concern that is frequently cited by vendors when refusing to work with the community. Unfortunately, their concerns are not completely without merit as hardware vendors are expected to ensure compliance with the regulations. "Non-compliance could be a huge loss" for those companies. As Linville points out, though, most vendors find a way to support Linux drivers.

In answer to a question, Linville said that most WiMAX and 3G wireless devices are Full MAC designs, so there should be little or no regulatory concern, which, in turn, means that Linux support should not be much of a problem—at least until Soft MAC devices come along. Overall, Linux wireless has come a long way, but there is lots still to do. One gets the sense that the wireless team is up to the task.

Patches and updates

Build system

  • Sam Ravnborg: kbuild. (July 28, 2008)

Page editor: Jonathan Corbet

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds