Kernel development

Kernel release status

The current stable 2.6 kernel is 2.6.24, released by Linus on January 24. Highlights of this release include control groups (formerly process containers), the i386/x86_64 architecture merger, group scheduling in the CFS scheduler, network and PID namespaces, kernel markers, the removal of the modular security interface, and much more. See LWN's list of merged patches for more detail, or the always-amazing KernelNewbies Linux Changes page for much more detail.

The 2.6.25 merge window is open, but the process of picking up patches is going relatively slowly due to the distractions of linux.conf.au. See the article below for a summary of what has been merged to date.

For older kernels: 2.6.16.60 was released on January 27 with about a dozen fixes.
Kernel development news

Quotes of the week
I skipped a lot of these patches because I just got bored of fixing
rejects. Now is a very optimistic time to be raising patches against
mainline.
-- Andrew Morton
I'm going to work on getting a unified devel tree operating: one which contains everyone's latest stuff and is updated daily. Basically it'll be -mm without a couple of the quilt trees. People can then prepare patches against that, as it seems that most can't be bothered patching against -mm, let alone building and testing it. More later.
Even Anton Blanchard's phone calls have a signed-off-by line.
-- AntonBlanchardFacts.com
What got into 2.6.25

As of this writing, some 3800 patches have been merged into the mainline git repository since the release of 2.6.24. That is fewer than one might have expected, but Linus's travel to linux.conf.au is slowing the process somewhat. Expect more than the usual amount of interesting stuff to be merged relatively late in the merge window period.

User-visible changes include:
Changes visible to kernel developers include:
As of this writing, the merging process has just begun, so expect a long list again next week. Among other things, the x86 tree update, with 908 changesets, is waiting in the wings. There is quite a bit of code yet to be merged for this development cycle.
Avoiding the OOM killer with mem_notify

Having applications that use up all the available memory can be a fairly painful experience. For Linux systems, it generally means a visit from the out-of-memory (OOM) killer, which will try to find processes to kill. As one would guess, coming up with rules governing which process to kill is challenging: someone, somewhere, will always be unhappy with a choice the OOM killer makes. Avoiding it altogether is the goal of the mem_notify patch.

When memory gets tight, it is quite possible that applications have memory allocated (often caches for better performance) that they could free. After all, it is generally better to lose some performance than to face the consequences of being chosen by the OOM killer. But, currently, there is no way for a process to know that the kernel is feeling memory pressure. The patch provides a way for interested programs to monitor the /dev/mem_notify file to be notified if memory starts to run low.

/dev/mem_notify is a character device that signals memory pressure by becoming readable. Interested programs can open the file and then use poll() or select() to monitor the file descriptor. Alternatively, signal-driven I/O can be enabled via the FASYNC flag and the system will deliver a SIGIO signal to the process when the device becomes readable. If it becomes readable, the process should free any memory that it can afford to give up. If enough memory is freed this way, the kernel will have no need to call in the OOM killer.

The crux of the patch is how to decide that memory pressure is occurring. mem_notify modifies shrink_active_list() to look for movement of an anonymous page to the inactive list, which is an indication that some pages will likely be swapped out soon. When that occurs, memory_pressure_notify() (with the pressure flag set to 1) will be called for that zone. When the number of free pages for the zone increases above a threshold, based on pages_high and lowmem_reserve for the zone, memory_pressure_notify() is called again, but with the pressure flag set to 0, effectively ending the memory pressure event for that zone.

If there are numerous processes waiting for a memory pressure notification, it could be counterproductive to wake them all at once: the "thundering herd" problem. To combat this, the patch set adds the ability to wake fewer processes than are waiting on the poll event, by adding the poll_wait_exclusive() function. poll_wait_exclusive() will in turn call add_wait_queue_exclusive() so that a member of the wake_up() family can be used that will limit the number of processes woken up. Previously, only poll_wait() was available; it uses add_wait_queue(), which does not provide this ability. Also, to reduce the frequency of processes waking up to reclaim memory, memory_pressure_notify() will wake waiting processes at most once every five seconds.

The /proc/zoneinfo output has been changed to include the mem_notify status. This can be used by a human for diagnostic purposes or by a program to check the current status of zones for memory pressure.

The embedded community has a lot of interest in seeing this feature get added to the kernel. Devices like phones and PDAs are often running close to their memory limits and the OOM killer is currently unavoidable when the user opens yet another application. With this patch in place, programs that use a lot of memory, but could get by with less, can be changed to free up their caches and the like when memory gets tight. As memory-hungry programs get changed, other users will benefit as well.

The patch, submitted by Kosaki Motohiro, has been through several iterations on linux-kernel. The work was originally started by Marcelo Tosatti, with the fifth version recently posted by Kosaki. Previous versions have been well received and, with relatively few comments on this iteration, it would seem to be getting close to being merged.
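As a rough illustration of how an application might use the interface described above, here is a minimal userspace sketch built on poll(). Only the /dev/mem_notify path and the "readable means pressure" convention come from the patch description; the helper names (open_mem_notify(), wait_for_pressure(), handle_pressure_once()) and the example cache are hypothetical, and the device exists only on a kernel carrying the patch. The sketch does not read from the descriptor, since the article does not say what a read would return.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical application cache that can be dropped under memory pressure. */
static char *example_cache;
static size_t example_cache_size;

/* Allocate the example cache; returns 0 on success, -1 on failure. */
int fill_example_cache(size_t size)
{
    char *p = malloc(size);

    if (p == NULL)
        return -1;
    free(example_cache);
    example_cache = p;
    example_cache_size = size;
    return 0;
}

/* Free the example cache and report how many bytes were given back. */
size_t drop_example_cache(void)
{
    size_t freed = example_cache_size;

    free(example_cache);
    example_cache = NULL;
    example_cache_size = 0;
    return freed;
}

/* Path taken from the article; it is present only with the patch applied. */
int open_mem_notify(void)
{
    return open("/dev/mem_notify", O_RDONLY);
}

/*
 * Wait up to timeout_ms for fd to become readable.
 * Returns 1 if readable (pressure signalled), 0 on timeout, -1 on error.
 */
int wait_for_pressure(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ret;

    assert(fd >= 0);
    do {
        ret = poll(&pfd, 1, timeout_ms);
    } while (ret < 0 && errno == EINTR);   /* retry if a signal interrupted us */

    if (ret <= 0)
        return ret;                        /* 0: timeout, -1: error */
    return (pfd.revents & POLLIN) ? 1 : -1;
}

/*
 * Poll once and, if pressure is signalled, call the release callback.
 * Returns the number of bytes released, 0 on timeout, -1 on error.
 */
long handle_pressure_once(int fd, int timeout_ms, size_t (*release)(void))
{
    int state = wait_for_pressure(fd, timeout_ms);

    if (state <= 0)
        return state;
    return (long)release();
}
```

A long-running program would call open_mem_notify() once and then invoke handle_pressure_once() from its event loop, passing whatever function trims its own caches in place of drop_example_cache().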
A new block request completion API

The 2.6 block layer has traditionally provided a pair of functions by which a driver could indicate that an I/O request had been completed. A call to end_that_request_first() signaled the transfer of a certain amount of data and would return a value indicating whether the request as a whole was complete. Once all sectors in a request had been transferred, it was up to the driver to pass the request to end_that_request_last() for final cleanup. There was also a function called simply end_request() which might or might not end the entire request, depending on how much data had been transferred.

This API has worked for a long time, but it has occasionally proved confusing for driver developers. It was also hard for drivers to communicate useful error information with this interface. So, as of 2.6.25, there will be a new way for drivers to indicate request completion.

After a block driver has transferred one or more sectors (or failed in the attempt), it should now make a call to:
int blk_end_request(struct request *rq, int error, int nr_bytes);
Where rq is the I/O request, error is zero or a negative error code, and nr_bytes is the number of bytes successfully transferred. If blk_end_request() returns zero, the request is fully processed and the driver can forget about it. Otherwise there are still sectors to be transferred and the driver should continue with the same request.

blk_end_request() must acquire the queue lock to do its job. If the driver already holds that lock, it should call __blk_end_request() instead.

Block drivers traditionally did a number of housekeeping tasks between calls to end_that_request_first() and end_that_request_last(). These include calling add_disk_randomness() to contribute to the entropy pool, returning any tags used with the request, and removing the request from the queue. All of that stuff is now done within blk_end_request(), so drivers can forget about it.

The occasional driver had to carry out other tasks between the completion of the request and its removal from the queue. For drivers with this kind of special need, there is a separate function to call:
int blk_end_request_callback(struct request *rq,
                             int error,
                             int nr_bytes,
                             int (drv_callback)(struct request *));
In this version, drv_callback() will be called (without the queue lock held) between the completion of the request and its final cleanup. If the callback returns a non-zero value, that final cleanup will not be done. This function will always acquire the queue lock; there is no version for drivers which have already taken that lock. In general, though, the use of the callback functionality is likely to be a sign that the driver is being trickier than it really needs to be.

This change was accompanied by a fair number of patches converting all in-tree drivers to the new interface. The old completion functions have been removed, so out-of-tree drivers will need updating before they will work with 2.6.25.
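The return-value convention of blk_end_request() can be illustrated with a small userspace model. This is not kernel code: struct mock_request, mock_blk_end_request(), complete_in_chunks(), and fail_with_io_error() are hypothetical stand-ins that mimic only the behavior described above (zero means the request is finished, non-zero means bytes remain), with no locking and none of the housekeeping the real function performs.

```c
#include <assert.h>
#include <errno.h>

/* Userspace stand-in for the kernel's struct request (hypothetical). */
struct mock_request {
    int bytes_left;   /* bytes still to be transferred */
    int error;        /* first error reported, or 0 */
    int done;         /* set once the request has been fully completed */
};

/*
 * Mock of the completion call: account for nr_bytes, remember any error,
 * and return 0 when the request is finished or 1 when bytes remain.
 */
int mock_blk_end_request(struct mock_request *rq, int error, int nr_bytes)
{
    assert(nr_bytes >= 0);
    if (nr_bytes > rq->bytes_left)
        nr_bytes = rq->bytes_left;
    rq->bytes_left -= nr_bytes;
    if (error && !rq->error)
        rq->error = error;
    if (rq->bytes_left > 0)
        return 1;             /* driver should continue with this request */
    rq->done = 1;
    return 0;                 /* driver can forget about the request */
}

/*
 * Model of a driver that transfers chunk_bytes at a time and reports each
 * chunk until the completion call says the request is done.
 * Returns the number of completion calls made.
 */
int complete_in_chunks(struct mock_request *rq, int chunk_bytes)
{
    int calls = 0;
    int pending = 1;

    assert(chunk_bytes > 0);
    while (pending) {
        int n = rq->bytes_left < chunk_bytes ? rq->bytes_left : chunk_bytes;

        pending = mock_blk_end_request(rq, 0, n);
        calls++;
    }
    return calls;
}

/* Model of failing everything that remains with a negative error code. */
int fail_with_io_error(struct mock_request *rq)
{
    return mock_blk_end_request(rq, -EIO, rq->bytes_left);
}
```

In a real driver the loop would be spread across interrupt handlers rather than run in one place, but the contract it models is the same: keep working on the request until the completion call returns zero.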
Page editor: Jake Edge
Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds