
Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.38-rc1, released on January 18. "It's been two weeks, and the merge window for 2.6.38 is thus closed. And an interesting merge window it has been." All told, a little over 7,600 changes were merged in this merge window. Some of them affect core code in fairly invasive ways, and there have been some significant regressions reported already. People testing 2.6.38-rc1 might want to be just a little more careful (and better backed up) than usual. See the full changelog for all the details.

Stable updates: there have been no stable updates released in the last week.


Quotes of the week

Free software's awfully like sausages - wonderfully tasty, but sometimes you suddenly discover that you've been eating sheep nostrils for the past 15 years of your life.
-- Matthew Garrett

It is often the case that certain developers don't have a full understanding of the cost of the code they're writing, even when they're using C. Not everyone is forced to spend part of their life looking at the compiler output from their code and seeing what it actually does, although they ought to be.
-- David Woodhouse


Bypassing linux-next

By Jonathan Corbet
January 19, 2011
It has been almost three years since the creation of the linux-next tree; during that time, it has become an indispensable part of the kernel development process. By the time code is merged into the mainline during the merge window, it has already seen a fair amount of integration and compilation testing in linux-next - and even some actual run testing. That has helped to make the merge window run more smoothly. So it's not surprising that developers are getting increasingly grumpy when code is seen to be circumventing linux-next and creating problems in the mainline.

We've had a couple of examples of that grumpiness in the 2.6.38 cycle. When Al Viro posted his first VFS pull request, linux-next maintainer Stephen Rothwell complained that this was his first sighting of that code, despite the fact that it had apparently been around for a few months. Al is known for pulling together mainline submissions at the last minute, so this sort of thing is not entirely surprising; it remains to be seen whether he can be pushed into changing his ways.

The other complaint came after the merging of the transparent huge pages patch set, which went in by way of Andrew Morton's -mm tree. Tony Luck, having discovered that the ia64 architecture no longer built in the mainline, asked:

Didn't Andrew make some rash promise at kernel summit about stopping eating if "-mm" wasn't included in linux-next by the end of November? Must be getting pretty hungry by now.

Andrew responded that "It's taking a while - Stephen and I are discussing a plan." Integrating -mm was always going to be a bit of a challenge; linux-next is supposed to contain code which is ready for merging into the mainline, while -mm can carry under-development code for years. Until that gets worked out, though, memory management developers are going to be in a bit of a difficult position; there is no maintainer tree they can get into which feeds into linux-next. Those developers will need to either get their own trees into linux-next (an easy thing to do) or take the complaints when code which lived in -mm is seen by testers for the first time when it hits the mainline.


Kernel development news

2.6.38 merge window part 2

By Jonathan Corbet
January 19, 2011
As of the 2.6.38-rc1 release, some 7616 non-merge changesets had been pulled into the mainline kernel. A number of significant changes have been merged since last week's summary; the most interesting changes visible to users are:

  • The transparent huge pages feature has been merged. THP attempts to maximize the use of huge pages in the system (boosting performance) without requiring application changes or administrator overhead.

  • A new tool called turbostat has been added; it can be used to obtain various types of performance statistics from Intel processors. Also added is x86_energy_perf_policy, which can be used to tweak the performance/power usage tradeoff on Intel CPUs.

  • The taskstats API has been changed to use different alignments for returned values; this may break applications which were dependent on the old arrangement.

  • The kernel can now synchronize its internal time to an external pulse-per-second (PPS) signal with a high degree of accuracy. The kernel has also gained the ability to generate (and accept) PPS signals on a parallel port, assuming one can still find a computer with such a port.

  • The x86 architecture can now boot XZ-compressed kernels.

  • Basic support for multitouch panels has been added to the human input devices (HID) layer.

  • The kernel now has support for the RFC4106 AES-GCM cryptographic algorithm.

  • The fallocate() system call can now be used to punch holes in the middle of files. Currently this feature is supported by XFS and OCFS2.

  • The XFS filesystem supports the FITRIM ioctl(), allowing discard operations to be initiated from user space. "This is not intended to be run during normal workloads, as the freepsace btree walks can cause large performance degradation."

  • The LIO SCSI target core has been merged.

  • The block I/O bandwidth controller can now be used with hierarchical control groups.

  • The block layer has a new "events handling" mechanism. What that means is that detection of device events (the insertion of an optical disc, for example) can be done in the drivers, eliminating the need to poll devices from user space.

  • The device mapper dm-crypt target has a new "multikey" mode whereby different blocks can be encrypted with different keys. The crypt target is also now able to access encrypted partitions created with the out-of-tree loop-AES package.

  • The device mapper has gained the ability to manage RAID 4/5/6 volumes using the MD RAID drivers.

  • The clone() system call no longer honors the long-deprecated CLONE_STOPPED flag.

  • The btrfs filesystem has gained support for read-only snapshots and LZO compression.

  • New drivers:

    • Systems and processors: ALPHAPROJECT AP-SH4A-3A and AP-SH4AD-0A reference boards, Acme Systems srl FOX G20 boards, GeoSIG GS_IA18_S boards, and Atheros AR71XX/AR724X/AR913X SoCs.

    • Audio: HP t5325 audio devices, Realtek alc562x codecs, Wolfson Micro WM8770 and WM8995 codecs, Wolfson Micro WM8958 multi-band compressors, and Wolfson Micro WM8737 analog-to-digital converters.

    • Input: Austria Microsystem AS5011 joysticks.

    • Miscellaneous: NXP Semiconductors PN544 near-field communication chips, Oki Semiconductor ML7213 IOH GPIO controllers, Freescale MC13892 PMIC regulators, Freescale MCF548x watchdog timers, TI TPS6524X Power regulators, AMD/ATI SP5100 TCO timer/watchdog chipsets, Atheros AR71XX/AR724X/AR913X hardware watchdogs, nVidia TCO timer/watchdog devices, Intel EG20T platform controller hubs, and Maxim MAX17042/8997/8966 fuel gauges.
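
Hole punching, mentioned in the list above, is requested from user space through the mode argument of fallocate(). What follows is a minimal sketch rather than a reference implementation: it assumes the FALLOC_FL_PUNCH_HOLE and FALLOC_FL_KEEP_SIZE flags from <linux/falloc.h>, the helper names punch_hole() and punch_hole_demo() are hypothetical, and the 4096-byte block size and /tmp location are arbitrary choices for the demonstration.

```c
#define _GNU_SOURCE           /* exposes fallocate() in <fcntl.h> */
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>     /* FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: deallocate [offset, offset + len) in an open file.
 * FALLOC_FL_KEEP_SIZE leaves the file size unchanged; reads from the
 * hole return zeros.  Returns 0 on success or -1 with errno set
 * (EOPNOTSUPP when the filesystem cannot punch holes).
 */
int punch_hole(int fd, off_t offset, off_t len)
{
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}

/*
 * Demonstration on a scratch file: write three blocks, punch out the
 * middle one, and check that it reads back as zeros.
 * Returns 0 if the hole was punched and verified, 1 if the filesystem
 * (or kernel) does not support the operation, -1 on any other error.
 */
int punch_hole_demo(void)
{
    char path[] = "/tmp/punch-demo-XXXXXX";
    char buf[3 * 4096];
    int result = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                 /* the file goes away when fd is closed */

    memset(buf, 'x', sizeof(buf));
    if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
        goto out;

    if (punch_hole(fd, 4096, 4096) != 0) {
        result = (errno == EOPNOTSUPP || errno == ENOSYS) ? 1 : -1;
        goto out;
    }

    /* The punched range must now read back as zeros. */
    if (pread(fd, buf, 4096, 4096) != 4096)
        goto out;
    result = 0;
    for (size_t i = 0; i < 4096; i++)
        if (buf[i] != 0)
            result = -1;
out:
    close(fd);
    return result;
}
```

A caller should be prepared for the "unsupported" outcome; as noted above, only XFS and OCFS2 implement the feature in 2.6.38.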

Changes visible to kernel developers include:

  • A script which can automate the process of building, testing, and bisecting kernels has been added to the tools directory.

  • The "%pK" format specifier can be used to print the value of potentially sensitive kernel pointers, especially in places like /proc files. The behavior of this specifier depends on the value of /proc/sys/kernel/kptr_restrict; a value of zero means that kernel pointers will be printed as usual, one causes pointers to be printed as zero for users without CAP_SYSLOG, and two hides the pointers for all users.

  • cdev_index() has been removed; since there are no in-kernel users, nobody is likely to notice.

  • The new function kref_test_and_get() will take a reference only if the current reference count is not zero.

  • Some new dentry operations have been added to support automounters within the VFS.

  • The fallocate() filesystem callback has been moved from struct inode_operations to struct file_operations.
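
The "%pK" rules in the list above reduce to a small decision table. The following is a user-space model of that table, not kernel code; the function name pk_shows_real_pointer() and its parameters are hypothetical and exist only to make the three kptr_restrict settings concrete.

```c
#include <stdbool.h>

/*
 * User-space model (not kernel code) of how "%pK" treats a pointer,
 * following the three kptr_restrict settings described above:
 *   0: pointers are printed as usual
 *   1: pointers are printed as zero unless the reader has CAP_SYSLOG
 *   2: pointers are hidden from all users
 * Returns true when the real pointer value would be printed.
 */
bool pk_shows_real_pointer(int kptr_restrict, bool has_cap_syslog)
{
    switch (kptr_restrict) {
    case 0:
        return true;
    case 1:
        return has_cap_syslog;
    default:        /* 2; this model treats other values as equally strict */
        return false;
    }
}
```

In kernel code itself nothing more is needed than using "%pK" where "%p" would otherwise appear in the format string.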

With the 2.6.38 feature set complete, the process of stabilizing all of this new code can continue; expect a final 2.6.38 release sometime in late March.


Transparent huge pages in 2.6.38

By Jonathan Corbet
January 19, 2011
The memory management unit in almost any contemporary processor can handle multiple page sizes, but the Linux kernel almost always restricts itself to just the smallest of those sizes - 4096 bytes on most architectures. Pages which are larger than that minimum - collectively called "huge pages" - can offer better performance for some workloads, but that performance benefit has gone mostly unexploited on Linux. That may change in 2.6.38, though, with the merging of the transparent huge page feature.

Huge pages can improve performance through reduced page faults (a single fault brings in a large chunk of memory at once) and by reducing the cost of virtual to physical address translation (fewer levels of page tables must be traversed to get to the physical address). But the real advantage comes from avoiding translations altogether. If the processor must translate a virtual address, it must go through as many as four levels of page tables, each of which has a good chance of being cache-cold, and, thus, slow. For this reason, processors maintain a "translation lookaside buffer" (TLB) to cache the results of translations. The TLB is often quite small; running cpuid on your editor's aging desktop machine yields:

   cache and TLB information (2):
      0xb1: instruction TLB: 2M/4M, 4-way, 4/8 entries
      0xb0: instruction TLB: 4K, 4-way, 128 entries
      0x05: data TLB: 4M pages, 4-way, 32 entries

So there is room for 128 instruction translations, and 32 data translations. Such a small cache is easily overrun, forcing the CPU to perform large numbers of address translations. A single 2MB huge page requires a single TLB entry; the same memory, in 4KB pages, would need 512 TLB entries. Given that, it's not surprising that the use of huge pages can make programs run faster.
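
The arithmetic behind those numbers is simple enough to write down. The helpers below are only an illustration (the function names are made up for this sketch); they show why 2MB of memory costs 512 entries with 4KB pages, and that the 128-entry instruction TLB above covers just 512KB when filled with 4KB pages.

```c
#include <stddef.h>

/* Number of TLB entries needed to map `bytes` of memory with a given page size. */
size_t tlb_entries_needed(size_t bytes, size_t page_size)
{
    return (bytes + page_size - 1) / page_size;   /* round up to whole pages */
}

/* Amount of memory covered by a fully populated TLB (its "reach"). */
size_t tlb_reach(size_t entries, size_t page_size)
{
    return entries * page_size;
}
```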

The main kernel address space is mapped with huge pages, reducing TLB pressure from kernel code. The only way for user-space to take advantage of huge pages in current kernels, though, is through the hugetlbfs, which was extensively documented here in early 2010. Using hugetlbfs requires significant work from both application developers and system administrators; huge pages must be set aside at boot time, and applications must map them explicitly. The process is fiddly enough that use of hugetlbfs is restricted to those who really care and who have the time to mess with it. Hugetlbfs is often seen as a feature for large, proprietary database management systems and little else.

There would be real value in a mechanism which would make the use of huge pages easy, preferably requiring no development or administrative attention at all. That is the goal of the transparent huge pages (THP) patch, which was written by Andrea Arcangeli and merged for 2.6.38. In short, THP tries to make huge pages "just happen" in situations where they would be useful.

Current Linux kernels assume that all pages found within a given virtual memory area (VMA) will be the same size. To make THP work, Andrea had to start by getting rid of that assumption; thus, much of the initial part of the patch series is dedicated to enabling mixed page sizes within a VMA. Then the patch modifies the page fault handler in a simple way: when a fault happens, the kernel will attempt to allocate a huge page to satisfy it. Should the allocation succeed, the huge page will be filled, any existing small pages in the new page's address range will be released, and the huge page will be inserted into the VMA. If no huge pages are available, the kernel falls back to small pages and the application never knows the difference.

This scheme will increase the use of huge pages transparently, but it does not yet solve the whole problem. Huge pages must be swappable, lest the system run out of memory in a hurry. Rather than complicate the swapping code with an understanding of huge pages, Andrea simply splits a huge page back into its component small pages if that page needs to be reclaimed. Many other operations (mprotect(), mlock(), ...) will also result in the splitting of a page.

The allocation of huge pages depends on the availability of large, physically-contiguous chunks of memory - something which Linux kernel programmers can never count on. It is to be expected that those pages will become available at inconvenient times - just after a process has faulted in a number of small pages, for example. The THP patch tries to improve this situation through the addition of a "khugepaged" kernel thread. That thread will occasionally attempt to allocate a huge page; if it succeeds, it will scan through memory looking for a place where that huge page can be substituted for a bunch of smaller pages. Thus, available huge pages should be quickly placed into service, maximizing the use of huge pages in the system as a whole.

The current patch only works with anonymous pages; the work to integrate huge pages with the page cache has not yet been done. It also only handles one huge page size (2MB). Even so, some useful performance improvements can be seen. Mel Gorman ran some benchmarks showing improvements of up to 10% or so in some situations. In general, the results were not as good as could be obtained with hugetlbfs, but THP is much more likely to actually be used.

No application changes need to be made to take advantage of THP, but interested application developers can try to optimize their use of it. A call to madvise() with the MADV_HUGEPAGE flag will mark a memory range as being especially suited to huge pages, while MADV_NOHUGEPAGE will suggest that huge pages are better used elsewhere. For applications that want to use huge pages, use of posix_memalign() can help to ensure that large allocations are aligned to huge page (2MB) boundaries.
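
As a rough sketch of what such an optimization might look like, the helper below combines posix_memalign() with madvise(). The name alloc_huge_hinted() and the HUGE_PAGE_SIZE constant are assumptions made for this example; the hint is treated as purely advisory, so a kernel without THP simply ignores or rejects it and the memory remains usable.

```c
#define _GNU_SOURCE           /* for madvise() and MADV_HUGEPAGE */
#include <stdlib.h>
#include <sys/mman.h>

#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)   /* 2MB, the one size THP handles */

/*
 * Hypothetical helper: allocate at least `len` bytes aligned to a
 * huge-page boundary and mark the range as well suited to huge pages.
 * A failing madvise() is deliberately ignored, since the hint is only
 * advisory.  Returns NULL if the allocation itself fails; release the
 * memory with free().
 */
void *alloc_huge_hinted(size_t len)
{
    void *p = NULL;

    /* Round up so that the region covers whole huge pages. */
    len = (len + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);

    if (posix_memalign(&p, HUGE_PAGE_SIZE, len) != 0)
        return NULL;

#ifdef MADV_HUGEPAGE          /* absent from older C library headers */
    (void)madvise(p, len, MADV_HUGEPAGE);
#endif
    return p;
}
```

Rounding the length up to a multiple of 2MB keeps the hinted range from ending partway through a huge page; whether any huge pages are actually used is still up to the kernel.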

System administrators have a number of knobs that they can tweak, all found under /sys/kernel/mm/transparent_hugepage. The enabled value can be set to "always" (to always use THP), "madvise" (to use huge pages only in VMAs marked with MADV_HUGEPAGE), or "never" (to disable the feature). Another knob, defrag, takes the same values; it controls whether the kernel should make aggressive use of memory compaction to make more huge pages available. There's also a whole set of parameters controlling the operation of the khugepaged thread; see Documentation/vm/transhuge.txt for all the details.
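
A program can check the current policy by reading the enabled file. The sketch below assumes that the file lists all three values on one line with the active one in square brackets (for example "always [madvise] never"); that format is not described in this article, and both function names are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Extract the bracketed word from a line such as "always [madvise] never".
 * Copies it into `out` (NUL-terminated) and returns 0, or returns -1 if
 * no bracketed word is found or it does not fit in `outlen` bytes.
 */
int thp_active_setting(const char *line, char *out, size_t outlen)
{
    const char *lb = strchr(line, '[');
    const char *rb = lb ? strchr(lb, ']') : NULL;
    size_t n;

    if (!lb || !rb)
        return -1;
    n = (size_t)(rb - lb - 1);
    if (n == 0 || n >= outlen)
        return -1;
    memcpy(out, lb + 1, n);
    out[n] = '\0';
    return 0;
}

/* Read the current THP policy from sysfs; returns -1 if it is unavailable. */
int thp_read_enabled(char *out, size_t outlen)
{
    char line[128];
    int rc = -1;
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");

    if (!f)
        return -1;      /* kernel without THP, or sysfs not mounted */
    if (fgets(line, sizeof(line), f))
        rc = thp_active_setting(line, out, outlen);
    fclose(f);
    return rc;
}
```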

The THP patch has had a bit of a rough ride since being merged into the mainline. This code never appeared in linux-next, so it surprised some architecture maintainers when it caused build failures in the mainline. Some bugs have also been found - unsurprising for a patch which is this large and which affects so much core code. Those problems are being ironed out, so, while 2.6.38-rc1 testers might want to be careful, THP should be in a usable state by the time the final 2.6.38 kernel is released.


Reworking disk events handling

By Jonathan Corbet
January 19, 2011
A look inside any contemporary desktop-oriented system is likely to reveal a process which is steadfastly polling removable drives on the off chance that somebody might have removed or inserted a disk. Indeed, as your editor can attest, it can be hard to turn that polling off; there's little room in the world for strange people who have their own ideas of what they want to happen when they put a disk into a drive. Be that as it may, if the system is going to poll drives, it would be nice to do so in the best way possible. That is not currently the case, but, thanks to a patch by Tejun Heo, drive polling should be better in 2.6.38.

There are a few problems with how polling is done on Linux; these were nicely outlined by Tejun in the patch changelog. Polling on Linux requires opening the device; this is a somewhat heavyweight operation which does not naturally line up with other operations which might wake the processor. Opening the device in this way might interfere with other users; optical disk burning, in particular, is susceptible to this kind of problem. And polling the disk in this way generates a different set of commands than Windows uses; as Linux driver developers have discovered many times, behavior that differs from Windows is not well tested by vendors and tends to have unpleasant bugs. All that notwithstanding, user-space polling works well enough most of the time, but it would still be nice to make it better.

Tejun's patch works by moving the polling into the kernel. That makes the polling more efficient by removing the need to open the device and by allowing the kernel to delay polling wakeups until something else is going on as well. There is a new function added to struct block_device_operations which should be implemented by drivers:

    unsigned int (*check_events) (struct gendisk *disk, unsigned int clearing);

This function should check the device for new events and return a mask of any which were found. Two events are currently defined: DISK_EVENT_MEDIA_CHANGE and DISK_EVENT_EJECT_REQUEST, the latter of which is new. The clearing parameter is a mask of events which should be cleared until they happen again.

The old media_changed() block device operation still exists, but its use has been deprecated; drivers should be updated to use check_events() instead. Drivers should also, before adding a block device, initialize two new struct gendisk fields:

    unsigned int events;
    unsigned int async_events;

A mask of all events which can be reported by the device should be stored in events, while async_events should list the events which can be reported without needing to poll the device.
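
To make the interface a little more concrete, here is a user-space mock of a driver's check_events() implementation. It is not kernel code: struct mock_gendisk, the hw_pending field standing in for latched hardware status, and the bit values given to the event constants are all stand-ins invented for this sketch, and the handling of the clearing mask is one plausible reading of the description above.

```c
/* Stand-in event bits; the values are chosen for this mock only. */
#define DISK_EVENT_MEDIA_CHANGE   (1u << 0)
#define DISK_EVENT_EJECT_REQUEST  (1u << 1)

/* Mock of the relevant struct gendisk fields plus fake hardware state. */
struct mock_gendisk {
    unsigned int events;        /* all events the device can report */
    unsigned int async_events;  /* subset reported without polling */
    unsigned int hw_pending;    /* hypothetical latched hardware status */
};

/*
 * Shaped like the check_events() operation: return a mask of the events
 * found, then reset those named in `clearing` until they happen again.
 */
unsigned int mock_check_events(struct mock_gendisk *disk, unsigned int clearing)
{
    unsigned int found = disk->hw_pending & disk->events;

    disk->hw_pending &= ~clearing;
    return found;
}
```

A real driver would query the hardware instead of a stored mask, and would fill in events and async_events before adding the disk.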

A new parameter (block.events_dfl_poll_msecs) tells the kernel how often it should (by default) poll block devices. A value of zero (the default, currently) disables polling entirely. Polling intervals for specific devices can be set in their sysfs directories. If a device says that it can report all events asynchronously, and no polling interval has been explicitly set for it, that device will not be polled at all.
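
The rules in the preceding paragraph can be modeled in a few lines. This is a sketch of the decision as described here, not the kernel's implementation; struct poll_config, the function name, and the convention that a negative per-device value means "not set" are assumptions made for the example.

```c
struct poll_config {
    unsigned int events;        /* events the device can report */
    unsigned int async_events;  /* events reported without polling */
    long device_poll_msecs;     /* per-device setting; negative = not set (assumed) */
};

/*
 * Returns the polling interval in milliseconds for one device, with 0
 * meaning "do not poll".  `default_msecs` plays the role of the
 * system-wide default interval.
 */
unsigned long effective_poll_msecs(const struct poll_config *cfg,
                                   unsigned long default_msecs)
{
    if (cfg->device_poll_msecs >= 0)
        return (unsigned long)cfg->device_poll_msecs;  /* explicit setting wins */
    if (cfg->events & ~cfg->async_events)
        return default_msecs;   /* some event can only be seen by polling */
    return 0;                   /* everything is reported asynchronously */
}
```

With the default interval left at zero, every device without an explicit setting ends up unpolled, which matches the current out-of-the-box behavior.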

Since user space is no longer polling the device with this scheme, it needs a new way to find out when a disk event has happened. These events are now signaled via a uevent, meaning they can be handled via udev or some other utility which is watching those events. Note that any driver which handles asynchronous event reporting must call kobject_uevent_env() itself to send the event to user space. No driver in 2.6.38-rc1 does that; the first developer to add such a call may want to add a helper function to the core block code for that purpose.

Since polling is disabled by default, the kernel will behave the way it always has and existing user-space applications will continue to work. Once the user-space environments have been changed to take advantage of this feature, they can turn on kernel polling and stop opening the devices themselves. That should lead to lower power consumption and more reliable operation, which can only be a good thing.


Page editor: Jonathan Corbet

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds