|
|
Log in / Subscribe / Register

Kernel development

Brief items

Kernel release status

The stable kernel team released the 2.6.17.1 kernel with a security fix for the SCTP network protocol, and the 2.6.16.21 kernel with the same SCTP fix, and a PPC 32bit fix and a local denial of service fix.

Since the 2.6.17 kernel release Linus's tree has started filling up very quickly, with almost 1000 patches being accepted in 3 days. These changes include a very large MIPS update, ARM update, SCSI update, PCI Hotplug update, wireless networking update, network update, network driver update, userspace cleanup of the header files, and a firewire update.

Comments (4 posted)

Kernel development news

Quote of the week

The thing is, I don't actually enjoy debugging my own machines. I _much_ prefer having other people debug _their_ machines, and fixing my machine in the process.

-- Linus Torvalds

Comments (1 posted)

KHB: Transparent support for large pages

June 19, 2006

This article was contributed by Valerie Aurora

Introduction: The Kernel Hacker's Bookshelf

A lot of great operating systems research goes on, but relatively little of it makes the leap into production operating systems, or from one operating system to another. The ideas that do trickle down into implementation are often delayed by years. Usually an idea gets ignored because it looked good in the research lab but turned out not to be practical in a production environment. But every so often, a practical idea goes unnoticed for years simply because none of the actual coders has the time to sit down and parse fifteen pages of dry academic prose. You're too busy writing code, can't someone make it easier to figure out which books and papers are worth reading?

Welcome to The Kernel Hacker's Bookshelf. The goal of this series is to bring good research and good kernel hackers together through reviews focusing on the practical aspects of research, written in plain (possibly even entertaining) language. We hope you enjoy reading these articles - and writing code inspired by them!

Transparent operating systems support for large pages

While Moore's Law tramped inexorably on during the last few decades, increasing memory size and disk space along with transistor density, it left some elements of computer architecture in the dust. One of these stragglers is TLB coverage. The TLB (or Translation Look-aside Buffer) caches translations between virtual and physical memory addresses; usually every memory access requires a translation. Performance is best when all the needed translations can fit in the TLB and translations "hit" in the TLB instead of missing. The amount of memory translated by the entries in the TLB is called the TLB coverage. TLB coverage has been dropping as a fraction of total memory (and, more importantly, as a fraction of the total size of Netscape - er, Mozilla - er, Firefox), and TLB misses are often a serious drag on system performance.

Since translations are done on a per-page basis, one solution is to increase the size of the system pages. We could increase the base page size - the smallest page available - but that would typically waste a lot of memory, cause more page-outs, trigger unexpected application bugs (ask me about the one with the JVM default stack size and 64KB pages some time), and make the system slower overall. Instead, many processors now offer multiple page sizes, beginning with a base page size of 4KB or 8KB and ranging up to a large page size of 2MB or occasionally a truly monstrous page size of 256MB or larger. Large pages increase TLB coverage and can reduce TLB misses significantly, often improving the performance of applications with large working sets by 10-15%. On the other hand, large pages can reduce performance by increasing the cost of paging memory in and out and adding the overhead of tracking several different page sizes. Implementing automatic, transparent OS-level support for large pages while simultaneously improving overall performance is not easy. It's also what Linux users are clamoring for - and some of them are switching to operating systems that already have automatic large page support (cough, cough, Solaris).

A Solution: The Rice Paper

Practical, Transparent Operating System Support for Superpages by Juan Navarro, et al., describes a sophisticated and elegant implementation of transparent large pages. The authors implemented their system on FreeBSD on the Alpha processor, using 4 page sizes: 8KB, 64KB, 512KB, and 4MB. The paper was published in 2002, otherwise they might have picked a less ill-omened architecture than Alpha; fortunately the design is reasonably generic. Overall, this paper is one of the best I've ever read.

The basic design is reservation-based; that is, enough pages to make a large page are reserved in advance and later promoted to a large page when justified. Memory fragmentation is reined in via careful grouping of page types and a smarter page replacement policy. Almost all applications tested saw at least some speed-up, and absolute worst case performance degradation varied from 2-9%. Most amazing of all, the implementation only required about 3500 lines of code - about half of an ext2. How exactly did they accomplish all this? Buckle up for some nitty-gritty details.

First, a run of contiguous pages suitable for a large page is reserved whenever an application page fault occurs (outside an existing reservation, of course). The size of the reservation is picked based on the size and type of the memory object, with slight variations depending on whether the object is fixed in size (e.g., text) or might grow (e.g., stack). For example, an application with 700KB of text would have a 512KB page reserved the first time a page in the text was faulted into memory. Once a large page of any size has been fully populated (all of its pages referenced at least once), it is promoted into a large page. In our example, once a contiguous 64KB region anywhere in the program text has been faulted in, it will be promoted to a 64KB page. Promotion of a partially populated page is possible, but the trade off is that it may increase the application's total memory usage, unintentionally creating a memory hog.

In the rough and tumble world of scarce memory, promotion is not a one-way street. Demotion of large pages into smaller pages is also useful. An application may start out using all pages in a large page but then stop referencing most of the pages. The only way to tell is to demote the large page and check the referenced bits on the smaller pages a little while later. A page is demoted when it is first written, when one of its base pages is evicted, and periodically when the system is under memory pressure.

When an application wants more memory and no free space is available, unused parts of a reservation are preempted. "Use it or lose it" is the name of the game here. The reservation which loses is the one whose most recent allocation occurred least recently - LRU order, basically - since most applications touch most of their working set soon after starting up, and so it's unlikely the original owner of the reservation will need the space. Unused reservations live on different lists depending on the size of the allocation that can be made by preempting the reservation. A population map, implemented as a radix tree, keeps track of which pages are allocated inside each large page-sized extent for easy look up. This radix tree is a key data structure; it makes allocation, reservation, and promotion decisions fast and simple.

The final key elements are the page replacement policy and the way pages of various types are grouped together. There are several different kinds of pages in the system. Some pages can't be moved or freed (pinned), some pages are in use but can be moved (active), and some pages are not currently used by anyone but may be used in the future (cached and/or inactive). If these pages are mixed together indiscriminately, pinned and active pages end up scattered everywhere, without any contiguous runs of free (or free-able) pages that can be converted into hotly pursued large pages. Fragmentation needs to be both prevented and repaired - without hurting performance by moving around pages too much.

Pinned pages are the most difficult problem, since once allocated they cannot be moved and may never be freed. The system tries to allocate these pages in clusters, so they break up as few potential large pages as possible. Similarly, cache pages are allocated in clusters with free pages, since cached pages can be easily freed to allow the creation of a large page. Reservations can include cache pages, and cached pages contained inside a reservation continue to be active until the application actually needs to kick that page out.

The page replacement daemon was changed to run not only when free memory runs low, but also when contiguity runs low. An "innocent until proven guilty" algorithm works here - we assume we don't need more contiguity until a large page reservation fails for lack of contiguity. When woken for this reason, the daemon runs just long enough to recover enough contiguous space to satisfy the allocations that failed. The page aging algorithm was changed slightly from the FreeBSD default; cached pages for a file are marked inactive on the last close, trading off the chance of the file being reopened against the opportunity for more contiguity.

Evaluating the System

The authors tested their system against a truly startling variety of applications, everything from gzip to web server trace replays to fast Fourier transforms, as well as a section exploring worst case situations. Personally, I'm not sure I've ever seen a better evaluation in a research paper; it's quite a treat to read.

In the best case, with low fragmentation, 33 out of 35 applications showed some improvement (one was unchanged, and the other was about 2% slower). Several had significant improvements. For example, rotating an image using ImageMagick was about 20% faster; linking the FreeBSD kernel was about 30% faster; bzip2 was 14% faster. In the fragmented case, performance was not as good, but usually to picked up again after a few runs as the page replacement daemon moved things around. In the worst-case department, the performance was degraded by about 9% for an application that only touched one byte per large page before freeing it, and by about 2% for a test case in which large page promotion was turned off. It makes for a pretty convincing case that large pages are an overall win for many systems.

Implications for Linux

What does this paper tell us? It is possible to implement transparent large page support in such a way that most applications get at least some benefit, and some applications get a lot of benefit. The algorithms used are relatively simply to understand and implement, and hold up well in worst case behavior. Finally, transparent large pages can be implemented elegantly and cleanly - only 3500 lines of code! Best of all, this paper includes a plethora of implementation details and smart algorithms, just begging to be reused. All of the above earns this paper a hallowed place on the Kernel Hacker's Bookshelf.

Over the past few years, several Linux developers have been working on various forms of transparent large page support. Some of that recent work, spearheaded by Mel Gorman, has been reviewed earlier in LWN:

Current work on large pages in Linux is summarized on the linux-mm wiki.

I look forward to more work in this fascinating and fertile area of operating systems implementation.

[Do you have a favorite textbook or systems paper? Of course you do. Send your suggestions to:

val dot henson at gmail dot com

Valerie Henson is a Linux kernel developer working for Intel. Her interests include file systems, networking, women in computing, and walking up and down large mountains. She is always looking for good systems programmers, so send her some email and introduce yourself.]

Comments (19 posted)

Driver core finally changing

June 21, 2006

This article was contributed by Greg Kroah-Hartman.

Back in November of last year, I wrote a list of the steps that were going to happen for the future of the kernel driver core. Finally, some of the steps that were described there have been implemented.

Making struct class_device go away

In the -mm kernel tree, there is a small patch that allows almost all users within the kernel of the struct class_device structure to convert over to use a struct device structure instead. This patch changes the struct device structure by adding the following fields:
struct device_attribute *devt_attr;
struct list_head        node;
struct class            *class;
dev_t                   devt;

The first two fields, devt_attr and node are used internally by the driver core code, and should not be touched by anything else. The other two fields class and devt are what is used by any code wishing to convert to the struct device structure.

If the field class is set by someone, before the struct device is registered, the driver core assumes that this struct device is associated with the specified struct class. This means that the device is added to the list of all devices attached to that class, and a symlink is created in the class's directory in sysfs, showing that it is present.

If the field devt is set, then a file named dev is created in the sysfs directory for the device, containing the major and minor number of the device. This is what programs like udev use in order to properly set up the /dev tree dynamically depending on what devices are present in the system.

As an example of what the sysfs changes are when these fields are set, look at the usb_device class code that has been converted to use this new interface in the latest -mm release.

The /sys/class/usb_device directory in the 2.6.17 kernel release looked something like this for most systems:

$ tree /sys/class/usb_device/
/sys/class/usb_device/
|-- usbdev1.1
|   |-- dev
|   |-- device -> ../../../devices/pci0000:00/0000:00:1d.7/usb1
|   `-- uevent
|-- usbdev2.1
|   |-- dev
|   |-- device -> ../../../devices/pci0000:00/0000:00:1d.0/usb2
|   `-- uevent
|-- usbdev3.1
|   |-- dev
|   |-- device -> ../../../devices/pci0000:00/0000:00:1d.1/usb3
|   `-- uevent
|-- usbdev4.1
|   |-- dev
|   |-- device -> ../../../devices/pci0000:00/0000:00:1d.2/usb4
|   `-- uevent
`-- usbdev4.3
    |-- dev
    |-- device -> ../../../devices/pci0000:00/0000:00:1d.2/usb4/4-1
    `-- uevent
But now, converted over to use the struct device structure instead of struct class_device, it looks like:
/sys/class/usb_device/
|-- usbdev1.1 -> ../../../devices/pci0000:00/0000:00:1d.7/usb1/usbdev1.1
|-- usbdev2.1 -> ../../../devices/pci0000:00/0000:00:1d.0/usb2/usbdev2.1
|-- usbdev3.1 -> ../../../devices/pci0000:00/0000:00:1d.1/usb3/usbdev3.1
|-- usbdev4.1 -> ../../../devices/pci0000:00/0000:00:1d.2/usb4/usbdev4.1
`-- usbdev4.3 -> ../../../devices/pci0000:00/0000:00:1d.2/usb4/4-1/usbdev4.3
What this has accomplished is to move the USB device structure that used to be sitting out in the class directory, into the device tree itself in sysfs, providing a unified device tree, without needing to look in two different locations in sysfs to find the information.

Helper functions

In order to make the transition to converting existing kernel code to use the struct device structure instead of the struct class_device structure, two new functions have been introduced into the driver core:
struct device *device_create(struct class *cls, struct device *parent,
                             dev_t devt, char *fmt, ...)
                             __attribute__((format(printf,4,5)));
void device_destroy(struct class *cls, dev_t devt);

The function device_create, works almost identically to the existing kernel function, class_device_create. It dynamically creates a struct device structure, with all of the specified information, and registers it with the driver core and sysfs.

The other new function, device_destroy, is used to remove any struct device structures that were created with a call to device_create earlier. It is almost identical to the existing function, class_device_destroy

An example of how simple it is to convert existing code can be seen in the patch that does the conversion for the usb_device class code.

Slowly over time, all users of struct class_device will be converted over to using struct device and then, struct class_device will be removed from the kernel. And hopefully, the other tasks outlined in that original article , will also get accomplished.

Comments (none posted)

More sysfs symlinks

June 21, 2006

This article was contributed by Greg Kroah-Hartman.

Another new change in sysfs that will be in the 2.6.18 kernel release, is the addition of another symlink in all device and class device directories. Kay Sievers has written a patch that adds the symlink "subsystem" to these directories. This symlink points back to either the class that the device is associated with, or the bus that the device is associated with.

This symlink is identical to the information that the kernel has always been emitting to userspace through the hotplug interface whenever a device was created or removed from the system. Userspace uses the subsystem information in order to determine what to do with the device.

If you look at the older hotplug package, it is broken down into a set of different scripts that run depending on the subsystem that is being addressed:

$ ls /etc/hotplug/*.rc
/etc/hotplug/input.rc
/etc/hotplug/isapnp.rc
/etc/hotplug/pci.rc
/etc/hotplug/pnp.rc
/etc/hotplug/usb.rc
And udev rules also act on the subsystem type in order to determine what to do with the device:
$ head -n 3 /etc/udev/rules.d/05-udev-early.rules
# ignore these events until someone needs them
SUBSYSTEM=="drivers",   OPTIONS="ignore_device"
SUBSYSTEM=="module",    OPTIONS="ignore_device"
But before this kernel patch, if a program wanted to walk through sysfs and try to determine the subsystem that a specific device was associated with, they had to do the following steps:
  • If this is a device, look for the bus symlink and follow it.
  • If this is a class device, go up a directory and see if this is a class directory. If not, go up another directory, until the class is found.
Now, with the subsystem symlink, this logic can be greatly simplified, as only this symlink needs to be followed in order to determine the subsystem that the device is associated with:
$ tree /sys/class/tty/ttyS0/
/sys/class/tty/ttyS0/
|-- dev
|-- device -> ../../../devices/platform/serial8250
|-- subsystem -> ../../../class/tty
`-- uevent

$ tree /sys/devices/pci0000:00/0000:00:00.0/
/sys/devices/pci0000:00/0000:00:00.0/
|-- broken_parity_status
|-- bus -> ../../../bus/pci
|-- class
|-- config
|-- device
|-- driver -> ../../../bus/pci/drivers/e752x_edac
|-- enable
|-- irq
|-- local_cpus
|-- modalias
|-- power
|   |-- state
|   `-- wakeup
|-- resource
|-- subsystem -> ../../../bus/pci
|-- subsystem_device
|-- subsystem_vendor
|-- uevent
`-- vendor

Comments (3 posted)

Detecting kernel memory leaks

The implementation language for the Linux kernel is C. That choice makes a great deal of sense; C does a good job of staying out of the way and letting programmers control exactly what is happening. Anybody who does any significant amount of C programming, however, eventually ends up chasing down memory leaks. Since C forces programmers to track every block of allocated memory and clean up their own messes, things occasionally slip through the cracks. Memory leaks can be a problem in applications, especially those which run for a long time - ask any Firefox user. But kernel memory leaks are worse; every time the kernel drops a piece of memory, it is gone until the next boot. A system with a serious kernel memory leak will quickly become unusable.

Tracking down memory leaks can be painful work. When a proprietary memory allocation tracking tool became available for SunOS many years ago, your editor had no qualms about spending thousands of his employer's dollars to license it; the payback time was quite short. In current times, Linux users can employ a free tool like valgrind (version 3.2.0 was released on June 8) to track down user-space memory leaks. But valgrind does not work on a running kernel. (Some work has been done on running User-mode Linux under valgrind, but sometimes one simply has to debug the host system).

As the kernel developers rely more heavily on automated tools for finding bugs, the creation of a kernel memory leak detector is an obvious next step. Catalin Marinas has taken that step with a kernel memory leak detector patch series. This code, if accepted into the kernel, should help to eliminate another big class of errors.

Catalin's patch functions much like a scan-and-mark garbage collector. The first step is to track every memory allocation in the system; to that end, the patch instruments the slab allocator. Every block allocated from a slab (which will include allocations from kmalloc()) is stored in a radix tree; along with a pointer to the block, the stored information includes the block size and a stack trace identifying where the block was allocated. When blocks are freed, their corresponding entries are removed from the radix tree.

During normal system operation, this radix tree just sits there. Should somebody ask about memory leaks (by reading /sys/kernel/debug/memleak), the detection algorithm swings into action. The steps performed are:

  • A big list is created holding every outstanding memory allocation in the system. This list is called the "white" list; everything on it is considered to be a possible memory leak.

  • Various parts of memory are scanned for pointers which match the allocated blocks; every time such a pointer is found, the block is moved to the "gray" list of memory which is still reachable, and thus not leaked. The initial scan includes the kernel's static data areas, each process's kernel stack, and each processor's per-CPU variable data area.

  • The first scan finds all memory referenced directly from static memory, but kernel data structures are more complicated than that. So, each block which has been put onto the gray list is scanned as well. Most of these blocks will be structures allocated from a slab cache, and they may contain pointers to other structures. So each block is queried, paying attention to that block's remembered size. Any pointers found within the block are moved over to the gray list, and scanned in turn. There is, of course, a provision for remembering which blocks have been scanned and avoiding infinite loops.

  • Once all pointers on the gray list have been scanned, every block of memory reachable by the kernel has been located. Anything remaining on the white list is considered to be leaked, and the relevant information is sent back to user space.

In the real world, things get complicated, so the leak detector is not quite as simple as described above. One situation which had to be addressed is cases where the kernel keeps a pointer to the interior of a block of memory, rather than to the beginning. This happens frequently; many kernel structures are located by way of an embedded list_head structure or kobject, for example. As a way of locating these blocks, the memory leak detector records uses of the container_of() macro; in particular, it remembers the size of the block and the offset to the embedded structure. When a block of a given size is allocated, the detector records "alias" addresses for any possible embedded structures. A pointer to one of those aliases is considered to be equivalent to a pointer to the beginning of the block.

There are various other special cases which must be handled. For example, memory obtained from vmalloc() will be pointed to by the memory allocation code itself, but might still be leaked. In other cases, memory is allocated which cannot be found by the scanning algorithm; a number of special annotations are added to the kernel to suppress the resulting false positive reports. The detector can also be fooled by pointers which are left behind in disused memory, or by random data which happens to look like a pointer to an allocated block; in these cases, false-negatives will result.

Even with these problems, the situation is better than before - a lot of memory leak situations can be found. Ingo Molnar, however, has a vision of a more ambitious scheme wherein type information for every allocated block would be retained. Among other things, this information would allow the scanning to be restricted to parts of the block known to contain pointers; that should speed the process and reduce false negatives. Since type information is available, each scanned pointer could be checked to ensure that it points to a block of the correct type, adding another level of checking to the kernel. Implementing all of this looks like a big task, however; even Ingo may need a couple of days to get it done.

Comments (22 posted)

Patches and updates

Kernel trees

Con Kolivas 2.6.17-ck1 patches ?
Chris Wright Linux 2.6.17.1 ?

Architecture-specific

Core kernel code

Peter Williams sched: Add CPU rate caps ?
David Howells frv: __user infrastructure ?

Development tools

Device drivers

Documentation

Filesystems and block I/O

Memory management

Networking

Security-related

Miscellaneous

Page editor: Forrest Cook
Next page: Distributions>>


Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds