Release status
Kernel release status
The stable kernel team released the
2.6.17.1 kernel
with a security fix for the SCTP network protocol, and the
2.6.16.21 kernel
with the same SCTP fix, and a PPC 32bit fix and a local
denial of service fix.
Since the 2.6.17 kernel
release Linus's tree has started filling
up very quickly, with almost 1000 patches being accepted in 3
days. These changes include a very large MIPS update, ARM
update, SCSI update, PCI Hotplug update, wireless networking
update, network update, network driver update, userspace cleanup
of the header files, and a firewire update.
Comments (4 posted)
Kernel development news
Quote of the week
The thing is, I don't actually enjoy debugging my own machines. I
_much_ prefer having other people debug _their_ machines, and fixing my
machine in the process.
-- Linus Torvalds
Comments (1 posted)
KHB: Transparent support for large pages
June 19, 2006
This article was contributed by Valerie Henson
Introduction: The Kernel Hacker's Bookshelf
A lot of great operating systems research goes on, but relatively
little of it makes the leap into production operating systems, or from
one operating system to another. The ideas that do trickle down into
implementation are often delayed by years. Usually an idea gets
ignored because it looked good in the research lab but turned out not
to be practical in a production environment. But every so often, a
practical idea goes unnoticed for years simply because none of the
actual coders has the time to sit down and parse fifteen pages of dry
academic prose. You're too busy writing code, can't someone make it
easier to figure out which books and papers are worth reading?
Welcome to The Kernel Hacker's Bookshelf. The goal of this series is
to bring good research and good kernel hackers together through
reviews focusing on the practical aspects of research, written in
plain (possibly even entertaining) language. We hope you enjoy
reading these articles - and writing code inspired by them!
Transparent operating systems support for large pages
While Moore's Law tramped inexorably on during the last few decades,
increasing memory size and disk space along with transistor density,
it left some elements of computer architecture in the dust. One of
these stragglers is TLB coverage. The TLB (or Translation Look-aside
Buffer) caches translations between virtual and physical memory
addresses; usually every memory access requires a translation.
Performance is best when all the needed translations can fit in the
TLB and translations "hit" in the TLB instead of missing. The amount
of memory translated by the entries in the TLB is called the TLB
coverage. TLB coverage has been dropping as a fraction of total
memory (and, more importantly, as a fraction of the total size of
Netscape - er, Mozilla - er, Firefox), and TLB misses are often a
serious drag on system performance.
Since translations are done on a per-page basis, one solution is to
increase the size of the system pages. We could increase the base
page size - the smallest page available - but that would typically
waste a lot of memory, cause more page-outs, trigger unexpected
application bugs (ask me about the one with the JVM default stack size
and 64KB pages some time), and make the system slower overall.
Instead, many processors now offer multiple page sizes, beginning with
a base page size of 4KB or 8KB and ranging up to a large page size of
2MB or occasionally a truly monstrous page size of 256MB or larger.
Large pages increase TLB coverage and can reduce TLB misses
significantly, often improving the performance of applications with
large working sets by 10-15%. On the other hand, large pages can
reduce performance by increasing the cost of paging memory in and out
and adding the overhead of tracking several different page sizes.
Implementing automatic, transparent OS-level support for large pages
while simultaneously improving overall performance is not easy. It's
also what Linux users are clamoring for - and some of them are
switching to operating systems that already have automatic large page
support (cough, cough, Solaris).
A Solution: The Rice Paper
Practical,
Transparent Operating System Support for Superpages by Juan
Navarro, et al., describes a sophisticated and elegant implementation
of transparent large pages. The authors implemented their system on
FreeBSD on the Alpha processor, using 4 page sizes: 8KB, 64KB, 512KB,
and 4MB. The paper was published in 2002, otherwise they might have
picked a less ill-omened architecture than Alpha; fortunately the
design is reasonably generic. Overall, this paper is one of the best
I've ever read.
The basic design is reservation-based; that is, enough pages to make a
large page are reserved in advance and later promoted to a large page
when justified. Memory fragmentation is reined in via careful
grouping of page types and a smarter page replacement policy. Almost
all applications tested saw at least some speed-up, and absolute worst
case performance degradation varied from 2-9%. Most amazing of all,
the implementation only required about 3500 lines of code - about half
of an ext2. How exactly did they accomplish all this? Buckle up for
some nitty-gritty details.
First, a run of contiguous pages suitable for a large page is reserved
whenever an application page fault occurs (outside an existing
reservation, of course). The size of the reservation is picked based
on the size and type of the memory object, with slight variations
depending on whether the object is fixed in size (e.g., text) or might
grow (e.g., stack). For example, an application with 700KB of text
would have a 512KB page reserved the first time a page in the text was
faulted into memory. Once a large page of any size has been fully
populated (all of its pages referenced at least once), it is promoted
into a large page. In our example, once a contiguous 64KB region
anywhere in the program text has been faulted in, it will be promoted
to a 64KB page. Promotion of a partially populated page is possible,
but the trade off is that it may increase the application's total
memory usage, unintentionally creating a memory hog.
In the rough and tumble world of scarce memory, promotion is not a
one-way street. Demotion of large pages into smaller pages is also
useful. An application may start out using all pages in a large page
but then stop referencing most of the pages. The only way to tell is
to demote the large page and check the referenced bits on the smaller
pages a little while later. A page is demoted when it is first
written, when one of its base pages is evicted, and periodically when
the system is under memory pressure.
When an application wants more memory and no free space is available,
unused parts of a reservation are preempted. "Use it or lose it" is
the name of the game here. The reservation which loses is the one
whose most recent allocation occurred least recently - LRU order,
basically - since most applications touch most of their working set
soon after starting up, and so it's unlikely the original owner of the
reservation will need the space. Unused reservations live on
different lists depending on the size of the allocation that can be
made by preempting the reservation. A population map, implemented as
a radix tree, keeps track of which pages are allocated inside each
large page-sized extent for easy look up. This radix tree is a key
data structure; it makes allocation, reservation, and promotion
decisions fast and simple.
The final key elements are the page replacement policy and the way
pages of various types are grouped together. There are several
different kinds of pages in the system. Some pages can't be moved or
freed (pinned), some pages are in use but can be moved (active), and
some pages are not currently used by anyone but may be used in the
future (cached and/or inactive). If these pages are mixed together
indiscriminately, pinned and active pages end up scattered everywhere,
without any contiguous runs of free (or free-able) pages that can be
converted into hotly pursued large pages. Fragmentation needs to be
both prevented and repaired - without hurting performance by moving
around pages too much.
Pinned pages are the most difficult problem, since once allocated they
cannot be moved and may never be freed. The system tries to allocate
these pages in clusters, so they break up as few potential large pages
as possible. Similarly, cache pages are allocated in clusters with
free pages, since cached pages can be easily freed to allow the
creation of a large page. Reservations can include cache pages, and
cached pages contained inside a reservation continue to be active
until the application actually needs to kick that page out.
The page replacement daemon was changed to run not only when free
memory runs low, but also when contiguity runs low. An "innocent
until proven guilty" algorithm works here - we assume we don't need
more contiguity until a large page reservation fails for lack of
contiguity. When woken for this reason, the daemon runs just long
enough to recover enough contiguous space to satisfy the allocations
that failed. The page aging algorithm was changed slightly from the
FreeBSD default; cached pages for a file are marked inactive on the
last close, trading off the chance of the file being reopened against
the opportunity for more contiguity.
Evaluating the System
The authors tested their system against a truly startling variety of
applications, everything from gzip to web server trace replays to fast
Fourier transforms, as well as a section exploring worst case
situations. Personally, I'm not sure I've ever seen a better
evaluation in a research paper; it's quite a treat to read.
In the best case, with low fragmentation, 33 out of 35 applications
showed some improvement (one was unchanged, and the other was about 2%
slower). Several had significant improvements. For example, rotating
an image using ImageMagick was about 20% faster; linking the FreeBSD
kernel was about 30% faster; bzip2 was 14% faster. In the fragmented
case, performance was not as good, but usually to picked up again
after a few runs as the page replacement daemon moved things around.
In the worst-case department, the performance was degraded by about 9%
for an application that only touched one byte per large page before
freeing it, and by about 2% for a test case in which large page
promotion was turned off. It makes for a pretty convincing case that
large pages are an overall win for many systems.
Implications for Linux
What does this paper tell us? It is possible to implement transparent
large page support in such a way that most applications get at least
some benefit, and some applications get a lot of benefit. The
algorithms used are relatively simply to understand and implement, and
hold up well in worst case behavior. Finally, transparent large pages
can be implemented elegantly and cleanly - only 3500 lines of code!
Best of all, this paper includes a plethora of implementation details
and smart algorithms, just begging to be reused. All of the above
earns this paper a hallowed place on the Kernel Hacker's Bookshelf.
Over the past few years, several Linux developers have been working on
various forms of transparent large page support. Some of that recent
work, spearheaded by Mel Gorman, has been reviewed earlier in LWN:
Current work on large pages in Linux is summarized on the
linux-mm wiki.
I look forward to more work in this fascinating and fertile area of
operating systems implementation.
[Do you have a favorite textbook or systems paper? Of course you do.
Send your suggestions to:
val dot henson at gmail dot com
Valerie Henson is a Linux kernel
developer working for Intel. Her interests include file systems,
networking, women in computing, and walking up and down large
mountains. She is always looking for good systems programmers, so
send her some email and introduce yourself.]
Comments (19 posted)
Driver core finally changing
Back in November of last year, I wrote
a list of the steps that were
going to happen for the future of the kernel driver core. Finally,
some of the steps that were described there have been implemented.
Making struct class_device go away
In the -mm kernel tree, there is a small patch that allows almost all
users within the kernel of the
struct class_device structure to
convert over to use a
struct device structure instead. This
patch changes the
struct device structure by adding the
following fields:
struct device_attribute *devt_attr;
struct list_head node;
struct class *class;
dev_t devt;
The first two fields, devt_attr and node are used
internally by the driver core code, and should not be touched by
anything else. The other two fields class and devt
are what is used by any code wishing to convert to the struct
device structure.
If the field class is set by someone, before the struct
device is registered, the driver core assumes that this struct
device is associated with the specified struct class.
This means that the device is added to the list of all devices attached
to that class, and a symlink is created in the class's directory in
sysfs, showing that it is present.
If the field devt is set, then a file named dev is
created in the sysfs directory for the device, containing the major and
minor number of the device. This is what programs like udev
use in order to properly set up the /dev tree dynamically
depending on what devices are present in the system.
As an example of what the sysfs changes are when these fields are set,
look at the usb_device class code that has been converted to use
this new interface in the latest -mm release.
The /sys/class/usb_device directory in the 2.6.17 kernel release
looked something like this for most systems:
$ tree /sys/class/usb_device/
/sys/class/usb_device/
|-- usbdev1.1
| |-- dev
| |-- device -> ../../../devices/pci0000:00/0000:00:1d.7/usb1
| `-- uevent
|-- usbdev2.1
| |-- dev
| |-- device -> ../../../devices/pci0000:00/0000:00:1d.0/usb2
| `-- uevent
|-- usbdev3.1
| |-- dev
| |-- device -> ../../../devices/pci0000:00/0000:00:1d.1/usb3
| `-- uevent
|-- usbdev4.1
| |-- dev
| |-- device -> ../../../devices/pci0000:00/0000:00:1d.2/usb4
| `-- uevent
`-- usbdev4.3
|-- dev
|-- device -> ../../../devices/pci0000:00/0000:00:1d.2/usb4/4-1
`-- uevent
But now, converted over to use the
struct device structure
instead of
struct class_device, it looks like:
/sys/class/usb_device/
|-- usbdev1.1 -> ../../../devices/pci0000:00/0000:00:1d.7/usb1/usbdev1.1
|-- usbdev2.1 -> ../../../devices/pci0000:00/0000:00:1d.0/usb2/usbdev2.1
|-- usbdev3.1 -> ../../../devices/pci0000:00/0000:00:1d.1/usb3/usbdev3.1
|-- usbdev4.1 -> ../../../devices/pci0000:00/0000:00:1d.2/usb4/usbdev4.1
`-- usbdev4.3 -> ../../../devices/pci0000:00/0000:00:1d.2/usb4/4-1/usbdev4.3
What this has accomplished is to move the USB device structure that used
to be sitting out in the class directory, into the device tree itself in
sysfs, providing a unified device tree, without needing to look in two
different locations in sysfs to find the information.
Helper functions
In order to make the transition to converting existing kernel code to
use the
struct device structure instead of the
struct
class_device structure, two new functions have been introduced into
the driver core:
struct device *device_create(struct class *cls, struct device *parent,
dev_t devt, char *fmt, ...)
__attribute__((format(printf,4,5)));
void device_destroy(struct class *cls, dev_t devt);
The function device_create, works almost identically to the
existing kernel function, class_device_create. It dynamically
creates a struct device structure, with all of the specified
information, and registers it with the driver core and sysfs.
The other new function, device_destroy, is used to remove any
struct device structures that were created with a call to
device_create earlier. It is almost identical to the existing
function, class_device_destroy
An example of how simple it is to convert existing code can be seen in
the patch that does the conversion for the usb_device class
code.
Slowly over time, all users of struct class_device will be
converted over to using struct device and then, struct
class_device will be removed from the kernel. And hopefully, the
other tasks outlined in that
original article
, will also get accomplished.
Comments (none posted)
More sysfs symlinks
Another new change in sysfs that will be in the 2.6.18 kernel release, is
the addition of another symlink in all device and class device directories.
Kay Sievers has written a patch that adds the symlink "
subsystem"
to these directories. This symlink points back to either the class that
the device is associated with, or the bus that the device is associated
with.
This symlink is identical to the information that the kernel has always
been emitting to userspace through the hotplug interface whenever a device
was created or removed from the system. Userspace uses the subsystem
information in order to determine what to do with the device.
If you look at the older hotplug package, it is broken down into a set of
different scripts that run depending on the subsystem that is being
addressed:
$ ls /etc/hotplug/*.rc
/etc/hotplug/input.rc
/etc/hotplug/isapnp.rc
/etc/hotplug/pci.rc
/etc/hotplug/pnp.rc
/etc/hotplug/usb.rc
And udev rules also act on the subsystem type in order to determine what to
do with the device:
$ head -n 3 /etc/udev/rules.d/05-udev-early.rules
# ignore these events until someone needs them
SUBSYSTEM=="drivers", OPTIONS="ignore_device"
SUBSYSTEM=="module", OPTIONS="ignore_device"
But before this kernel patch, if a program wanted to walk through sysfs and
try to determine the subsystem that a specific device was associated with,
they had to do the following steps:
- If this is a device, look for the bus symlink and follow
it.
- If this is a class device, go up a directory and see if this is a class
directory. If not, go up another directory, until the class is found.
Now, with the subsystem symlink, this logic can be greatly simplified, as
only this symlink needs to be followed in order to determine the subsystem
that the device is associated with:
$ tree /sys/class/tty/ttyS0/
/sys/class/tty/ttyS0/
|-- dev
|-- device -> ../../../devices/platform/serial8250
|-- subsystem -> ../../../class/tty
`-- uevent
$ tree /sys/devices/pci0000:00/0000:00:00.0/
/sys/devices/pci0000:00/0000:00:00.0/
|-- broken_parity_status
|-- bus -> ../../../bus/pci
|-- class
|-- config
|-- device
|-- driver -> ../../../bus/pci/drivers/e752x_edac
|-- enable
|-- irq
|-- local_cpus
|-- modalias
|-- power
| |-- state
| `-- wakeup
|-- resource
|-- subsystem -> ../../../bus/pci
|-- subsystem_device
|-- subsystem_vendor
|-- uevent
`-- vendor
Comments (3 posted)
Detecting kernel memory leaks
The implementation language for the Linux kernel is C. That choice makes a
great deal of sense; C does a good job of staying out of the way and
letting programmers control exactly what is happening. Anybody who does
any significant amount of C programming, however, eventually ends up
chasing down memory leaks. Since C forces programmers to track every block
of allocated memory and clean up their own messes, things occasionally slip
through the cracks. Memory leaks can be a problem in applications,
especially those which run for a long time - ask any Firefox user. But
kernel memory leaks are worse; every time the kernel drops a piece of
memory, it is gone until the next boot. A system with a serious kernel
memory leak will quickly become unusable.
Tracking down memory leaks can be painful work. When a proprietary memory
allocation tracking tool became available for SunOS many years ago, your
editor had no qualms about spending thousands of his employer's dollars to
license it; the payback time was quite short. In current times, Linux
users can employ a free tool like valgrind
(version 3.2.0 was released
on June 8) to track down user-space memory leaks. But valgrind does
not work on a running kernel. (Some work has been done on running
User-mode Linux under valgrind, but sometimes one simply has to debug the
host system).
As the kernel developers rely more heavily on automated tools for finding
bugs, the creation of a kernel memory leak detector is an obvious next
step. Catalin Marinas has taken that step with a kernel memory leak detector
patch series. This code, if accepted into the kernel, should help to
eliminate another big class of errors.
Catalin's patch functions much like a scan-and-mark garbage collector. The
first step is to track every memory allocation in the system; to that end,
the patch instruments the slab allocator. Every block allocated from a
slab (which will include allocations from kmalloc()) is stored in
a radix tree; along with a
pointer to the block, the stored information includes the block size and a
stack trace identifying where the block was allocated. When blocks are
freed, their corresponding entries are removed from the radix tree.
During normal system operation, this radix tree just sits there. Should
somebody ask about memory leaks (by reading
/sys/kernel/debug/memleak), the detection algorithm swings into
action. The steps performed are:
- A big list is created holding every outstanding memory allocation in
the system. This list is called the "white" list; everything on it is
considered to be a possible memory leak.
- Various parts of memory are scanned for pointers which match the
allocated blocks; every time such a pointer is found, the block is
moved to the "gray" list of memory which is still reachable, and thus
not leaked. The initial scan includes the kernel's static data areas,
each process's kernel stack, and each processor's per-CPU variable
data area.
- The first scan finds all memory referenced directly from static
memory, but kernel data structures are more complicated than that.
So, each block which has been put onto the gray list is scanned as
well. Most of these blocks will be structures allocated from a slab
cache, and they may contain pointers to other structures. So each
block is queried, paying attention to that block's remembered size.
Any pointers found within the block are moved over to the gray list,
and scanned in turn.
There is, of course, a provision for remembering which blocks have
been scanned and avoiding infinite loops.
- Once all pointers on the gray list have been scanned, every block of
memory reachable by the kernel has been located. Anything remaining
on the white list is considered to be leaked, and the relevant
information is sent back to user space.
In the real world, things get complicated, so the leak detector is not
quite as simple as described above. One situation which had to be
addressed is cases where the kernel keeps a pointer to the interior of a
block of memory, rather than to the beginning. This happens frequently;
many kernel structures are located by way of an embedded list_head
structure or kobject, for example. As a way of locating these blocks, the
memory leak detector records uses of the container_of() macro; in
particular, it remembers the size of the block and the offset to the
embedded structure. When a block of a given size is allocated, the
detector records "alias" addresses for any possible embedded structures. A
pointer to one of those aliases is considered to be equivalent to a pointer
to the beginning of the block.
There are various other special cases which must be handled. For example,
memory obtained from vmalloc() will be pointed to by the memory
allocation code itself, but might still be leaked. In other cases, memory
is allocated which cannot be found by the scanning algorithm; a number of
special annotations are added to the kernel to suppress the resulting false
positive reports. The detector can also be fooled by pointers which are
left behind in disused memory, or by random data which happens to look like
a pointer to an allocated block; in these cases, false-negatives will
result.
Even with these problems, the situation is better than before - a lot of
memory leak situations can be found. Ingo Molnar, however, has a vision of a more ambitious scheme wherein
type information for every allocated block would be retained. Among other
things, this information would allow the scanning to be restricted to parts
of the block known to contain pointers; that should speed the process and
reduce false negatives. Since type information is available, each scanned
pointer could be checked to ensure that it points to a block of the correct
type, adding another level of checking to the kernel. Implementing all of
this looks like a big task, however; even Ingo may need a couple of days to
get it done.
Comments (12 posted)
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
- Jes Sorensen: mspec.
(June 19, 2006)
Networking
Architecture-specific
Security-related
Miscellaneous
Page editor: Forrest Cook
Next page: Distributions>>