The current development kernel is 2.6.33-rc8, released
on February 12.
I think this is going to be the last -rc of the series, so please do
test it out. A number of regressions should be fixed, and while the
regression list doesn't make me _happy_, we didn't have the kind of nasty
things that went on before -rc7 and made me worried.
According to the latest regression report, the number of
unresolved regressions has risen to 31, the highest point yet in this
development cycle.
There _are_ things we can do though. Detect a write to the old
file and emit a WARN_ON_ONCE("you suck"). Wait a year, turn it
into WARN_ON("you really suck"). Wait a year, then remove it.
-- Feature deprecation Andrew Morton
The post-Google standard company perks - free food, on-site
exercise classes, company shuttles - make it trivial to speak only
to fellow employees in daily life. If you spend all day with your
co-workers, socialize only with your co-workers, and then come home
and eat dinner with - you guessed it - your co-worker, you might go
several years without hearing the words, "Run Solaris on my
desktop? Are you f-ing kidding me?"
Everybody takes it for granted to run megabytes of
proprietary object code, without any memory protection, attached to
an insecure public network (GSM). Who would do that with his PC on
the Internet, without a packet filter, application level gateways
and a constant flow of security updates of the software? Yet
billions of people do that with their phones all the time.
The kernel.org repository depends heavily on compression to keep its
storage and bandwidth expenses down. An uncompressed tarball for the
2.6.32 release weighs in at 365MB; if downloaders grabbed the data in this
format, the resulting bandwidth usage would be huge. So kernel.org does
not make uncompressed tarballs available; instead, one can choose between
versions compressed with gzip (79MB) or bzip2 (62MB). Bzip2 is the newer
choice; it took a while to catch on because the needed tools were not
widely shipped. Now, though, the folks at kernel.org are considering
making a change in the compression formats used there.
What's driving this discussion is the availability of the XZ tool, which is based on the LZMA
compression algorithm. XZ offers better compression performance -
53MB on that 2.6.32 tarball - but it suffers from a familiar problem:
the tools are not yet widely available in distributions, especially those
of the "enterprise" variety. This has led to pushback against the idea of
standardizing on XZ in the near future, as can be seen in this comment from Ted Ts'o:
Keep in mind that there are people where who are still using RHEL
3, and some of them might want to download from ftp.kernel.org. So
those people who are suggesting that we replace .gz files with .xz
on kernel.org are *really* smoking something good.
In fact, there is little pressure to replace the gzip format anytime in the
near future. Its compression performance may not be the best, but it does
have the advantage of being far faster than any of the alternatives. From
the discussion, it is fairly clear that some users care about decompression
time. What is more likely is that XZ might eventually displace files in
the bzip2 format. Then there would be a clear choice: speed and widespread
availability or the best available compression. Even that change, though,
is likely to be at least a year away; in the meantime, kernel.org will probably
carry files in all three formats.
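The size-versus-speed tradeoff described above can be explored on a small scale with the compression modules in the Python standard library. The following is only a rough sketch: the sample data is a made-up stand-in for kernel source, and the numbers it prints will not match the tarball sizes quoted in the text.

```python
import bz2
import gzip
import lzma
import time


def compare_formats(data: bytes) -> dict:
    """Compress data with gzip, bzip2 and xz; report size and timings."""
    codecs = {
        "gzip": (gzip.compress, gzip.decompress),
        "bzip2": (bz2.compress, bz2.decompress),
        "xz": (lzma.compress, lzma.decompress),  # lzma defaults to the .xz format
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        packed_at = time.perf_counter()
        restored = decompress(packed)
        done = time.perf_counter()
        # Round-trip check: every format must be lossless.
        assert restored == data
        results[name] = {
            "bytes": len(packed),
            "compress_s": packed_at - start,
            "decompress_s": done - packed_at,
        }
    return results


if __name__ == "__main__":
    # Hypothetical stand-in for source code: repetitive C-like text.
    sample = (b"static int example_function(struct device *dev)\n"
              b"{\n\treturn 0;\n}\n") * 20000
    for name, stats in compare_formats(sample).items():
        print(f"{name:6s} {stats['bytes']:8d} bytes  "
              f"compress {stats['compress_s']:.3f}s  "
              f"decompress {stats['decompress_s']:.3f}s")
```

Running this against a real tree (for example, the bytes of an uncompressed tarball) would give a better picture of how the three formats compare on a particular machine.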
(This discussion also included a side thread on changing the 2.6.xx
numbering scheme. Once again, though, the expected flame wars failed to
materialize. There just does not seem to be much interest in or energy for
this particular change.)
Linux contains a number of system calls which do complex things; they take
large structures as input, operate on significant internal state, and,
perhaps, return some sort of complicated output data. The normal status
returned from these system calls, however, is compressed down into a single
integer called errno. Application programmers dealing with certain
subsystems (Video4Linux2 being your editor's favorite in this regard) will
all be well familiar with the process of trying to figure out what the
problem is when the kernel says only "it failed."
Andi Kleen describes the problem this way:
I always describe that as a the "ed approach to error
handling". Instead of giving a error message you just give ?. Just
? happens to be EINVAL in Linux.
My favourite example of this is the configuration of the
networking queueing disciplines, which configure complicated data
structures and algorithms and in many cases have tens of different
error conditions based on the input parameters -- and they all
just report EINVAL.
It would be nice to provide application developers with better information
than this. A brief discussion covered some of the options:
- Use printk() to put information into the system logfile.
This approach is widely used, but it bloats the kernel with string
data, risks flooding the logs, and the resulting information may not
be easily accessible to an unprivileged programmer.
- Extend specific system calls to enable them to provide richer status
information. Just adding a new version of ioctl() would
address many of the worst problems.
- Create an errno-like mechanism by which any system call could
return extended information. That information could be an error
string, some sort of special code, or, as Alan Cox suggested, a pointer to the structure
field which caused the problem.
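To make the contrast concrete, here is a small user-space sketch of the last option. It is not kernel code and not a proposed interface; QueueConfig, ExtendedStatus, and both configure functions are hypothetical names invented for illustration. The first function behaves the way the text describes current interfaces, with every failure collapsing into EINVAL; the second also names the offending field.

```python
import errno
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueueConfig:
    """Hypothetical parameter block, loosely modelled on a queueing setup."""
    limit: int
    quantum: int
    burst: int


def configure_errno_only(cfg: QueueConfig) -> int:
    """Classic style: three distinct problems, one indistinguishable answer."""
    if cfg.limit <= 0:
        return -errno.EINVAL
    if cfg.quantum <= 0:
        return -errno.EINVAL
    if cfg.burst < cfg.quantum:
        return -errno.EINVAL
    return 0


@dataclass
class ExtendedStatus:
    """Hypothetical richer status: the errno value plus the field at fault."""
    code: int
    bad_field: Optional[str] = None


def configure_extended(cfg: QueueConfig) -> ExtendedStatus:
    """Same checks, but the caller learns which field was rejected."""
    if cfg.limit <= 0:
        return ExtendedStatus(-errno.EINVAL, "limit")
    if cfg.quantum <= 0:
        return ExtendedStatus(-errno.EINVAL, "quantum")
    if cfg.burst < cfg.quantum:
        return ExtendedStatus(-errno.EINVAL, "burst")
    return ExtendedStatus(0)


if __name__ == "__main__":
    bad = QueueConfig(limit=100, quantum=8, burst=4)
    print("errno only:", os.strerror(-configure_errno_only(bad)))
    status = configure_extended(bad)
    print("extended:  ", os.strerror(-status.code), "in field", status.bad_field)
```

The extended variant costs one extra field in the return path, but it is exactly the sort of interface that would have to be supported forever once applications came to depend on it.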
One could certainly argue that the narrow errno mechanism is
showing its age and could use an upgrade. Any enhancements, though, would
be Linux-specific and non-POSIX, which always tends to limit their uptake.
They would also have to be lived with forever, and, thus, would require
careful design. So we're unlikely to see a solution in the mainline
anytime soon, even if somebody does take up the challenge.
Kernel development news
It was something of a surprise when Linus Torvalds merged kgdb—a stub to talk
to the gdb debugger—back in the 2.6.26 merge window, because of his
well-known disdain for kernel
debuggers. But there is another kernel debugging solution
that has long been out of the mainline: kdb. Jason Wessel has proposed merging the two
solutions by reworking kgdb to use the "kdb shell" underneath, which would
lead to both solutions being available for kernel hackers.
The two debuggers serve different purposes, with kdb having much less
functionality, but they both have uses. Kgdb allows source-level debugging
using gdb over a serial line, but that requires a separate system. For
systems where it is painful or impractical to set up a serial connection,
kdb may provide enough capability to debug a problem. In addition, things
like kernel modesetting (KMS) allow for additional features that kdb has
lacked. Wessel described one possibility:
A 2010 example of where kdb can be useful over kgdb is where you have a
small netbook, no serial ports etc... and you are running X and your
file system driver crashes the kernel. With kdb plus kms you can get an
opportunity to see the crash which would have otherwise been lost from
/var/log/messages because the crash was in the file system driver.
While kgdb allows access to all of the standard debugging commands that
gdb provides, kdb has a much more limited command set.
One can examine and change memory locations or registers, set
breakpoints, and get a backtrace of the stack, but those commands typically
require using addresses, rather than symbolic names. Currently, the best
reference for kdb commands comes from a developerWorks
article, though Wessel plans to change that. There is some documentation
that comes with the patches, but a command reference will
depend on exactly which pieces, if any, actually land in the mainline.
It should be noted that one of the capabilities that was removed from kdb
as part of the merger is the disassembler. It was x86 specific, and the
new code is "99% platform independent", according to the FAQ about the
merged code. Because kgdb is implemented for many architectures, rewriting
it atop kdb led to support for many more architectures for kdb. Instead of
just the x86 family, kdb now supports arm, blackfin, mips, sh, powerpc, and
In addition, kgdb and kdb can work together. From a running kgdb session,
one can use the gdb monitor command to access kdb commands. There
are several that might be helpful like
ps for a process list or dmesg to see log output.
The FAQ lists
a number of other advantages that would come from the merge,
beyond just getting kdb into the mainline so that its users no longer have to
patch their kernels. The basic
idea behind the advantages listed is to unite the users and developers of
kgdb and kdb so
that they are all pulling in the same direction, because "both kdb
and kgdb have similar needs in terms of how they integrate into the
kernel." There have been arguments in the past about which of the
two solutions is best, but, since they serve different use cases, having
both available would have another benefit: "No longer will people
have to debate which is better, kdb or kgdb, why do we have only
one... Just go use the best tool for the job."
Wessel notes that Ubuntu has enabled kgdb in recent kernels, which is
something he would like to see done by other distributions. If kdb is
available, that too could be enabled, which would make it easier for users
to access the functionality:
My other hope is that the new kdb is much easier to use in the sense
that the barrier of entry is much lower. For example, someone with a
laptop running a kernel with a kdb enabled kernel can use it as easily as:
echo kms,kbd > /sys/module/kgdboc/parameters/kgdboc
echo g > /proc/sysrq-trigger
And voila you just ran the kernel debugger.
In the example above, Wessel shows how to enable kdb (for keyboard (kbd)
and KMS operation), then trap into it
using sysrq-g (once enabled, kdb will also be invoked if there is a panic or
oops). The following three commands are kdb commands for looking at log
output, getting a stack backtrace, and continuing execution.
The patches themselves are broken up into three separate patchsets: the
first and largest adds the kdb infrastructure into kernel/debug/ and moves
kgdb.c into that directory, the second adds KMS support
along with an experimental patch to do atomic modesetting for the i915
graphics driver, and the third allows kernel debugging
via kdb or kgdb early in the boot process, starting from the point where
earlyprintk() is available.
Wessel is targeting 2.6.34 and, at least so far, the patches have been well
received. The most recent posting is version 3 of the patchset, with a
long list of changes made in response to earlier comments. Furthermore, an
RFC about the
idea last May gained a fair number of comments that clearly indicated there
was interest in kdb and merging it with the kgdb code.
Sharp-eyed readers will note some similarities between this proposal and the
recent utrace push. In both
cases, an existing debugging facility was rewritten using a new core, but
there are differences as well. Unlike utrace, the kdb/kgdb patches
directly provide some lacking user-space functionality. Whether that is
enough to overcome Torvalds's semi-hostile attitude towards kernel
debuggers—though the inclusion of kgdb would seem to indicate some
amount of softening—remains to be seen.
April 2005 was a bit of a tense time in the kernel development community.
The BitKeeper tool which had done so much to improve the development
process had suddenly become unavailable, and it wasn't clear what would
replace it. Then Linus appeared with a new system called git; the current
epoch of kernel development can arguably be dated from then. The opening
event of that epoch was commit 1da177e4, the changelog of which reads:
Initial git repository build. I'm not bothering with the full
history, even though we have it. We can create a separate
"historical" git archive of that later if we want to, and in the
meantime it's about 3.2GB when imported into git - space that would
just make the early git days unnecessarily complicated, when we
don't have a lot of good infrastructure for it.
Let it rip!
The community did, indeed, let it rip; some 180,000 changesets have been
added to the repository since then. Typically hundreds of thousands of
lines of code are changed with each three-month development cycle. A while
back, your editor began to wonder how much of the kernel had actually been
changed, and how much of our 2.6.33-to-be kernel dates back to 2.6.12-rc2,
which was tagged at the opening of the git era? Was there anything left of
the kernel we were building in early 2005?
Answering this question is a simple matter of bashing out some ugly scripts
and dedicating many hours of processing time. In essence, the
"git blame" command can be used to generate an annotated
version of a file which lists the last commit to change each line of code.
Those commit IDs can be summed, then associated with major version
releases. At the end of the process, one has a simple table showing the
percentage of the current kernel code base which was created for each major
release since 2.6.12. Here's what it looks like:
In summary: just over 31% of the kernel tree dates
back to 2.6.12, and has not been
modified since then. Our kernel may be changing quickly, but parts of it
have not changed at all for nearly five years. Since then, we see a steady
stream of changes, with more recent kernels being more strongly represented
than the older ones. That curve will partly be a result of the general
increase in the rate of change over time; 2.6.13 had fewer than 4,000
commits, while 2.6.33 will have almost 11,000. Still, one has to wonder
what happened with 2.6.20 (5,000 commits) to cause that
release to represent less than 2% of the total code base.
Much of the really old material is interspersed with newer lines in many
files; comments and copyright notices, in particular, can go unchanged for
a very long time. The 2.6.12 top-level makefile set VERSION=2 and
PATCHLEVEL=6, and those lines have not changed since; the next
line (SUBLEVEL=33) was changed in December.
There are interesting conclusions to be found at the upper end of the graph
as well. Using this yardstick, 2.6.33 is the smallest development cycle we
have seen in the last year, even though this cycle will have replaced some
code added during the previous cycles. 4.2% of the code in 2.6.33 was
last touched in the 2.6.33 cycle, while each of the previous four kernels
(2.6.29 through 2.6.32) still represents more than 5.5% of the code to be
shipped in 2.6.33.
Another interesting exercise is to look for entire files which have not
been touched in five years. Given the amount of general churn and API
change which has happened over that time, files which have not changed at
all have a good chance of being entirely unused. Here is a full
list of files which are untouched since 2.6.12 - all 1062 of them.
- Every kernel tarball carries around drivers/char/ChangeLog, which is
mostly dedicated to documenting the mid-90's TTY exploits of Ted
Ts'o. There is only one change since 1998, and that was in 2001.
Files like this may be interesting from a historical point of view,
but they have little relevance to current kernels.
- Unsurprisingly, the documentation directory contains a great deal of
material which has not been updated in a long time. Much of it need
not change; the means by which one configures an ISA Sound Blaster
card is pretty much as it always was - assuming one can find such a
card and an ISA bus to plug it into. Similarly, Klingon language
support (Documentation/unicode.txt), Netwinder support, and such have
not seen much development activity recently, so the documentation can
be deemed to be current, if not particularly useful. All told,
41% of the documentation directory dates back to 2.6.12. There was a
big surge of
documentation work in 2.6.32; without that, a larger percentage of
this subtree would look quite old.
- Some old interfaces haven't changed in a long time, resulting in a lot
of static files in include/.
<linux/sort.h> declares sort(), which is used
in a number of places. <include/fcdevice.h> declares
alloc_fcdev(), and includes a warning that "This file
will get merged with others RSN." Much of the sunrpc interface
has remained static for a long time as well.
- Ancient code abounds in the driver tree, though, perhaps
unsurprisingly, old header files are much more common than old C
files. The ISDN driver tree has been quite static.
- Much of sound/oss has not been touched for a long time
and must be nicely filled with cobwebs by now; there hasn't been much
of a reason to touch the OSS code for some time.
- net/decnet/TODO contains a "quick list of things that need
finishing off"; it, too, hasn't been changed in the git era. One
wonders how the DECnet hackers are doing on that list...
So which subsystem is the oldest? This plot looks at the kernel subsystems
(as defined by top-level directories) and gives the percentage of 2.6.12
code in each:
The youngest subsystem, unsurprisingly, is tools/, which did not
exist prior to 2.6.29. Among subsystems which did exist in 2.6.12,
the core kernel, with about 13% code dating from that release, is the newest.
At the other end, the sound subsystem is more than
45% 2.6.12 code - the highest in the kernel. For those who are curious about
the age distribution in specific subsystems, this page contains a chart for each.
In summary: even in a code base which is evolving as rapidly as the kernel,
there is a lot of code which has not been touched - even by coding style or
white space fixes - in the last five years. Code stays around for a long
time.
(For those who would like to play with this kind of data, the scripts used
have been folded into the gitdm repository at git://git.lwn.net/gitdm.git).
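For readers who want a feel for the tallying step without fetching those scripts, here is a minimal sketch. It is not the gitdm code; the function names are made up for this example. It relies only on the fact that "git blame --line-porcelain" emits, for every line of the file, a header starting with the 40-character commit ID, followed by the line's content prefixed with a tab.

```python
import re
import subprocess
from collections import Counter

# A blame header looks like: "<40 hex digits> <orig line> <final line> [<count>]"
HEADER = re.compile(r"^([0-9a-f]{40}) \d+ \d+")


def count_lines_per_commit(porcelain: str) -> Counter:
    """Tally how many lines each commit is blamed for in porcelain output."""
    counts = Counter()
    current = None
    for line in porcelain.split("\n"):
        if line.startswith("\t"):
            # Content line: charge it to the most recently seen commit.
            if current is not None:
                counts[current] += 1
        else:
            match = HEADER.match(line)
            if match:
                current = match.group(1)
    return counts


def blame_file(path: str, repo: str = ".") -> Counter:
    """Run git blame on one file and tally lines per commit."""
    result = subprocess.run(
        ["git", "-C", repo, "blame", "--line-porcelain", path],
        check=True, capture_output=True, text=True, errors="replace",
    )
    return count_lines_per_commit(result.stdout)


if __name__ == "__main__":
    # Assumes the current directory is a git checkout containing a Makefile.
    for commit, lines in blame_file("Makefile").most_common(5):
        print(commit[:12], lines)
```

Associating each commit with a release is a separate step; one plausible approach is "git describe --contains", which names the first tag containing a given commit, though it is slow when run across an entire tree.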
Note: this article has been edited to fix an error which overstated
the amount of 2.6.12 code remaining in the full kernel.
[Editor's note: this article is the first in a five-part series on the
use of huge pages with Linux. We are most fortunate to have core VM hacker
Mel Gorman as the author of these articles! The remaining installments
will appear in future LWN Weekly Editions.]
One of the driving forces behind the development of Virtual
Memory (VM) was to reduce the programming burden associated with fitting
programs into limited memory. A fundamental property of VM is that the CPU
references a virtual address that is translated via a combination
of software and hardware to a physical address. This allows
information to be paged into memory only on demand (demand
paging), improving memory utilisation; allows modules to be arbitrarily
placed in memory for linking at run-time; and provides a mechanism for
the protection and controlled sharing of data between processes. Use of
virtual memory is so pervasive that it has been described as one of
the engineering triumphs of the computer age [denning96], but this
indirection is not without cost.
Typically, the total number of translations required by a program
during its lifetime will require that the page tables are stored in
main memory. Due to translation, a virtual memory reference necessitates
multiple accesses to physical memory, multiplying the cost of an ordinary
memory reference by a factor depending on the page table format. To cut
the costs associated with translation, VM implementations take advantage of
the principle of locality [denning71] by storing recent
translations in a cache called the Translation Lookaside Buffer
(TLB) [casep78,smith82,henessny90]. The amount of memory that can
be translated by this cache is referred to as the "TLB reach"
and depends on the size of the page and the number of TLB entries.
Inevitably, a percentage of a program's execution time is spent accessing
the TLB and servicing TLB misses.
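As a quick worked example of TLB reach, the following sketch multiplies the number of entries by the page size. The entry counts and page sizes are illustrative assumptions, not figures for any particular CPU.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of address space the TLB can translate without a miss."""
    return entries * page_size


if __name__ == "__main__":
    # Hypothetical TLB geometry: 64 entries for base pages, 32 for huge pages.
    base = tlb_reach(64, 4 * KIB)
    huge = tlb_reach(32, 2 * MIB)
    print(f"base pages: {base // KIB} KiB of reach")
    print(f"huge pages: {huge // MIB} MiB of reach")
```

Even with half as many entries in this made-up configuration, the larger page size extends the reach from kilobytes to megabytes.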
The amount of time spent translating addresses depends on the workload as
the access pattern determines if the TLB reach is sufficient to store all
translations needed by the application. On a miss, the exact cost depends
on whether the information necessary to translate the address is in the CPU
cache or not. To work out the amount of time spent servicing the TLB misses,
there are some simple formulas:
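\begin{align*}
\mathrm{Cycles}_{\mathrm{tlbhit}} &= \mathrm{TLBHitRate} \times \mathrm{TLBHitPenalty} \\
\mathrm{Cycles}_{\mathrm{tlbmiss\_cache}} &= \mathrm{TLBMissRate}_{\mathrm{cache}} \times \mathrm{TLBMissPenalty}_{\mathrm{cache}} \\
\mathrm{Cycles}_{\mathrm{tlbmiss\_full}} &= \mathrm{TLBMissRate}_{\mathrm{full}} \times \mathrm{TLBMissPenalty}_{\mathrm{full}} \\
\mathrm{TLBMissCycles} &= \mathrm{Cycles}_{\mathrm{tlbmiss\_cache}} + \mathrm{Cycles}_{\mathrm{tlbmiss\_full}} \\
\mathrm{TLBMissTime} &= \frac{\mathrm{TLBMissCycles}}{\mathrm{ClockRate}}
\end{align*}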
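A small worked example may help with the miss-time formulas above. This sketch treats the two "rate" terms as miss counts over a measured run, which is an interpretation rather than something the formulas state; all of the input numbers are invented.

```python
def tlb_miss_time(misses_cache: int, penalty_cache: int,
                  misses_full: int, penalty_full: int,
                  clock_hz: int) -> float:
    """Seconds spent servicing TLB misses, following the formulas above.

    misses_cache / penalty_cache: misses whose translation data was found
        in the CPU cache, and the cycles each one cost.
    misses_full / penalty_full: misses that had to go to main memory,
        and the cycles each one cost.
    """
    cycles_cache = misses_cache * penalty_cache
    cycles_full = misses_full * penalty_full
    miss_cycles = cycles_cache + cycles_full
    return miss_cycles / clock_hz


if __name__ == "__main__":
    # Hypothetical run: 1e9 cheap misses at 10 cycles, 1e8 expensive misses
    # at 200 cycles, on a 3GHz clock.
    seconds = tlb_miss_time(1_000_000_000, 10, 100_000_000, 200, 3_000_000_000)
    print(f"time spent on TLB misses: {seconds:.1f}s")
```

Comparing that figure with the total run time of the program indicates whether reducing the miss rate is worth the effort.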
If the TLB miss time is a large percentage of overall program
execution, then the time should be invested to reduce the miss rate and
achieve better performance. One means of achieving this is to translate
addresses in larger units than the base page size, as supported by many
modern processors.
Using more than one page size was identified in the 1990s as one means of
reducing the time spent servicing TLB misses by increasing TLB reach. The
benefits of huge pages are twofold. The obvious performance gain is from
fewer translations requiring fewer cycles. A less obvious benefit is that
address translation information is typically stored in the L2 cache. With
huge pages, more cache space is available for application data, which means that
fewer cycles are spent accessing main memory. Broadly speaking, database
workloads will gain about 2-7% performance using huge pages whereas
scientific workloads can range between 1% and 45%.
Huge pages are not a universal gain, so transparent support for huge pages
is limited in mainstream operating systems. On some TLB implementations,
there may be different numbers of entries for small and huge pages. If the
CPU supports a smaller number of TLB entries for huge pages, it is possible
that huge pages will be slower if the workload reference pattern is very
sparse, making only a small number of references per huge page. There may
also be architectural limitations on where in the virtual address space
huge pages can be used.
Many modern operating systems, including Linux, support huge pages in a
more explicit fashion, although this does not necessarily mandate application
change. Linux has had support for huge pages since around 2003, when it was
mainly used for large shared memory segments in database servers such as
Oracle and DB2. Early support required application modification, which was
considered by some to be a major problem. To compound the difficulties,
tuning a Linux system to use huge pages was perceived to be a difficult
task. There have been significant improvements made over the years to huge
page support in Linux and as this article will show, using huge pages today
can be a relatively painless exercise that involves no source modification.
This first article begins by installing some huge-page-related utilities
and support libraries that make tuning and using huge pages a relatively
painless exercise. It then covers the basics of how huge pages behave under
Linux and some details of concern on NUMA. The second article covers the
different interfaces to huge pages that exist in Linux. In the third article,
the different considerations to make when tuning the system are examined
as well as how to monitor huge-page-related activities in the system. The
fourth article shows how easily benchmarks for different types of application
can use huge pages without source modification. For the very curious, some
in-depth details on TLBs and measuring the cost within an application are
discussed before concluding.
1 Huge Page Utilities and Support Libraries
There are a number of support utilities and a
library packaged collectively as libhugetlbfs. Distributions
have packages, but this article assumes that
libhugetlbfs 2.7 is installed. The latest version can always be cloned
from git using the following instructions:
$ git clone git://libhugetlbfs.git.sourceforge.net/gitroot/libhugetlbfs/libhugetlbfs
$ cd libhugetlbfs
$ git checkout -b next origin/next
$ make PREFIX=/usr/local
There is an install target that installs the library
and all support utilities but there are install-bin,
install-stat and install-man targets available
in the event the existing library should be preserved during installation.
The library provides support for automatically backing text, data,
heap and shared memory segments with huge pages. In addition,
this package also provides a programming API and manual pages. The
behaviour of the library is controlled by environment variables
(as described in the libhugetlbfs.7 manual page) with
a launcher utility hugectl that knows how to configure
almost all of the variables. hugeadm, hugeedit
and pagesize provide information about the system and provide
support to system administration. tlbmiss_cost.sh automatically
calculates the average cost of a TLB miss. cpupcstat and
oprofile_start.sh provide help with monitoring the current
behaviour of the system. Manual pages describing each utility in further
detail are available.
2 Huge Page Fault Behaviour
In the following articles, there will be discussions on how different types
of memory regions can be created and backed with huge pages. One important
common point between them all is how huge pages are faulted and when the
huge pages are allocated. Further, there are important differences between
shared and private mappings depending on the exact kernel version used.
In the initial support for huge pages on Linux, huge pages were faulted at the
same time as mmap() was called. This guaranteed that all references
would succeed for shared mappings once mmap() returned successfully.
Private mappings were safe until fork() was called. Once called,
it was important that the child call exec() as soon as possible
or that the huge page mappings be marked MADV_DONTFORK
with madvise() in advance. Otherwise, a Copy-On-Write
(COW) fault could result in application failure by either parent or
child in the event of allocation failure.
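The MADV_DONTFORK step can be sketched from Python, whose mmap objects expose madvise() on Linux (Python 3.8 and later). To stay runnable on systems with no huge pages configured, this example uses an ordinary anonymous private mapping as a stand-in; the function name is made up, and a real huge page mapping would be created differently.

```python
import mmap


def map_dontfork(length: int = mmap.PAGESIZE):
    """Create a private anonymous mapping and ask that fork() not inherit it.

    Returns the mapping and a flag saying whether the advice was applied;
    the flag is False where MADV_DONTFORK or madvise() is unavailable.
    """
    region = mmap.mmap(-1, length,
                       flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    advice = getattr(mmap, "MADV_DONTFORK", None)
    applied = False
    if advice is not None and hasattr(region, "madvise"):
        # After this call a forked child does not see the region at all,
        # so neither process can take a copy-on-write fault on it.
        region.madvise(advice)
        applied = True
    return region, applied


if __name__ == "__main__":
    region, applied = map_dontfork()
    region[:5] = b"hello"
    print("MADV_DONTFORK applied:", applied)
    region.close()
```

The price of this approach is that the child cannot use the mapping at all, which is why it only suits children that call exec() or otherwise have no need for the data.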
Pre-faulting pages drastically increases the cost of mmap() and can
perform sub-optimally on NUMA. Since 2.6.18, huge pages have been faulted the
same as normal mappings, when the page is first referenced. To guarantee
that faults would succeed, huge pages were reserved at the time the shared
mapping was created, but private mappings did not make any reservations. This
is unfortunate as it means an application can fail without fork()
being called. libhugetlbfs handles the private mapping problem
on old kernels by using readv() to make sure the mapping is safe
to access, but this approach is less than ideal.
Since 2.6.29, reservations are made for both shared and private mappings. Shared
mappings are guaranteed to successfully fault regardless of what process accesses
the mapping.
For private mappings, the number of child processes is indeterminable so
only the process that creates the mapping with mmap() is guaranteed to
successfully fault. When that process fork()s, two processes are
now accessing the same pages. If the child performs COW, an attempt will
be made to allocate a new page. If it succeeds, the fault successfully
completes. If the fault fails, the child gets terminated with a message
logged to the kernel log noting that there were insufficient huge pages. If
it is the parent process that performs COW, an attempt will also be made to
allocate a huge page. In the event that allocation fails, the child's pages
are unmapped and the event recorded. The parent successfully completes the
fault but if the child accesses the unmapped page, it will be terminated.
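One way to see how many huge pages are free or already reserved is to read the HugePages_* counters in /proc/meminfo. The article does not cover this file here, so treat the sketch below as a side illustration; the function names are invented, and the parsing assumes the usual "Name: value" layout of that file.

```python
HUGE_FIELDS = ("HugePages_Total", "HugePages_Free",
               "HugePages_Rsvd", "HugePages_Surp")


def parse_hugepage_counters(meminfo_text: str) -> dict:
    """Extract the huge page counters from the text of /proc/meminfo."""
    counters = {}
    for line in meminfo_text.split("\n"):
        name, sep, rest = line.partition(":")
        if sep and name in HUGE_FIELDS:
            counters[name] = int(rest.split()[0])
    return counters


def unreserved_pages(counters: dict) -> int:
    """Free huge pages that no existing mapping has reserved."""
    return counters.get("HugePages_Free", 0) - counters.get("HugePages_Rsvd", 0)


def read_hugepage_counters(path: str = "/proc/meminfo") -> dict:
    """Read and parse the counters from a live system."""
    with open(path) as meminfo:
        return parse_hugepage_counters(meminfo.read())


if __name__ == "__main__":
    counters = read_hugepage_counters()
    print(counters)
    print("unreserved:", unreserved_pages(counters))
```

A copy-on-write fault in a child needs a page from the unreserved pool, so a value of zero there suggests that such a fault would fail in the way described above.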
3 Huge Pages and Swap
There is no support for the paging of huge pages to backing storage.
4 Huge Pages and NUMA
On NUMA, memory can be local or remote to the CPU, with significant
penalty incurred for remote access. By default, Linux uses a node-local
policy for the allocation of memory at page fault time. This policy
applies to both base pages and huge pages. This leads to an important
consideration while implementing a parallel workload.
The thread processing some data should be the same thread that caused the
original page fault for that data. A general anti-pattern on NUMA is when
a parent thread sets up and initialises all the workload's memory areas
and then creates threads to process the data. On a NUMA system this can
result in some of the worker threads being on CPUs remote with respect
to the memory they will access. While this applies to all NUMA systems
regardless of page size, the effect can be pronounced on systems where the
split between worker threads is in the middle of a huge page incurring more
remote accesses than might have otherwise occurred.
This scenario may occur for example when using huge pages with OpenMP,
because OpenMP does not necessarily divide its data on page boundaries.
This could lead to problems when using base pages, but the problem is
more likely with huge pages because a single huge page will cover
more data than a base page, thus making it more likely any given huge
page covers data to be processed by different threads. Consider the
following scenario. A first thread to touch a page will fault the full
page's data into memory local to the CPU on which the thread is running.
When the data is not split on huge-page-aligned boundaries, such a thread
will fault its data and perhaps also some data that is to be processed by
another thread, because the two threads' data are within the range of the
same huge page. The second thread will fault the rest of its data into
local memory, but will still have part of its data accesses be remote.
This problem manifests as large standard deviations in performance when
doing multiple runs of the same workload with the same input data.
Profiling in such a case may show there are more cross-node accesses
with huge pages than with base pages. In extreme circumstances, the
performance with huge pages may even be slower than with base pages.
For this reason it is important to consider on what boundary data is
split when using huge pages on NUMA systems.
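The effect of the split boundary can be shown with a little arithmetic. The sketch below divides a buffer among worker threads in two ways, evenly by bytes and in whole huge pages, and counts how many huge pages end up holding data for more than one thread. The function names are made up, and the buffer is assumed to start on a huge page boundary.

```python
MIB = 1024 * 1024


def split_evenly(total: int, n_threads: int) -> list:
    """Divide [0, total) into n_threads contiguous [start, end) ranges."""
    base, extra = divmod(total, n_threads)
    ranges, start = [], 0
    for i in range(n_threads):
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def split_on_huge_pages(total: int, n_threads: int, huge_page: int) -> list:
    """Hand out whole huge pages so that no page is shared between threads."""
    n_pages = -(-total // huge_page)  # round up to cover a partial last page
    return [(first * huge_page, min(last * huge_page, total))
            for first, last in split_evenly(n_pages, n_threads)]


def shared_pages(ranges: list, huge_page: int) -> int:
    """Count huge pages touched by more than one thread's range."""
    owners = {}
    for thread, (start, end) in enumerate(ranges):
        if end <= start:
            continue
        for page in range(start // huge_page, (end - 1) // huge_page + 1):
            owners.setdefault(page, set()).add(thread)
    return sum(1 for threads in owners.values() if len(threads) > 1)


if __name__ == "__main__":
    total, threads, huge = 10 * MIB, 4, 2 * MIB
    print("even split:   ", shared_pages(split_evenly(total, threads), huge))
    print("aligned split:", shared_pages(split_on_huge_pages(total, threads, huge), huge))
```

The aligned split leaves the threads with uneven amounts of work, so the choice is between some load imbalance and some remote accesses; which matters more depends on the workload.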
One work around for this instance of the general problem is to use
MPI in combination with OpenMP. The use of MPI allows division of the
workload with one MPI process per NUMA node. Each MPI process is bound
to the list of CPUs local to a node. Parallelisation within the node
is achieved using OpenMP, thus alleviating the issue of remote access.
This article introduced the background to huge pages, what the
performance benefits can be, and some basics of how huge pages behave on Linux.
The next article (to appear in the near future) discusses the interfaces
used to access huge pages.
Details of publications referenced in these articles can be found in the bibliography at the end of Part 5.
This material is based upon work supported by the Defense Advanced Research
Projects Agency under its Agreement No. HR0011-07-9-0002. Any opinions,
findings and conclusions or recommendations expressed in this material
are those of the author and do not necessarily reflect the views of the
Defense Advanced Research Projects Agency.
Page editor: Jonathan Corbet