Brief items
The 2.6.34 merge window is open so there is no development kernel
release to mention at this time. See the separate article, below, for a
summary of what has been merged for 2.6.34 so far.
There have been no stable updates released over the last week, and
none are currently in the review process.
Comments (none posted)
So guys: feel free to rebase. But when you do, wait a week
afterwards. Don't "rebase and ask Linus to pull". That's just
_wrong_. It means that the tree you asked me to pull has gotten
zero testing.
--
Linus Torvalds
yikes, that macro should be killed with a stick before it becomes
self-aware and starts breeding.
--
Andrew Morton tries to save us all
Comments (none posted)
The Free Software Foundation Latin America has sent out an announcement for
its 2.6.33-libre kernel distribution. "
Linux hasn't been Free
Software since 1996, when Mr Torvalds accepted the first pieces of non-Free
Software in the distributions of Linux he has published since 1991. Over
these years, while this kernel grew by a factor of 14, the amount of
non-Free firmware required by Linux drivers grew by an alarming factor of
83. We, Free Software users, need to join forces to reverse this trend,
and part of the solution is Linux-libre, whose release 2.6.33-libre was
recently published by FSFLA, bringing with it freedom, major improvements
and plans for the future." Many words are expended on their
motivations and methods, but they don't get around to saying where to get
the package; interested users should look
over
here.
Full Story (comments: 131)
By Jonathan Corbet
March 3, 2010
Every merge window seems to exhibit a theme or two, usually along the lines
of "how not to try to merge code." This time around, it seems to be
configuration options; a few new features have shown up with their
associated configuration options set to "yes" by default. That goes
against established practice and tends to make Linus grumpy. He
put it this way:
But if it's not an old feature that used to not have a config
option at all, and it doesn't cure cancer, you never EVER do
"default y". Because when I do "make oldconfig", and I see a "Y"
default, it makes me go "I'm not pulling that piece of sh*t".
The message seems clear: new features aimed at the mainline should not be
configured in by default.
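In Kconfig terms, a new feature's entry should look something like the following hypothetical example (the option name is invented); leaving out the "default" line entirely has the same effect, since options default to "n":

    config SHINY_NEW_FEATURE
    	bool "Support for the shiny new feature"
    	default n
    	help
    	  Enable support for the shiny new feature. If unsure, say N.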
Comments (none posted)
By Jonathan Corbet
March 3, 2010
For the last few years, the development community interested in
implementing containers has been working to add a variety of namespaces to
the kernel. Each namespace wraps around a specific global kernel resource
(such as the network environment, the list of running processes, or the
filesystem tree), allowing different containers to have different views of
that resource. Namespaces are tightly tied to process trees; they are
created with new processes through the use of special flags to the
clone() system call. Once created, a namespace is only visible to
the newly-created process and any children thereof, and it only lives as
long as those processes do. That works for many situations, but there are
others where it would be nice to have longer-lived namespaces which are
more readily accessible.
To that end, Eric Biederman has proposed the creation of a pair
of new system calls. The first is the rather tersely named
nsfd():
int nsfd(pid_t pid, unsigned long nstype);
This system call will find the namespace of the given nstype which
is in effect for the process identified by pid; the return value
will be a file descriptor which identifies - and holds a reference to -
that namespace. The calling process must be able to use ptrace()
on pid for the call to succeed; in the current patch, only network
namespaces are supported.
Simply holding the file descriptor open will cause the target namespace to
continue to exist, even if all processes within it exit. The namespace can
be made more visible by creating a bind mount on top of it with a command
like:
mount --bind /proc/self/fd/N /somewhere
The other piece of the puzzle is setns():
int setns(unsigned long nstype, int fd);
This system call will make the namespace indicated by fd into the
current namespace for the calling process. This solves the problem of
being able to enter another container's namespace without the somewhat
strange semantics of the once-proposed hijack() system call.
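To see how the two calls might fit together, consider this minimal sketch of a tool that runs a command inside another process's network namespace. It is a proof-of-concept companion to a proof-of-concept patch: the syscall numbers below are invented placeholders (none had been assigned at this stage), and only the prototypes shown above come from the actual proposal.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/syscall.h>

    /* nstype value for the network namespace, as in <linux/sched.h>. */
    #define CLONE_NEWNET	0x40000000

    /* Placeholder syscall numbers: the proposed calls had not been
       assigned real numbers at this stage. */
    #define __NR_nsfd_proposed	333
    #define __NR_setns_proposed	334

    int main(int argc, char **argv)
    {
    	pid_t pid;
    	int fd;

    	if (argc < 3) {
    		fprintf(stderr, "usage: %s pid command...\n", argv[0]);
    		return 1;
    	}
    	pid = (pid_t)atoi(argv[1]);

    	/* Get a descriptor for (and a reference to) the target's
    	   network namespace; this requires ptrace() rights. */
    	fd = syscall(__NR_nsfd_proposed, pid, (unsigned long)CLONE_NEWNET);
    	if (fd < 0) {
    		perror("nsfd");
    		return 1;
    	}

    	/* Move this process into that namespace. */
    	if (syscall(__NR_setns_proposed, (unsigned long)CLONE_NEWNET, fd) < 0) {
    		perror("setns");
    		return 1;
    	}

    	/* The command now sees the target's network environment. */
    	execvp(argv[2], &argv[2]);
    	perror("execvp");
    	return 1;
    }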
These new system calls are in an early, proof-of-concept stage, so they are
likely to evolve considerably between now and the targeted 2.6.35 merge.
Comments (3 posted)
By Jonathan Corbet
March 3, 2010
Many pixels have been expended about the presence of the Android code in
the mainline kernel, or, more precisely, the lack thereof. There are many
reasons for Android's absence, including the Android team's prioritization of
upcoming handset releases over upstreaming the code, and some strong technical
disagreements over parts of that code. For a while, it seemed that there
might be yet another obstacle: source files named after fish.
Like most products, Android-based handsets go through a series of code
names before they end up in the stores. Daniel Walker cited an example: an HTC handset which
was named "Passion" by the manufacturer. When it got to Google for the
Android work, they concluded that "Mahimahi" would be a good name for it.
It was only when this device got to the final stages that it gained the
"Nexus One" name. Apple's "dirty infringer" label came even later than that.
Daniel asked: which name should be used when bringing this code into the
mainline kernel? The Google developers who wrote the code used the
"mahimahi" name, so the source tree is full of files with names like
board-mahimahi-audio.c; they sit alongside files named after
trout, halibut, and swordfish. Daniel feels these names might be
confusing; for this reason, board-trout.c became
board-dream.c when it moved into the mainline. After all, very
few G1/ADP1 users think that they are carrying trout in their pockets.
The problem, of course, is that this kind of renaming only makes life
harder for people who are trying to move code between the mainline and
Google's trees. Given the amount of impedance which already exists on this
path, it doesn't seem like making things harder is called for. ARM
maintainer Russell King came to that
conclusion, decreeing:
There's still precious little to show in terms of progress on
moving this code towards the mainline tree - let's not put
additional barriers in the way.
Let's keep the current naming and arrange for informative comments
in files about the other names, and use the common name in the
Kconfig - that way it's obvious from the kernel configuration point
of view what is needed to be selected for a given platform, and it
avoids the problem of having effectively two code bases.
That would appear to close the discussion; the board-level Android code can
keep its fishy names. Of course, that doesn't help if the code doesn't
head toward the mainline anyway. The good news is that people have not
given up, and work is being done to help make that happen. With luck,
installing a mainline kernel on a swordfish will eventually be a
straightforward task for anybody.
Comments (20 posted)
Kernel development news
By Jonathan Corbet
March 3, 2010
As of this writing, the 2.6.34 merge window is open, with 4480 non-merge
changesets accepted so far. As usual, your long-suffering (i.e. slow-learning)
editor has read through all of them in order to produce this
summary of the most interesting changes. Starting with user-visible
changes:
- The asynchronous
suspend/resume patches have been merged, hopefully leading to
faster suspend and resume times. There is a new switch
(/sys/power/pm_async) allowing this feature to be turned on
or off globally; per-device switches have been added as well.
- The new "perf lock" command can generate statistics of lock usage and
contention.
- Python scripting
support has been added to the perf tool.
- Dynamic probe points can now be placed based on source line numbers
as well as on byte offsets.
- The SuperH architecture has gained support for three-level page
tables, LZO-compressed kernels, and improved hardware breakpoints.
- Support for running 32-bit x86 binaries has been removed from the ia64
(Itanium) architecture code. It has, evidently, been broken for
almost two years, and nobody noticed.
- The "vhost_net" virtual device has been added. Like the once-proposed
vringfd()
system call, vhost_net allows for efficient network connections into
virtualized environments.
- The networking layer now supports the RFC5082 "Generalized TTL
Security Mechanism," a denial-of-service protection for the BGP
protocol.
- The netfilter subsystem now supports connection tracking for TCP-based
SIP connections.
- The DECnet code has been orphaned, meaning that there is no longer a
maintainer for it. The prevailing opinion seems to be that there are
few or no users of this code left. If there are users
interested in DECnet support on contemporary kernels, it might be good
for them to make their existence known.
- Support for IGMP snooping has been added to the network bridge code;
this support enables the selective forwarding of multicast traffic.
- There is the usual pile of new drivers:
- Processors and systems: RTE SDK7786 SuperH boards,
Bluewater Systems Snapper CL15 boards,
Atmel AT572D940HF-EK development boards,
Nuvoton NUC93X CPUs,
Atmel AT572D940HF processors, and
Timll DevKit8000 boards.
- Input: Logitech Flight System G940 joysticks,
Stantum multitouch panels,
Quanta Optical Touch dual-touch panels,
3M PCT touchscreens,
Ortek WKB-2000 wireless keyboard + mouse trackpads,
MosArt dual-touch panels,
Apple Wireless "Magic" mouse devices,
IMX keypads, and
NEXIO/iNexio USB touchscreens.
- Media: Sonix SN9C2028 cameras,
cpia CPiA (version 1)-based USB cameras,
Micronas nGene PCIe bridges,
AZUREWAVE DVB-S/S2 USB2.0 (AZ6027) receivers,
Telegent tlg2300 based TV cards,
Texas Instruments TVP7002 video decoders,
Edirol UA-101 audio/MIDI interfaces,
Media Vision Jazz16-based sound cards,
Dialog Semiconductor DA7210 Soc codecs,
Wolfson Micro WM8904, WM8978, WM8994, WM2000, and WM8955 codecs, and
SH7722 Migo-R sound devices.
- Network: Intel 82599 Virtual Function Ethernet devices,
Qlogic QLE8240 and QLE8242 Converged Ethernet devices,
PLX90xx PCI-bridge based CAN interfaces,
Micrel KSZ8841/2 PCI Ethernet devices,
Atheros AR8151 and AR8152 Ethernet devices, and
Aeroflex Gaisler GRETH Ethernet MACs.
- Miscellaneous: Coldfire QSPI controllers,
DaVinci and DA8xx SPI modules,
ST-Ericsson Nomadik Random Number Generators,
Freescale MPC5121 built-in realtime clocks,
TI CDCE949 clock synthesizers, and
iMX21 onboard USB host adapters.
Changes visible to kernel developers include:
The merge window is normally open for two weeks, but Linus has suggested that it might be a
little shorter this time around. So, by the time next week's edition comes
out, chances are that the window will be closed and the feature set for
2.6.34 will be complete. Tune in then for a summary of the second half of
this merge window.
Comments (none posted)
March 3, 2010
This article was contributed by Mel Gorman
[Editor's note: this is the third part in Mel Gorman's series on the use
of huge pages in Linux. For those who missed them, a look at part 1 and
part 2 is recommended before diving into this installment.]
In this chapter, the setup and the administration of huge pages within the
system are addressed.
Part 2 discussed the different interfaces between user and kernel space
such as hugetlbfs and shared memory. For an application to use these
interfaces, though, the system must first be properly configured.
Use of hugetlbfs requires only that the filesystem be mounted;
shared memory needs additional
tuning and huge pages must also be allocated. Huge pages can be statically
allocated as part of a pool early in the lifetime of the system or the pool
can be allowed to grow dynamically as required. Libhugetlbfs provides a
hugeadm utility that removes much of the tedium involved in these tasks.
1 Identifying Available Page Sizes
Since kernel 2.6.27, Linux has supported more than one huge page
size if the underlying hardware does. There will be one directory per page
size supported in /sys/kernel/mm/hugepages and the "default" huge
page size will be stored in the Hugepagesize field in
/proc/meminfo.
The default huge page size can be important: while hugetlbfs can specify the
page size at mount time, the same option is not available for shared memory or
MAP_HUGETLB. That matters when using 1G pages on AMD processors or 16G pages
on Power 5+ and later. The default huge page size can be set either with the
last hugepagesz= option on the kernel command line (see below) or
explicitly with default_hugepagesz=.
Libhugetlbfs provides two means of identifying the huge
page sizes. The first is the pagesize utility: the
-H switch prints the available huge page sizes, while
-a shows all page sizes. The programming equivalents are the
gethugepagesizes() and getpagesizes() calls.
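For example, on an x86 system supporting 2MB huge pages, a session might look like this (the values are illustrative and will vary with the hardware):

    $ pagesize -H
    2097152
    $ pagesize -a
    4096
    2097152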
2 Huge Page Pool
Due to the inability to swap huge pages, none are allocated by default,
so a pool must be configured with either a static or a dynamic size. The
static size is the number of huge pages that are pre-allocated and guaranteed
to be available for use by applications. Where it is known
in advance how many huge pages are required, the static size should be set.
The size of the static pool may be set in a number of ways. First, it may be
set at boot-time using the hugepages= kernel boot parameter. If
there are multiple huge page sizes, the hugepagesz= parameter
must be used and interleaved with hugepages= as described in
Documentation/kernel-parameters. For example, Power 5+ and later
support multiple page sizes including 64K and 16M; both could be configured
with:
hugepagesz=64k hugepages=128 hugepagesz=16M hugepages=4
Second, the size of the pool for the default huge page size can be set at run
time with the vm.nr_hugepages sysctl. Third, it may be set via sysfs
by writing to the appropriate nr_hugepages virtual file below
/sys/kernel/mm/hugepages.
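As an illustration, the following would set the default pool to 128 pages at run time, then size a 16M pool (on hardware that supports it) through sysfs; the page counts are arbitrary examples:

    $ sysctl -w vm.nr_hugepages=128
    $ echo 4 > /sys/kernel/mm/hugepages/hugepages-16384kB/nr_hugepages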
Knowing the exact huge page requirements in advance may not be possible.
For example, the huge page requirements may be expected to vary
throughout the lifetime of the system. In this case, the maximum number
of additional huge pages that should be allocated is specified with the
vm.nr_overcommit_hugepages sysctl. When a huge page pool does not
have sufficient pages to satisfy a request, an attempt is made to allocate up
to nr_overcommit_hugepages additional pages. If such an allocation
fails, the result is a failed mmap() call, which avoids the page
fault failures described in Huge Page Fault Behaviour in part 1.
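For example, to permit up to 64 additional huge pages of the default size to be allocated dynamically (again, an arbitrary figure):

    $ sysctl -w vm.nr_overcommit_hugepages=64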
It is easiest to tune the pools with hugeadm. The
--pool-pages-min argument specifies the minimum number of huge
pages that are guaranteed to be available. The --pool-pages-max
argument specifies the maximum number of huge pages that will exist in the
system, whether statically or dynamically allocated. The page size can be
specified or it can be simply DEFAULT. The amount to allocate
can be specified as either a number of huge pages or a size requirement.
In the following example, run on an x86 machine, the 4M huge page pool is being
tuned. As 4M also happens to be the default huge page size on this machine, the
pools could also have been specified as DEFAULT:32M and
DEFAULT:64M respectively.
$ hugeadm --pool-pages-min 4M:32M
$ hugeadm --pool-pages-max 4M:64M
$ hugeadm --pool-list
Size Minimum Current Maximum Default
4194304 8 8 16 *
To confirm that the huge page pools are tuned as required,
hugeadm --pool-list will report the minimum, maximum, and
current usage of huge pages of each size supported by the system.
3 Mounting HugeTLBFS
To access the special filesystem described in HugeTLBFS in part 2, it
must first be mounted. What may be less obvious is that this is required to
benefit from the use of the allocation API, or to automatically back
segments with huge pages (as also described in part 2). The default huge page
size is used for the mount if the pagesize= option is not
specified. The following example mounts two filesystem instances with
different page sizes, as supported on Power 5+.
$ mount -t hugetlbfs none /mnt/hugetlbfs-default
$ mount -t hugetlbfs none /mnt/hugetlbfs-64k -o pagesize=64K
Ordinarily it would be the responsibility of the administrator to set the
permissions on this filesystem appropriately. hugeadm provides
a range of options for creating mount points with different permissions.
The options are as follows and are largely self-explanatory.
- --create-mounts: creates a mount point for each available
  huge page size on this system under
  /var/lib/hugetlbfs.
- --create-user-mounts <user>: creates a mount point for each
  available huge page size under /var/lib/hugetlbfs/<user>
  usable by user <user>.
- --create-group-mounts <group>: creates a mount point for each
  available huge page size under /var/lib/hugetlbfs/<group>
  usable by group <group>.
- --create-global-mounts: creates a mount point for each available
  huge page size under /var/lib/hugetlbfs/global
  usable by anyone.
It is left to the administrator's discretion whether to call
hugeadm from a system initialization script or to create
appropriate fstab entries. If it is unclear what mount points
already exist, use --list-all-mounts to list all current
hugetlbfs mounts and the options used.
3.1 Quotas
A little-used feature of hugetlbfs is quota support, which
limits the number of huge pages that a filesystem instance can use even if
more huge pages are available in the system. The expected use case would
be to limit the number of huge pages available to a user or group. While
it is not currently supported by hugeadm, the quota can be set
with the size= option at mount time.
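For example, the following hypothetical mount would limit the instance to 512MB worth of huge pages, regardless of how many are available system-wide:

    $ mount -t hugetlbfs none /mnt/hugetlbfs-quota -o size=512M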
4 Enabling Shared Memory Use
There are two tunables that are relevant to the use of huge pages with shared
memory. The first is the kernel.shmmax sysctl, which can be
configured permanently in /etc/sysctl.conf or temporarily via
/proc/sys/kernel/shmmax. The second is the
vm.hugetlb_shm_group sysctl, which stores the group ID (GID) that
is allowed to create shared memory segments backed by huge pages. For example,
if a JVM using huge-page-backed shared memory ran as the user JVM with UID
1500 and GID 3000, then the value of this tunable should be 3000.
Again, hugeadm is able to tune both of these parameters
with the switches --set-recommended-shmmax and
--set-shm-group. As the recommended value is calculated
based on the size of the static and dynamic huge page pools, this should
be called after the pools have been configured.
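Continuing the JVM example above, the two switches might be used as follows (assuming --set-shm-group accepts a numeric GID):

    $ hugeadm --set-recommended-shmmax
    $ hugeadm --set-shm-group 3000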
5 Huge Page Allocation Success Rates
If the huge page pool is statically allocated at boot-time, then this
section will not be relevant as the huge pages are guaranteed to exist. In
the event the system needs to dynamically allocate huge pages throughout
its lifetime, then external fragmentation may be a problem.
"External fragmentation" in this context refers to the inability of the
system to allocate a huge page even if enough memory is free overall because the
free memory is not physically contiguous. There
are two means by which external fragmentation can be controlled, greatly
increasing the success rate of huge page allocations.
The first means is by tuning vm.min_free_kbytes to a
higher value which helps the kernel's fragmentation-avoidance mechanism.
The exact value depends on the type of system, the number of NUMA nodes
and the huge page size, but hugeadm can calculate and set it
with the --set-recommended-min_free_kbytes switch. If
necessary, the effectiveness of this step can be measured using the
trace_mm_page_alloc_extfrag tracepoint and ftrace,
though describing how to do so is beyond the scope of this article.
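The switch can simply be run as root; the value it sets can then be inspected through procfs:

    $ hugeadm --set-recommended-min_free_kbytes
    $ cat /proc/sys/vm/min_free_kbytes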
While the static huge page pool is guaranteed to be available as it has
already been allocated, tuning min_free_kbytes improves the
success rate when dynamically growing the huge page pool beyond its minimum
size. The static pool sets a guaranteed lower bound, but there is no
guaranteed upper bound on the number of huge pages that will actually be
available. For example, an administrator might request a minimum pool of 1G
and a maximum pool of 8G, but fragmentation may mean that the real upper
bound is 4G.
If a guaranteed upper bound is required, a memory partition can be created
using either the kernelcore= or movablecore= switch
at boot time. These switches create a Movable zone that can be seen in
/proc/zoneinfo or /proc/buddyinfo. Only pages that
the kernel can migrate or reclaim exist in this zone. By default, huge pages
are not allocated from this zone, but such allocation can be enabled by
setting the vm.hugepages_treat_as_movable sysctl or by using the
hugeadm --enable-zone-movable switch.
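For example, a system could be booted with kernelcore=1G (an illustrative value reserving 1GB for non-movable kernel allocations, with the remainder going to the Movable zone), after which huge page allocation from that zone can be enabled with either of:

    $ hugeadm --enable-zone-movable
    $ sysctl -w vm.hugepages_treat_as_movable=1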
6 Summary
In this chapter, four sets of system tunables were described. These relate to
the allocation of huge pages, the use of hugetlbfs filesystems, the use of
shared memory, and the improvement of allocation success rates when dynamic
pool sizing is in use. Once the administrator has made a choice, it should be
implemented as part of a system initialization script. In the next chapter,
it will be shown how some common benchmarks can be easily converted to use
huge pages.
Comments (6 posted)
By Jake Edge
March 3, 2010
Canonical's kernel manager, Pete Graner, spoke at UbuCon—held
just prior to SCALE 8x—on the "Ubuntu Kernel Development Process".
In the talk, he looked at how Ubuntu decides what goes into the kernel and
how that kernel gets built and tested. It provided an interesting look
inside the process that results in a new kernel getting released for each
new Ubuntu version, which comes along every six months.
Graner manages a "pretty big" group of 25 people at Canonical,
split into two sub-groups, one focused on the kernel itself and the other on
drivers. For each release, the kernel team chooses a "kernel release lead"
(KRL) who is responsible for ensuring that the kernel is ready for the
release and its users. The KRL
rotates among team members with Andy Whitcroft handling Lucid Lynx (10.04)
and Leann Ogasawara slated as KRL for the following ("M" or 10.10) release.
The six-month development cycle is "very challenging", Graner
said. The team needs to be very careful about which drivers—in-tree,
out-of-tree, and staging—are enabled. The team regularly takes some
drivers from the staging tree and fixes them up a bit before enabling
them in the Ubuntu tree so that users "get better hardware
coverage".
Once the kernel for a release has been frozen, a new branch is created for
the next release. For example, the Lucid kernel will be frozen in a few
weeks, at which point a branch will be made for the 10.10 release. That
branch will get the latest "bleeding edge" kernel from Linus Torvalds's tree
(presumably 2.6.34-rc1), and the team will start putting the additional
patches onto that branch.
The patches that are rolled into the tree include things from linux-next
(e.g. suspend/resume fixes), any patches that Debian has added to its
kernel, and then the Ubuntu-specific patch set. Any of those that have been
merged into the mainline can be dropped from the list, but it is a
"very time-consuming effort" to go through the git tree to
figure all of that out. With each new tag from Torvalds's tree, they do a
git rebase on their tree—as it is not a shared development
tree—"see what conflicts, and deal with those".
The focus and direction for the Ubuntu kernel, like all Ubuntu
features, comes out of the Ubuntu Developer Summit (UDS), which is held
shortly after each release to set goals and make plans for the following
release. Before UDS, the kernel team selects some broad topics and creates
blueprints on the wiki to describe those topics. In the past, they have
focused on things like suspend/resume, WiFi networking, and audio; "a
big one going forward is power management", he said.
The specifications for these features are "broad-brush
high-level" descriptions (e.g. John has a laptop and wants to get 10
hours of battery life). The descriptions are fleshed out into various use
cases, which results in a plan of action. All of the discussion,
decisions, plans, and so on are captured on the UDS wiki.
One of the longer kernel sessions at UDS looks at each kernel configuration
option (i.e. the kernel .config file) to determine which should be
enabled. New options are looked at closely to decide whether that feature
is needed, but the existing choices are scrutinized as well.
In addition, Graner said that the team looks at the patches and drivers
that were added to the last kernel to see which of those should be
continued in the next release. He pointed to Aufs as a problematic feature
because it breaks with each new kernel release and can take up to
three weeks to get working again. They have talked about dropping it, because
Torvalds won't merge it into the mainline, but the live CDs need it.
The kernel team has to balance the Ubuntu community's needs against
Canonical's business needs (for things like Ubuntu One, for example) and
come up with a set of kernel features that will satisfy both. The
discussions at UDS about what will get in can get intense at times, Graner
said: "Lucid was pretty tame, but Karmic was kind of heated".
Lucid will ship with
the 2.6.32 kernel, which makes sense for a long-term support (LTS) release.
2.6.32 will be supported as a stable tree release for the next several
years and will be shipped with the next RHEL and SLES releases. That means it
will get better testing coverage, which will lead to a "very stable kernel
for Lucid".
Each stable tree update will be pulled into the Ubuntu kernel tree, but LTS
updates to the kernel will only be pushed out quarterly unless there is a
"high" or "medium" security fix. For new kernel feature development, new
mainline kernel releases and release
candidates are pulled in by the team as well. Graner gave two examples of new
development that is going on in the Ubuntu kernel trees: adding devicetree
support for the ARM architecture, which will reduce the complexity of
supporting multiple ARM kernels, and the AppArmor security module that is
being targeted for the 2.6.34 kernel.
Once the kernel version has been frozen for a release, the management
of that kernel is much more strictly controlled. The only patches that get
applied are those that have a bug associated with them. Stable kernel
patches are "cherry-picked" based on user or security problems. There is a
full-time kernel bug triager who tries to determine whether a bug reporter
has included enough information to have any hope of finding the
problem—otherwise the report gets dropped. One way to ensure a bug gets
fixed, though, is to "show the
upstream patch that fixes the problem"; if that happens, it will get
pulled into the kernel, Graner said.
There are general freezes for each alpha, beta, and the final release, but
the kernel must already be in the archive by the time of those freezes. Each
time the kernel itself freezes, it "takes almost a full week to build
all of the architectures" that are supported by Ubuntu. There are
more architectures supported by Ubuntu than any other distribution
"that I am aware of", he said. Each build is done in a
virtualized environment with a specific toolchain that can be recreated
whenever an update needs to be built. All of that means the kernel
needs to freeze well in advance of the general release freeze, typically
about a month before.
Once the kernel is ready, it is tested in Montreal in a lab with 500 or
600 machines. The QA team runs the kernels against all that hardware,
which is also a time-consuming process. Previously, the kernels would be
tossed over the wall for users to test, but "now Canonical is trying
to do better" by dedicating more resources to testing and QA.
Managing kernel releases for a distribution is a big task, and the details of
that process are not generally very well-known. Graner's talk helped to
change that, which should allow others to become more involved in the
process. Understanding how it all works will help those outside of the
team do a better job of working with the team, which should result in
better kernels for Ubuntu users.
Comments (15 posted)
Page editor: Jonathan Corbet