Brief items
The current development kernel is 2.6.38-rc1, released on January 18.
"It's been two weeks, and the merge window for 2.6.38 is thus closed. And an
interesting merge window it has been." All told, a little over 7,600
changes were merged in this merge window. Some of them affect core code in
fairly invasive ways, and there have been some significant regressions
reported already. People testing 2.6.38-rc1 might want to be just a little
more careful (and better backed up) than usual. See the full changelog for
all the details.
Stable updates: there have been no stable updates released in the
last week.
Free software's awfully like sausages - wonderfully tasty, but
sometimes you suddenly discover that you've been eating sheep
nostrils for the past 15 years of your life.
-- Matthew Garrett
It is often the case that certain developers don't have a full
understanding of the cost of the code they're writing, even when
they're using C. Not everyone is forced to spend part of their life
looking at the compiler output from their code and seeing what it
actually does, although they ought to be.
-- David Woodhouse
By Jonathan Corbet
January 19, 2011
It has been almost three years since the
creation of the linux-next tree; during that
time, it has become an indispensable part of the kernel development
process. By the time code is merged into the mainline during the merge
window, it has already seen a fair amount of integration and compilation
testing in linux-next - and even some actual run testing. That has helped
to make the merge window run more smoothly. So it's not surprising that
developers are getting increasingly grumpy when code is seen to be
circumventing linux-next and creating problems in the mainline.
We've had a couple of examples of that grumpiness in the 2.6.38 cycle.
When Al Viro posted his first VFS pull request, linux-next maintainer Stephen
Rothwell complained that this was his first
sighting of that code, despite the fact that it had apparently been around
for a few months. Al is known for pulling together mainline submissions at
the last minute, so this sort of thing is not entirely surprising; it
remains to be seen whether he can be pushed into changing his ways.
The other complaint came after the merging of the transparent huge pages
patch set, which went in by way of Andrew Morton's -mm tree. Tony Luck,
having discovered that the ia64 architecture no longer built in the
mainline, asked:
Didn't Andrew make some rash promise at kernel summit about
stopping eating if "-mm" wasn't included in linux-next by the end
of November? Must be getting pretty hungry by now.
Andrew responded that "It's taking a
while - Stephen and I are discussing a plan." Integrating -mm was
always going to be a bit of a challenge; linux-next is supposed to contain
code which is ready for merging into the mainline, while -mm can carry
under-development code for years. Until that gets worked out, though,
memory management developers are going to be in a bit of a difficult
position; there is no maintainer tree they can get into which feeds into
linux-next. Those developers will need to either get their own trees into
linux-next (an easy thing to do) or take the complaints when code which
lived in -mm is seen by testers for the first time when it hits the
mainline.
Kernel development news
By Jonathan Corbet
January 19, 2011
As of the 2.6.38-rc1 release, some 7616 non-merge changesets had been
pulled into the mainline kernel. A number of significant changes have been
merged since last week's summary; the most interesting changes visible to
users are:
- The transparent huge pages feature has
been merged. THP attempts to maximize the use of huge pages in the
system (boosting performance) without requiring application changes or
administrator overhead.
- A new tool called turbostat has been added; it can be used to
obtain various types of performance statistics from Intel processors.
Also added is x86_energy_perf_policy, which can be used to
tweak the performance/power usage tradeoff on Intel CPUs.
- The taskstats API has been changed to use different alignments for
returned values; this may break applications which were dependent on
the old arrangement.
- The kernel can now synchronize its internal time to an external
pulse-per-second (PPS) signal with a high degree of accuracy. The kernel
has also gained the ability to generate (and accept) PPS signals on a
parallel port, assuming one can still find a computer with such a port.
- The x86 architecture can now boot XZ-compressed kernels.
- Basic support for multitouch panels has been added to the human input
devices (HID) layer.
- The kernel now has support for the RFC4106 AES-GCM cryptographic
algorithm.
- The fallocate() system call can now be used to punch holes in the middle of files.
Currently this feature is supported by XFS and OCFS2.
- The XFS filesystem supports the FITRIM ioctl(),
allowing discard operations to be initiated from user space.
"This is not intended to be run during normal workloads, as the
freepsace btree walks can cause large performance degradation."
- The LIO SCSI target core has been merged.
- The block I/O bandwidth controller can now be used with hierarchical
control groups.
- The block layer has a new "events handling" mechanism. What that
means is that detection of device events (the insertion of an optical
disc, for example) can be done in the drivers, eliminating the need to
poll devices from user space.
- The device mapper dm-crypt target has a new "multikey" mode whereby
different blocks can be encrypted with different keys. The crypt
target is also now able to access encrypted partitions created with
the out-of-tree loop-AES package.
- The device mapper has gained the ability to manage RAID 4/5/6 volumes
using the MD RAID drivers.
- The clone() system call no longer honors the long-deprecated
CLONE_STOPPED flag.
- The btrfs filesystem has gained support for read-only snapshots and
LZO compression.
- New drivers:
  - Systems and processors: ALPHAPROJECT AP-SH4A-3A and AP-SH4AD-0A
    reference boards, Acme Systems srl FOX G20 boards, GeoSIG GS_IA18_S
    boards, and Atheros AR71XX/AR724X/AR913X SoCs.
  - Audio: HP t5325 audio devices, Realtek alc562x codecs, Wolfson Micro
    WM8770 and WM8995 codecs, Wolfson Micro WM8958 multi-band compressors,
    and Wolfson Micro WM8737 analog-to-digital converters.
  - Input: Austria Microsystem AS5011 joysticks.
  - Miscellaneous: NXP Semiconductors PN544 near-field communication chips,
    Oki Semiconductor ML7213 IOH GPIO controllers, Freescale MC13892 PMIC
    regulators, Freescale MCF548x watchdog timers, TI TPS6524X Power
    regulators, AMD/ATI SP5100 TCO timer/watchdog chipsets, Atheros
    AR71XX/AR724X/AR913X hardware watchdogs, nVidia TCO timer/watchdog
    devices, Intel EG20T platform controller hubs, and Maxim
    MAX17042/8997/8966 fuel gauges.
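As an illustration of the hole-punching item in the list above, here is a minimal user-space sketch. The helper name punch_hole() is invented for this example; fallocate() and the FALLOC_FL_PUNCH_HOLE and FALLOC_FL_KEEP_SIZE flags are the interface exposed through the C library and <linux/falloc.h>. Whether the call succeeds depends on the filesystem holding the file.

```c
#define _GNU_SOURCE             /* for the fallocate() declaration */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>       /* FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */
#include <sys/types.h>

/*
 * Deallocate `len` bytes starting at `offset` without changing the file
 * size. Returns 0 on success or a negative errno value (for example
 * -EOPNOTSUPP when the filesystem cannot punch holes).
 */
int punch_hole(int fd, off_t offset, off_t len)
{
    /* Hole punching has to be combined with FALLOC_FL_KEEP_SIZE. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  offset, len) == -1)
        return -errno;
    return 0;
}
```

A caller would open the file for writing and pass the descriptor in; reads from the punched range then return zeroes while the file length stays the same.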
Changes visible to kernel developers include:
- ktest.pl, a script which can automate the process of building,
testing, and bisecting kernels, has been added to the tools
directory.
- The "%pK" format specifier can be used to print the value of
potentially sensitive kernel pointers, especially in places like
/proc files. The behavior of this specifier depends on the
value of /proc/sys/kernel/kptr_restrict; a value of zero
means that kernel pointers will be printed as usual, one causes
pointers to be printed as zero for users without CAP_SYSLOG,
and two hides the pointers for all users.
- cdev_index() has been removed; since there are no in-kernel
users, nobody is likely to notice.
- The new function kref_test_and_get() will take a reference
only if the current reference count is not zero.
- Some new dentry operations have been added to support automounters
within the VFS.
- The fallocate() filesystem callback has been moved from
struct inode_operations to struct file_operations.
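The three kptr_restrict settings described for "%pK" above can be summarized with a small user-space model. This is not kernel code: the function name is made up for illustration, and representing a hidden pointer as zero in the "two" case is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the %pK policy: `kptr_restrict` mirrors the value of
 * /proc/sys/kernel/kptr_restrict and `has_cap_syslog` says whether the
 * reader holds CAP_SYSLOG. Returns the value the reader would see.
 */
uintptr_t pk_visible_value(uintptr_t ptr, int kptr_restrict,
                           bool has_cap_syslog)
{
    assert(kptr_restrict >= 0 && kptr_restrict <= 2);

    switch (kptr_restrict) {
    case 0:
        return ptr;                         /* printed as usual */
    case 1:
        return has_cap_syslog ? ptr : 0;    /* zero without CAP_SYSLOG */
    default:
        return 0;                           /* hidden from everyone */
    }
}
```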
With the 2.6.38 feature set complete, the process of stabilizing all of
this new code can continue; expect a final 2.6.38 release sometime in late
March.
By Jonathan Corbet
January 19, 2011
The memory management unit in almost any contemporary processor can handle
multiple page sizes, but the Linux kernel almost always restricts itself to
just the smallest of those sizes - 4096 bytes on most architectures. Pages
which are larger than that minimum - collectively called "huge pages" - can
offer better performance for some workloads, but that performance benefit
has gone mostly unexploited on Linux. That may change in 2.6.38, though,
with the merging of the transparent huge page feature.
Huge pages can improve performance through reduced page faults (a single
fault brings in a large chunk of memory at once) and by reducing the cost
of virtual to physical address translation (fewer levels of page tables
must be traversed to get to the physical address). But the real advantage
comes from avoiding translations altogether. If the processor must
translate a virtual address, it must go through as many as four levels of
page tables, each of which has a good chance of being cache-cold, and,
thus, slow. For this reason, processors maintain a "translation lookaside
buffer" (TLB) to cache the results of translations. The TLB is often quite
small; running cpuid on your editor's aging desktop machine yields:
    cache and TLB information (2):
       0xb1: instruction TLB: 2M/4M, 4-way, 4/8 entries
       0xb0: instruction TLB: 4K, 4-way, 128 entries
       0x05: data TLB: 4M pages, 4-way, 32 entries
So there is room for 128 instruction translations, and 32 data
translations. Such a small cache is easily overrun, forcing the CPU to
perform large numbers of address translations. A single 2MB huge page
requires a single TLB entry; the same memory, in 4KB pages, would need 512
TLB entries. Given that, it's not surprising that the use of huge pages
can make programs run faster.
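The arithmetic behind those numbers is simple enough to check by hand. The sketch below uses two helper names invented for this example: one computes how many TLB entries a region needs at a given page size, the other how much memory a TLB of a given size can cover.

```c
#include <assert.h>

#define KB 1024UL
#define MB (1024UL * KB)

/* TLB entries needed to map `bytes` of memory using pages of `page_size`. */
unsigned long tlb_entries_needed(unsigned long bytes, unsigned long page_size)
{
    assert(page_size > 0);
    return (bytes + page_size - 1) / page_size;     /* round up */
}

/* Memory covered ("reach") by a TLB holding `entries` translations. */
unsigned long tlb_reach(unsigned long entries, unsigned long page_size)
{
    return entries * page_size;
}
```

With these, tlb_entries_needed(2 * MB, 4 * KB) gives the 512 entries mentioned above, while tlb_reach(128, 4 * KB) shows that the 128-entry instruction TLB covers only 512KB of code when 4KB pages are in use.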
The main kernel address space is mapped with huge pages, reducing TLB
pressure from kernel code. The only way for user space to take advantage
of huge pages in current kernels, though, is through hugetlbfs, which
was extensively documented here in early
2010. Using hugetlbfs requires significant work from both application
developers and system administrators; huge pages must be set aside at boot
time, and applications must map them explicitly. The process is fiddly
enough that use of hugetlbfs is restricted to those who really care and who
have the time to mess with it. Hugetlbfs is often seen as a feature for
large, proprietary database management systems and little else.
There would be real value in a mechanism which would make the use of huge
pages easy, preferably requiring no development or administrative attention
at all. That is the goal of the transparent huge pages (THP) patch, which was
written by Andrea Arcangeli and merged for 2.6.38. In short, THP tries to
make huge pages "just happen" in situations where they would be useful.
Current Linux kernels assume that all pages found within a given virtual
memory area (VMA) will be the same size. To make THP work, Andrea had to
start by getting rid of that assumption; thus, much of the initial part of
the patch series is dedicated to enabling mixed page sizes within a VMA.
Then the patch modifies the page fault handler in a simple way: when a
fault happens, the kernel will attempt to allocate a huge page to satisfy
it. Should the allocation succeed, the huge page will be filled, any
existing small pages in the new page's address range will be released, and
the huge page will be inserted
into the VMA. If no huge pages are available, the kernel falls back to
small pages and the application never knows the difference.
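One detail the description above leaves implicit is that a huge page has to cover a naturally aligned 2MB region. The following user-space model sketches the kind of check a fault handler would need before attempting a huge page allocation. The structure and function names are hypothetical, and the containment rule is an assumption of this sketch rather than something taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)  /* the 2MB size handled by THP */

/* Hypothetical stand-in for a VMA: the half-open range [start, end). */
struct vma_range {
    uintptr_t start;
    uintptr_t end;
};

/* Round a faulting address down to the start of its 2MB-aligned region. */
uintptr_t huge_region_start(uintptr_t addr)
{
    return addr & ~(uintptr_t)(HUGE_PAGE_SIZE - 1);
}

/*
 * A huge page is only a candidate when the whole aligned region lies
 * inside the VMA; otherwise the fault falls back to small pages.
 */
bool huge_page_candidate(const struct vma_range *vma, uintptr_t fault_addr)
{
    uintptr_t haddr = huge_region_start(fault_addr);

    assert(vma->start <= fault_addr && fault_addr < vma->end);
    return haddr >= vma->start && haddr + HUGE_PAGE_SIZE <= vma->end;
}
```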
This scheme will increase the use of huge pages transparently, but it does
not yet solve the whole problem. Huge pages must be swappable, lest the
system run out of memory in a hurry. Rather than complicate the swapping
code with an understanding of huge pages, Andrea simply splits a huge page
back into its component small pages if that page needs to be reclaimed.
Many other operations (mprotect(), mlock(), ...) will
also result in the splitting of a huge page.
The allocation of huge pages depends on the availability of large,
physically-contiguous chunks of memory - something which Linux kernel
programmers can never count on. It is to be expected that those pages will
become available at inconvenient times - just after a process has faulted
in a number of small pages, for example. The THP patch tries to improve
this situation through the addition of a "khugepaged" kernel thread. That
thread will occasionally attempt to allocate a huge page; if it succeeds,
it will scan through memory looking for a place where that huge page can be
substituted for a bunch of smaller pages. Thus, available huge pages
should be quickly placed into service, maximizing the use of huge pages in
the system as a whole.
The current patch only works with anonymous pages; the work to integrate
huge pages with the page cache has not yet been done. It also only handles
one huge page size (2MB). Even so, some useful
performance improvements can be seen. Mel Gorman ran some benchmarks showing improvements of up
to 10% or so in some situations. In general, the results were not as good
as could be obtained with hugetlbfs, but THP is much more likely to
actually be used.
No application changes need to be made to take advantage of THP, but
interested application developers can try to optimize their use of it. A
call to madvise() with the MADV_HUGEPAGE flag will mark a
memory range as being especially suited to huge pages, while
MADV_NOHUGEPAGE will suggest that huge pages are better used
elsewhere. For applications that want to use huge pages, use of
posix_memalign() can help to ensure that large allocations are
aligned to huge page (2MB) boundaries.
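A minimal sketch of how an application might combine the two calls follows. The helper name alloc_thp_hinted() is invented here; posix_memalign(), madvise(), and MADV_HUGEPAGE are the interfaces named above. Since the hint is only advisory, the sketch ignores a madvise() failure, which could happen on a kernel built without THP.

```c
#define _DEFAULT_SOURCE         /* for the madvise() declaration */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>

#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)

/*
 * Allocate `size` bytes aligned to a 2MB boundary and mark the range as a
 * good candidate for huge pages. Returns NULL if the allocation fails;
 * release the buffer with free().
 */
void *alloc_thp_hinted(size_t size)
{
    void *buf = NULL;

    if (posix_memalign(&buf, HUGE_PAGE_SIZE, size) != 0)
        return NULL;

#ifdef MADV_HUGEPAGE
    /* Advisory only: the buffer is usable whether or not this succeeds. */
    (void)madvise(buf, size, MADV_HUGEPAGE);
#endif
    return buf;
}
```

The hint matters most when the enabled knob is set to "madvise"; under "always" the kernel will try huge pages for the region regardless.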
System administrators have a number of knobs that they can tweak, all found
under /sys/kernel/mm/transparent_hugepage. The enabled
value can be set to "always" (to always use THP),
"madvise" (to use huge pages only in VMAs marked with
MADV_HUGEPAGE), or "never" (to disable the feature).
Another knob, defrag, takes the same values; it controls whether
the kernel should make aggressive use of memory
compaction to make more huge pages available. There's also a whole set
of parameters controlling the operation of the khugepaged thread; see Documentation/vm/transhuge.txt for all the
details.
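Reading these knobs from a program is straightforward. The sketch below assumes that the enabled file lists all of the possible values on one line with the active one in square brackets (for example "always [madvise] never"); that format is not described above, so treat it as an assumption. Both helper names are invented for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy the bracketed (currently selected) word from a line such as
 * "always [madvise] never" into `out`. Returns 0 on success, or -1 if no
 * bracketed word is found or it does not fit in `out`.
 */
int thp_selected(const char *line, char *out, size_t out_len)
{
    const char *open = strchr(line, '[');
    const char *close = open ? strchr(open, ']') : NULL;
    size_t len;

    if (!open || !close)
        return -1;
    len = (size_t)(close - open - 1);
    if (len + 1 > out_len)
        return -1;
    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}

/* Read the active mode from the sysfs "enabled" file; 0 on success. */
int thp_read_enabled(char *out, size_t out_len)
{
    char line[128];
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");
    int ret = -1;

    if (!f)
        return -1;      /* file missing: THP is not available here */
    if (fgets(line, sizeof(line), f))
        ret = thp_selected(line, out, out_len);
    fclose(f);
    return ret;
}
```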
The THP patch has had a bit of a rough ride since being merged into the
mainline. This code never appeared in linux-next, so it surprised some
architecture maintainers when it caused build failures in the mainline.
Some bugs have also been found - unsurprising for a patch which is this
large and which affects so much core code. Those problems are being ironed
out, so, while 2.6.38-rc1 testers might want to be careful, THP should be
in a usable state by the time the final 2.6.38 kernel is released.
By Jonathan Corbet
January 19, 2011
A look inside any contemporary desktop-oriented system is likely to reveal
a process which is steadfastly polling removable drives on the off chance
that somebody might have removed or inserted a disk. Indeed, as your
editor can attest, it can be hard to turn that polling off; there's little
room in the world for strange people who have their own ideas of what they
want to happen when they put a disk into a drive. Be that as it may, if
the system is going to poll drives, it would be nice to do so in the best
way possible. That is not currently the case, but, thanks to a patch by
Tejun Heo, drive polling should be better in 2.6.38.
There are a few problems with how polling is done on Linux; these were
nicely outlined by Tejun in the
patch changelog. Polling on Linux requires opening the device; this is
a somewhat heavyweight operation which does not naturally line up with other
operations which might wake the processor. Opening the device in this way
might interfere with other users; optical disk burning, in particular, is
susceptible to this kind of problem. And polling the disk in this way
generates a different set of commands than Windows uses; as Linux driver
developers have discovered many times, behavior that differs from Windows
is not well tested by vendors and tends to have unpleasant bugs. All that
notwithstanding, user-space polling works well enough most of the time, but
it would still be nice to make it better.
Tejun's patch works by moving the polling into the kernel. That makes the
polling more efficient by removing the need to open the device and by
allowing the kernel to delay polling wakeups until something else is going
on as well. There is a new function added to struct
block_device_operations which should be implemented by drivers:
unsigned int (*check_events) (struct gendisk *disk, unsigned int clearing);
This function should check the device for new events and return a mask of
any which were found. Two events are currently defined:
DISK_EVENT_MEDIA_CHANGE and DISK_EVENT_EJECT_REQUEST, the
latter of which is new. The clearing parameter is a mask of
events which should be cleared until they happen again.
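To make the contract concrete, here is a user-space mock of what a driver's check_events() implementation might do. Nothing here is kernel code: struct mock_disk stands in for the driver state behind a struct gendisk, the bit values given to the two event constants are assumptions, and the reading of clearing (report what is pending, then forget the named events) is one plausible interpretation of the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for the kernel constants; the bit values are assumed. */
#define DISK_EVENT_MEDIA_CHANGE  (1u << 0)
#define DISK_EVENT_EJECT_REQUEST (1u << 1)

/* Hypothetical driver state for one disk. */
struct mock_disk {
    unsigned int pending;   /* events latched by the (imaginary) hardware */
};

/*
 * Same shape as the check_events() operation: report every pending event,
 * then drop the ones named in `clearing` so that they are not reported
 * again until they recur.
 */
unsigned int mock_check_events(struct mock_disk *disk, unsigned int clearing)
{
    unsigned int found;

    assert(disk != NULL);
    found = disk->pending;
    disk->pending &= ~clearing;
    return found;
}
```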
The old media_changed() block device operation still exists, but
its use has been deprecated; drivers should be updated to use
check_events() instead. Drivers should also, before adding a
block device, initialize two new struct gendisk fields:
unsigned int events;
unsigned int async_events;
A mask of all events which can be reported by the device should be stored
in events, while async_events should list the events
which can be reported without needing to poll the device.
A new kernel parameter (block.events_dfl_poll_msecs) tells the kernel
how often it should (by default) poll block devices. A value of zero (the
default, currently) disables polling entirely. Polling intervals for
specific devices can be set in their sysfs directories. If a device says
that it can report all events asynchronously, and no polling interval has
been explicitly set for it, that device will not be polled at all.
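The rules in the preceding paragraph amount to a small decision function, modeled below in user space. The function name is invented for this example, and using a negative per-device value to mean "not explicitly set" is an assumption of the sketch rather than something stated above.

```c
#include <assert.h>

/*
 * Model of the polling decision. `events` and `async_events` are the two
 * gendisk masks, `dev_poll_msecs` is the per-device setting (negative
 * means "unset" in this sketch), and `dfl_poll_msecs` is the system-wide
 * default. Returns the interval in milliseconds; 0 means no polling.
 */
unsigned int effective_poll_msecs(unsigned int events,
                                  unsigned int async_events,
                                  long dev_poll_msecs,
                                  unsigned long dfl_poll_msecs)
{
    if (dev_poll_msecs >= 0)
        return (unsigned int)dev_poll_msecs;    /* explicit setting wins */
    if (events & ~async_events)
        return (unsigned int)dfl_poll_msecs;    /* some events need polling */
    return 0;                                   /* everything is asynchronous */
}
```

Note that with the default of zero, a device that needs polling but has no interval of its own still goes unpolled, which matches the "disabled by default" behavior.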
Since user space is no longer polling the device with this scheme, it needs
a new way to find out when a disk event has happened. These events are now
signaled via a uevent, meaning they can be handled via udev or some other
utility which is watching those events. Note that any driver which handles
asynchronous event reporting must call kobject_uevent_env() itself
to send the event to user space. No driver in 2.6.38-rc1 does that; the
first developer to add such a call may want to add a helper function to the
core block code for that purpose.
Since polling is disabled by default, the kernel will behave the way it
always has and existing user space applications will work. Once the user
space environments have been changed to take advantage of this feature,
they can turn on kernel polling and stop opening the devices themselves.
That should lead to better power consumption and more reliable operation,
which can only be a good thing.
Page editor: Jonathan Corbet