The 3.8 merge window is still open
and patches continue to flow into
the mainline repository. See the separate article below for a summary of
significant changes for 3.8.
Stable updates: 3.0.57,
3.4.24, 3.6.11 and 3.7.1 were all released on December 17.
Note that 3.6.11 is the last planned 3.6 update.
Those who develop kernels for Android devices know how frustrating
porting a kernel to a new device has always been. Well if you share
that notion and would like this process to get easier than it is
right now, you will be pleased to know that Linus Torvalds has
announced ARM support in Linux.
has a less-than-authoritative moment.
So the math is confused, the types are confused, and the naming is
confused. Please, somebody check this out, because now *I* am
— Linus Torvalds
Kernel development news
Linus has been busy in the last week; as of this writing, some 6200
changesets have been
pulled into the mainline repository since last
week's summary. As a result, just over 10,000 changes have been merged
overall, making 3.8 the busiest merge window ever and the first to exceed
10,000 patches. And the merging process is not done yet.
Quite a few significant changes have been merged. Among other things, we
have seen a decision made on how the development of better NUMA balancing
will proceed. Without further ado, the most significant user-visible
changes merged in the last week include:
- The disagreement over how the kernel's
NUMA performance problems should be addressed was partially resolved
when Ingo Molnar agreed that Mel
Gorman's "balancenuma" patch
set should be merged as a base for future development. Balancenuma is
intended to get the fundamental infrastructure in place to allow
experimentation with placement and migration policies; it adds little
in the way of such policies itself. That base code has
been merged for 3.8; expect policy-oriented code to be pushed for the
3.9 development cycle.
- The huge zero page feature has been
merged, greatly reducing memory usage for some use cases.
- The kernel memory usage accounting
infrastructure has been merged, allowing the placement of
limitations on kernel memory use by any specific control group. See
the updated Documentation/cgroups/memory.txt file for
details on how to use this feature.
- The inline data patch set has been
merged into the ext4 filesystem. Ext4 can now store data for small
files directly in the inode, improving performance and space
efficiency. Ext4 also now supports the SEEK_HOLE and
SEEK_DATA lseek() operations.
- The Btrfs filesystem has a new "replace" operation to allow the
efficient replacement of a single drive in a volume.
- The tmpfs filesystem now supports the SEEK_HOLE and
SEEK_DATA lseek() operations.
- The user namespace completion patch
set has been pulled. Eric Biederman says: "This set of
changes adds support for unprivileged users to create user namespaces
and as a user namespace root to create other namespaces. The tyranny
of supporting suid root preventing unprivileged users from using cool
new kernel features is broken."
- The new system call:
int finit_module(int fd, const char *args, int flags);
can be used to load a kernel module from the given file descriptor.
This call was added by the ChromeOS developers so that they can accept
or reject a module depending on where it is stored in the filesystem.
- The batman-adv mesh networking subsystem has gained distributed
ARP table support.
- The tun/tap network driver and the virtio net driver both now support
multiple queues per device.
- The QFQ packet scheduler has been upgraded to "QFQ+", which is said to
be faster and more capable; see this
paper [PDF] for details.
- The s390 architecture has gained support for attached PCI buses.
- UEFI boot-time variables are now accessible via the new "efivars"
- The ptrace() system call has a new option flag,
PTRACE_O_EXITKILL, which causes all traced processes to
receive a SIGKILL signal if the tracing process exits.
- New hardware support includes:
Wolfson Microelectronics WM8766 and WM8776 codecs,
Philips PSC724 Ultimate Edge sound cards,
Freescale / iVeia P1022 RDK boards,
Maxim max98090 codecs, and
Silicon Laboratories 476x AM/FM radio chips.
LSI MPT Fusion SAS 3.0 host adapters, and
Chelsio T4-based 10Gb adapters (FCoE offload support).
NVIDIA Tegra20 display controllers and HDMI outputs.
ION iCade arcade controllers,
Wolfson Microelectronics "Arizona" haptics controllers,
Roccat Lua gaming mice,
TI ADC/touchscreen controllers, and
Dialog Semiconductor DA9055 ONKEY controllers.
The kernel has also gained support for human input devices
connected via i²c as described in
this document downloadable from Microsoft.
TI TPS51632 power regulators,
TI TPS80031/TPS80032 power regulators,
Versatile Express power regulators,
Versatile Express hardware monitoring controllers,
Maxim MAX8973 voltage regulators,
Dialog Semiconductor DA9055 regulators,
NXP Semiconductor PCF8523 realtime clocks (RTCs),
Dialog Semiconductor DA9055 RTCs,
CLPS711X host SPI controllers,
Nvidia Tegra20/Tegra30 SLINK controllers,
Nvidia Tegra20 serial flash controllers,
Nokia RX-51 (N900) battery controllers,
Solomon SSD1307 OLED controllers,
Nano River Technologies Viperboard multifunction controllers,
Nokia "Retu" multifunction controllers,
AMS AS3711 power management chips, and
Nokia CBUS-attached devices.
CDC mobile broadband interface model USB-attached adapters,
Atheros AR5523-based wireless adapters,
Realtek RTL8723AE wireless adapters,
Aeroflex Gaisler GRCAN and GRHCAN CAN controllers, and
Kvaser CAN/USB interfaces.
Samsung S3C24XX/S3C64XX SoC camera interfaces (full-memory write
access not required).
In contrast with the large number of new features, the number of
significant internal changes has been relatively small.
Changes visible to kernel developers include:
- The Video4Linux2 layer now supports the use of shared DMA buffers for frame I/O. See
the DocBook documentation for details on how to use this feature.
Also: the videobuf2 subsystem now
supports the use of scatterlists with user-space buffers in the
"contiguous" DMA mode.
- The input subsystem supports the use of "managed" devices via the new
One feature that has not been merged is RAID5/6 support for the Btrfs
filesystem. Those patches are being prepared for the mainline, though, and
can be expected in the 3.9 cycle. Meanwhile, the merge window could stay
open until as late as December 24, though Linus has threatened to
close it early. The final changes to be merged for 3.8 will be summarized
once that closure has happened.
Breaking the application binary interface (ABI) between the kernel and user
space is a well-known taboo for Linux. That line may seem a little
blurrier to some when it comes to the ABI for tools like perf that ship
with the kernel. As a recent discussion on the linux-kernel mailing list
shows, though, Linus Torvalds and others still have that line in sharp focus.
The issue stems from what appears to be a fairly serious bug in some x86
processors. Back in
July, David Ahern reported
that KVM-based virtual machines would crash when recording certain
events on the host. On some x86 processors, the "Precise Events
Based Sampling" (PEBS) mechanism can be used to gather precise counts of
events like CPU cycles. Unfortunately, PEBS and hardware virtualization
don't play nicely together.
As Ahern reported, running:
perf record -e cycles:p -ag -- sleep 10
on the host would reliably crash all of the guests. That
particular command will record the events specified, CPU
cycles in this case, to a file; see the perf documentation for more
information. It turns out that PEBS
incorrectly treats the contents of the Data Segment (DS) register as a guest address,
rather than as a host address. That leads to memory
corruption in the guest, which will crash all of the virtual machines on the
host. The ":p" (precise) attribute on the cycles
event (which can be
repeated for higher precision levels, as in cycles:pp)
asks for more precise measurements,
which leads to PEBS being used. Without that attribute, the
cycle counts measured are less accurate, but do not cause the VM crashes.
That problem led Peter Zijlstra to change
perf_event.c in the kernel to disallow precise measurements
unless guest measurement has been specifically excluded. Using the ":H" (host-only)
attribute will still allow precise measurements as perf will
set the exclude_guest flag on the event. That flag will inhibit
PEBS activity while in the guest. In addition, Ahern changed
perf so that exclude_guest would be automatically
selected if the "precise" attribute was set. There's just one problem with those solutions: existing
perf binaries do not set exclude_guest, so users
would get an EOPNOTSUPP error.
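To make the flags under discussion concrete, here is a rough sketch of how a tool might fill in the perf_event_attr structure for a precise cycles event. The precise_ip and exclude_guest fields come from <linux/perf_event.h>; the helper name init_precise_cycles() is invented for this example, and actually opening the event with perf_event_open() (which requires suitable privileges) is left out:

```c
#include <string.h>
#include <linux/perf_event.h>

/*
 * Hypothetical helper: describe a CPU-cycles event.  "precise" is the
 * number of 'p' modifiers (0 to 3); "host_only" mimics the ":H" attribute.
 */
void init_precise_cycles(struct perf_event_attr *attr,
                         unsigned int precise, int host_only)
{
    memset(attr, 0, sizeof(*attr));      /* fields not set below stay zero */
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_HARDWARE;
    attr->config = PERF_COUNT_HW_CPU_CYCLES;
    attr->precise_ip = precise;          /* nonzero asks for precise sampling */
    attr->exclude_guest = host_only ? 1 : 0;
}
```

With host_only set to zero, the result resembles what existing perf binaries pass for cycles:pp: exclude_guest is zero, which is the combination that the changed kernel rejects with EOPNOTSUPP.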
It turns out that one of those existing users is Torvalds, who complained that:
perf record -e cycles:pp
no longer worked for him. Ahern suggested
adding the "H" attribute, but that elicited an annoyed response
from Torvalds. Why should he
have to add a new flag to deal with virtualization, when he isn't running
it? "That whole 'exclude_guest' test is insane when there isn't any
virtualization going on."
Ahern countered that it's worse to have VMs
explode because someone runs a precise perf. But that's beside
the point, as Torvalds pointed out:
You broke the WORKING case for old binaries in order to give an error
return in a case that NEVER EVEN WORKED with those binaries. Don't you
see how insane that is?
The 'H' flag is totally the wrong way around. Exactly because it only
"fixes" a case that was already working, and makes a case that never
worked anyway now return an error value. That's not sane. Since the
old broken case never worked, nobody can have depended on it. See why
I'm saying that it's the people who use virtualization who should be
forced to use the new flag, not the other way around?
Forcing existing perf binary users to change their habits is the
crux of the matter. Beyond breaking the ABI, which is clearly
not allowed, it makes perf break for real users as Ingo Molnar said: "Old, working binaries are actually our _most_
important usecase: it's 99.9% of our current installed base ...".
While it is certainly a problem that older kernels can have all their
guests crashed with a simple command, the proper solution is not to require
either upgrading perf or changing the flags (which could well be
buried in scripts or other automation).
Existing perf binaries set the exclude_guest flag to
zero, while binaries that have Ahern's change set it to one.
That means newer kernels that seek to fix the crashing
guest bug cannot rely on a particular value for that flag. The "proper"
way to have handled the problem is to use a new include_guest
flag (or similar), which defaults to zero. Older binaries cannot change
that flag (since they don't know about it), so the kernel code can use it
to exclude the precise flag for guests on x86 systems. Other architectures
may not suffer from the same restriction.
Beyond that, Torvalds argues that if the
user asks for a precise measurement but doesn't specify either the
"H" or "G" (include
guests) attribute, the code should try to do the right thing. That means it
should measure both the host and guests on systems that support it, while
backing off to just the host for x86. Meanwhile it could return
EOPNOTSUPP if the user explicitly asks for a broken combination
(e.g. precise and include guests on x86). Molnar concurred. Ahern seemed a
bit unhappy about things, but said
that he would start working on a patch; that patch has not appeared yet.
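The policy Torvalds describes can be modelled as a small decision function. What follows is only an illustration of the reasoning above, not kernel code; the names (guest_request, resolve_precise_event(), arch_ok) are all made up for the example:

```c
#include <errno.h>
#include <stdbool.h>

/* What the user said about guests, if anything. */
enum guest_request {
    GUEST_UNSPECIFIED,   /* neither "H" nor "G" given */
    GUEST_EXCLUDE,       /* "H": host only */
    GUEST_INCLUDE,       /* "G": include guests */
};

/*
 * Decide whether guests are measured for a precise event.  Returns 0 and
 * sets *measure_guests on success, or -EOPNOTSUPP for a combination the
 * hardware cannot handle.  "arch_ok" says whether precise sampling of
 * guests works on this architecture (false for the x86 case above).
 */
int resolve_precise_event(enum guest_request req, bool arch_ok,
                          bool *measure_guests)
{
    switch (req) {
    case GUEST_EXCLUDE:
        *measure_guests = false;
        return 0;
    case GUEST_INCLUDE:
        if (!arch_ok)
            return -EOPNOTSUPP;      /* user explicitly asked for a broken mix */
        *measure_guests = true;
        return 0;
    case GUEST_UNSPECIFIED:
    default:
        *measure_guests = arch_ok;   /* quietly back off to host-only if needed */
        return 0;
    }
}
```

The important property is that the unspecified case, which is what old binaries produce, never returns an error.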
It is worth noting that Torvalds admitted
that he could trivially recompile perf to get around the whole
problem; it was a principle that he was standing up for. Even though some
tools like perf are distributed with the kernel tree, that does not
relax the "no regressions" rule. Some critics of the move to add tools to
the kernel tree were concerned that it would facilitate ABI changes that
could be glossed over by requiring keeping the tools and kernel in
sync. This discussion clearly shows that not to be the case.
Having a way to crash all the VMs on a system is clearly undesirable, but
as Torvalds pointed out, that had been true for quite some time.
Undesirable behavior does not rise to the level of allowing ABI breakage,
though. In addition, distributions and administrators can always limit access to
perf to the root user, though that obviously may still lead to unexplained
VM crashes, as Ahern noted. Molnar pointed out that the virtualization use case is a
much smaller piece of the pie, so making everyone else pay for a problem they
may never encounter just doesn't make sense. Either through a patch or a
revert, it would seem that the
"misbehavior" will disappear before 3.8 is released.
Compiler warnings can be life savers for kernel developers; often a
well-placed warning will help to avert a bug that, otherwise, could have
been painful to track down. But developers quickly tire of warnings that
appear when the relevant code is, in fact, correct. It does not take too
many spurious warnings to cause a developer to tune out compiler warnings
altogether. So developers will often try to suppress warnings for correct
code — a practice which can have undesirable effects in the longer term.
GCC will, when run with suitable options, emit a warning if it believes
that the value of a variable might be used before that variable is set.
This warning is based on the compiler's analysis of the paths through a
function; if it believes it can find a path where the variable is not
initialized, an "uninitialized variable" warning will result. The problem
is that the compiler is not always smart enough to know that a specific
path will never be taken. As a simple example, consider
uhid_hid_get_raw() in drivers/hid/uhid.c:
    size_t len;
    /* ... */
    return ret ? ret : len;
A look at the surrounding code makes it clear that, in the case where
ret is set to zero, the value of len has been set
accordingly. But the compiler is unable to figure that out and warns that
len might be used in an uninitialized state.
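The general shape of such code can be shown with a stand-alone sketch. This is not the uhid code; fetch_report() and read_report_len() are hypothetical stand-ins that merely follow the same pattern, where len is assigned on exactly the paths that leave ret at zero. Whether a given GCC version warns about it depends on the optimization level and on how much of the callee it can see:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: fills *len only when it succeeds (returns 0). */
static int fetch_report(const char *src, char *dst, size_t max, size_t *len)
{
    size_t n = strlen(src);

    if (n > max)
        return -1;           /* failure: *len is left untouched */
    memcpy(dst, src, n);
    *len = n;
    return 0;
}

/* Returns the report length on success, or a negative error code. */
long read_report_len(const char *src, char *dst, size_t max)
{
    size_t len;              /* deliberately not initialized */
    int ret;

    ret = fetch_report(src, dst, max, &len);
    /* len is read only when ret == 0, which implies that it was set. */
    return ret ? ret : (long)len;
}
```

A human reader can see that the final expression never reads len on the failure path; proving that requires knowing what fetch_report() guarantees, which is exactly the information the compiler may lack.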
The obvious response to such a warning is to simply change the declaration
of len so that the variable starts out initialized:
size_t len = 0;
Over the years, though, this practice has been discouraged on the kernel
mailing lists. The unneeded initialization results in larger code and a
(slightly) longer run time. And, besides, it is most irritating to be
pushed around by a compiler that is not smart enough to figure out that the
code is correct; Real Kernel Hackers don't put up with that kind of thing.
So, instead, a special macro was added to the kernel:
/* <linux/compiler-gcc.h> */
#define uninitialized_var(x) x = x
It is used in declarations in this manner:
    size_t uninitialized_var(len);
This macro has the effect of suppressing the warning, but it doesn't cause
any additional code to be generated by the compiler. This macro has proved
reasonably popular; a quick grep shows over 280 instances in the 3.7+
mainline repository. That popularity is not surprising: it allows a kernel
developer to turn off a spurious warning and to document the fact that the use of the
variable is, indeed, correct.
Unfortunately, there are a couple of problems with
uninitialized_var(). One is that, at the same time that it is
fooling GCC into thinking that the variable is initialized, it is also
fooling it into thinking that the variable is used. If the variable is
never referenced again, the compiler will still not issue an "unused
variable" warning. So, chances are, there are a number of excess variables
that have not been removed because nobody has noticed that they are not
actually used. That is a minor irritation, but one could easily decide
that it is tolerable if it were the only problem.
The other problem, of course, is that the compiler might just be right.
During the 3.7 merge window, a
patch was merged that moved some extended attribute handling code from
the tmpfs filesystem into common code. In the process of moving that code,
the developer noticed that one variable initialization could be removed,
since, it seemed, it would pick up a value in any actual path through the
function. GCC disagreed, issuing a warning, so, when this developer wrote a
second patch to remove the initialization, he also suppressed the
warning with uninitialized_var().
Unfortunately, GCC knew what it was talking about in this case; that code
picked up a bug where, in a specific set of circumstances, an uninitialized
value would be passed to kfree() with predictably pyrotechnic
results. That bug had to be tracked down by
other developers; it was fixed by David
Rientjes on October 17. At that time, Hugh Dickins commented that it was a good example of how
uninitialized_var() can go wrong.
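The failure mode is easy to reproduce in miniature. The sketch below is a user-space analogy rather than the actual extended attribute code, with malloc() and free() standing in for the kernel allocator and copy_name() being an invented function. It shows the initialization that keeps the early-exit path safe: freeing a NULL pointer is a harmless no-op, whereas freeing whatever happened to be on the stack is not.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical example: copy "name" into "dst" (of size "dstlen") by way
 * of a temporary buffer.  Returns 0 on success and -1 on failure.
 */
int copy_name(const char *name, char *dst, size_t dstlen)
{
    char *tmp = NULL;        /* without "= NULL", the early exit frees garbage */
    int ret = -1;

    if (!name)
        goto out;            /* tmp has not been allocated on this path */

    tmp = malloc(strlen(name) + 1);
    if (!tmp)
        goto out;
    strcpy(tmp, name);

    if (strlen(tmp) < dstlen) {
        strcpy(dst, tmp);
        ret = 0;
    }
out:
    free(tmp);               /* free(NULL) does nothing, as with kfree(NULL) */
    return ret;
}
```

Removing the initialization and silencing the resulting warning would leave the NULL-name path passing an indeterminate pointer to free(), which is the same kind of mistake described above.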
And, of course, this kind of problem need not be there from the outset.
The code for a given function might indeed be correct when
uninitialized_var() is employed to silence a warning. Future
changes could introduce a bug that the compiler would ordinarily warn
about, except that the warning will have been suppressed. So, in a sense,
every uninitialized_var() instance is a trap for the unwary.
That is why Linus threatened to remove it
later in October, calling it "an abomination" and saying:
The thing is moronic. The whole thing is almost entirely due to
compiler bugs (*stupid* gcc behavior), and we would have been
better off with an explicit (unnecessary) initialization that at
least doesn't cause random crashes etc if it turns out to be wrong.
In response, Ingo Molnar put together a
patch removing uninitialized_var() outright. Every use is
replaced with an actual initialization appropriate to the type of the
variable in question. A special comment
("/* GCC */") is added as well to make the
purpose of the initialization clear.
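The conversion performed by such a patch might look like the following in a small stand-alone case. The functions here (count_digits() and digits_or_error()) are invented for illustration and are not taken from the patch; only the "/* GCC */" comment convention comes from the description above.

```c
#include <stddef.h>

/* Hypothetical parser: stores the digit count in *len on success only. */
static int count_digits(const char *s, size_t *len)
{
    size_t n = 0;

    while (s[n] >= '0' && s[n] <= '9')
        n++;
    if (n == 0)
        return -1;           /* failure: *len is left untouched */
    *len = n;
    return 0;
}

/* Returns the number of leading digits in s, or -1 if there are none. */
long digits_or_error(const char *s)
{
    /* Before the conversion: size_t uninitialized_var(len); */
    size_t len = 0; /* GCC */
    int ret = count_digits(s, &len);

    return ret ? ret : (long)len;
}
```

The explicit zero costs an extra store at most, and if a later change ever does read len on the failure path, the result is a well-defined zero rather than stack garbage.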
The patch was generally well received and appears to be ready to go. In
October, Ingo said that he would keep it
out of linux-next (to avoid creating countless merge conflicts), but would
post it for merging right at the end of the 3.8 merge window. As of this
writing, that posting has not occurred, but there have been no signs that
the plans have changed. So, most likely, the 3.8 kernel will lack the
uninitialized_var() macro and developers will have to silence
warnings the old-fashioned (and obviously correct) way.
Page editor: Jonathan Corbet