Kernel development [LWN.net]

Kernel release status

The 3.8 merge window is still open and patches continue to flow into the mainline repository. See the separate article below for a summary of significant changes for 3.8.

Stable updates: 3.0.57, 3.4.24, 3.6.11 and 3.7.1 were all released on December 17. Note that 3.6.11 is the last planned 3.6 update.

Quotes of the week

Those who develop kernels for Android devices know how frustrating porting a kernel to a new device has always been. Well if you share that notion and would like this process to get easier than it is right now, you will be pleased to know that Linus Torvalds has announced ARM support in Linux.

— Android Authority has a less-than-authoritative moment.

So the math is confused, the types are confused, and the naming is confused. Please, somebody check this out, because now *I* am confused.

— Linus Torvalds

3.8 Merge window part 2

By Jonathan Corbet
December 19, 2012

Linus has been busy in the last week; as of this writing, some 6200 changesets have been pulled into the mainline repository since last week's summary. As a result, just over 10,000 changes have been merged overall, making 3.8 the busiest merge window ever and the first to exceed 10,000 patches. And the merging process is not done yet.

Quite a few significant changes have been merged. Among other things, we have seen a decision made on how the development of better NUMA balancing will proceed. Without further ado, the most significant user-visible changes merged in the last week include:

The disagreement over how the kernel's NUMA performance problems should be addressed was partially resolved when Ingo Molnar agreed that Mel Gorman's "balancenuma" patch set should be merged as a base for future development. Balancenuma is intended to get the fundamental infrastructure in place to allow experimentation with placement and migration policies; it adds little in the way of such policies itself. That base code has been merged for 3.8; expect policy-oriented code to be pushed for the 3.9 development cycle.
The huge zero page feature has been merged, greatly reducing memory usage for some use cases.
The kernel memory usage accounting infrastructure has been merged, allowing the placement of limitations on kernel memory use by any specific control group. See the updated Documentation/cgroups/memory.txt file for details on how to use this feature.
The inline data patch set has been merged into the ext4 filesystem. Ext4 can now store data for small files directly in the inode, improving performance and space efficiency. Ext4 also now supports the SEEK_HOLE and SEEK_DATA lseek() operations.
The Btrfs filesystem has a new "replace" operation to allow the efficient replacement of a single drive in a volume.
The tmpfs filesystem now supports the SEEK_HOLE and SEEK_DATA lseek() operations.
The user namespace completion patch set has been pulled. Eric Biederman says: "This set of changes adds support for unprivileged users to create user namespaces and as a user namespace root to create other namespaces. The tyranny of supporting suid root preventing unprivileged users from using cool new kernel features is broken."
The new system call:
```
    int finit_module(int fd, const char *args, int flags);
```
can be used to load a kernel module from the given file descriptor. This call was added by the ChromeOS developers so that they can accept or reject a module depending on where it is stored in the filesystem.
The batman-adv mesh networking subsystem has gained distributed ARP table support.
The tun/tap network driver and the virtio net driver both now support multiple queues per device.
The QFQ packet scheduler has been upgraded to "QFQ+", which is said to be faster and more capable; see this paper [PDF] for details.
The s390 architecture has gained support for attached PCI buses.
UEFI boot-time variables are now accessible via the new "efivars" virtual filesystem.
The ptrace() system call has a new option flag, PTRACE_O_EXITKILL, which causes all traced processes to receive a SIGKILL signal if the tracing process exits unexpectedly.
New hardware support includes:
- Audio: Wolfson Microelectronics WM8766 and WM8776 codecs, Philips PSC724 Ultimate Edge sound cards, Freescale / iVeia P1022 RDK boards, Maxim max98090 codecs, and Silicon Laboratories 476x AM/FM radio chips.
- Block: LSI MPT Fusion SAS 3.0 host adapters, and Chelsio T4-based 10Gb adapters (FCoE offload support).
- Graphics: NVIDIA Tegra20 display controllers and HDMI outputs.
- Input: ION iCade arcade controllers, Wolfson Microelectronics "Arizona" haptics controllers, Roccat Lua gaming mice, TI ADC/touchscreen controllers, and Dialog Semiconductor DA9055 ONKEY controllers. The kernel has also gained support for human input devices connected via i²c as described in this document downloadable from Microsoft.
- Miscellaneous: TI TPS51632 power regulators, TI TPS80031/TPS80032 power regulators, Versatile Express power regulators, Versatile Express hardware monitoring controllers, Maxim MAX8973 voltage regulators, Dialog Semiconductor DA9055 regulators, NXP Semiconductor PCF8523 realtime clocks (RTCs), Dialog Semiconductor DA9055 RTCs, CLPS711X host SPI controllers, Nvidia Tegra20/Tegra30 SLINK controllers, Nvidia Tegra20 serial flash controllers, Nokia RX-51 (N900) battery controllers, Solomon SSD1307 OLED controllers, Nano River Technologies Viperboard multifunction controllers, Nokia "Retu" multifunction controllers, AMS AS3711 power management chips, and Nokia CBUS-attached devices.
- Network: CDC mobile broadband interface model USB-attached adapters, Atheros AR5523-based wireless adapters, Realtek RTL8723AE wireless adapters, Aeroflex Gaisler GRCAN and GRHCAN CAN controllers, and Kvaser CAN/USB interfaces.
- Video4Linux: Samsung S3C24XX/S3C64XX SoC camera interfaces (full-memory write access not required).
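
The SEEK_DATA and SEEK_HOLE operations mentioned in the ext4 and tmpfs items above let a program skip over the unallocated regions of a sparse file. A minimal sketch of how user space might walk a file's data regions follows; count_data_bytes() and count_data_bytes_path() are hypothetical helper names, and the code assumes a C library that exposes the two constants when _GNU_SOURCE is defined.

```c
#define _GNU_SOURCE          /* exposes SEEK_DATA and SEEK_HOLE in <unistd.h> */
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: walk the data regions of an open file and return
 * the number of bytes that are not in holes, or -1 on error.
 * Note that this moves the file offset of fd.
 */
off_t count_data_bytes(int fd)
{
    off_t total = 0;
    off_t pos = 0;

    for (;;) {
        /* Find the start of the next data region at or after pos. */
        off_t data = lseek(fd, pos, SEEK_DATA);
        if (data == (off_t)-1) {
            /* ENXIO: no data between pos and the end of the file. */
            return errno == ENXIO ? total : (off_t)-1;
        }

        /* Find where that region ends; end of file counts as a hole. */
        off_t hole = lseek(fd, data, SEEK_HOLE);
        if (hole == (off_t)-1)
            return (off_t)-1;

        total += hole - data;
        pos = hole;
    }
}

/* Hypothetical convenience wrapper taking a path instead of a descriptor. */
off_t count_data_bytes_path(const char *path)
{
    int fd = open(path, O_RDONLY);
    off_t result;

    if (fd < 0)
        return (off_t)-1;
    result = count_data_bytes(fd);
    close(fd);
    return result;
}
```

On a filesystem that does not track holes, lseek() may simply report the whole file as data followed by a hole at the end of the file, so the helper should still return a sensible (if less interesting) answer there.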

In contrast with the large number of new features, the number of significant internal changes has been relatively small. Changes visible to kernel developers include:

The Video4Linux2 layer now supports the use of shared DMA buffers for frame I/O. See the DocBook documentation for details on how to use this feature. Also: the videobuf2 subsystem now supports the use of scatterlists with user-space buffers in the "contiguous" DMA mode.
The input subsystem supports the use of "managed" devices via the new devm_input_allocate_device() function.

One feature that has not been merged is RAID5/6 support for the Btrfs filesystem. Those patches are being prepared for the mainline, though, and can be expected in the 3.9 cycle. Meanwhile, the merge window could stay open until as late as December 24, though Linus has threatened to close it early. The final changes to be merged for 3.8 will be summarized once that closure has happened.

Virtualization and the perf ABI

By Jake Edge
December 19, 2012

Breaking the application binary interface (ABI) between the kernel and user space is a well-known taboo for Linux. That line may seem a little blurrier to some when it comes to the ABI for tools like perf that ship with the kernel. As a recent discussion on the linux-kernel mailing list shows, though, Linus Torvalds and others still have that line in sharp focus.

The issue stems from what appears to be a fairly serious bug in some x86 processors. Back in July, David Ahern reported that KVM-based virtual machines would crash when recording certain events on the host. On some x86 processors, the "Precise Events Based Sampling" (PEBS) mechanism can be used to gather precise counts of events like CPU cycles. Unfortunately, PEBS and hardware virtualization don't play nicely together.

As Ahern reported, running:

    perf record -e cycles:p -ag -- sleep 10

on the host would reliably crash all of the guests. That particular command will record the events specified, CPU cycles in this case, to a file; more information about perf can be found here. It turns out that PEBS incorrectly treats the address of the Debug Store (DS) area, where it writes its records, as a guest address rather than as a host address. That leads to memory corruption in the guest, which will crash all of the virtual machines on the system. The ":p" (precise) attribute on the cycles event (which can be repeated for higher precision levels as in cycles:pp) asks for more precise measurements, which leads to PEBS being used. Without that attribute, the cycle counts measured are less accurate, but do not cause the VM crashes.

That problem led Peter Zijlstra to change perf_event.c in the kernel to disallow precise measurements unless guest measurement has been specifically excluded. Using the ":H" (host-only) attribute will still allow precise measurements as perf will set the exclude_guest flag on the event. That flag will inhibit PEBS activity while in the guest. In addition, Ahern changed perf so that exclude_guest would be automatically selected if the "precise" attribute was set. There's just one problem with those solutions: existing perf binaries do not set exclude_guest, so users would get an EOPNOTSUPP error.
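
For readers unfamiliar with the interface, the flags under discussion live in struct perf_event_attr, which user space fills in and hands to the perf_event_open() system call. The following is a rough sketch, not code from perf itself, of how a tool might describe a "cycles:pp" event; precise_cycles_attr() is a hypothetical helper, and the sketch only builds the structure without opening the event.

```c
#include <linux/perf_event.h>
#include <string.h>

/*
 * Hypothetical helper: describe a hardware CPU-cycles event.
 * precise_level is the number of 'p' modifiers (cycles:p is 1,
 * cycles:pp is 2); host_only mirrors the 'H' modifier.
 */
struct perf_event_attr precise_cycles_attr(unsigned int precise_level,
                                           int host_only)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;

    /* precise_ip is a two-bit field, so clamp the request to 3. */
    attr.precise_ip = precise_level > 3 ? 3 : precise_level;

    /* Setting exclude_guest asks the kernel not to count in guests. */
    attr.exclude_guest = host_only ? 1 : 0;

    return attr;
}
```

Calling precise_cycles_attr(2, 1) corresponds roughly to "cycles:ppH". A binary built before Ahern's change would leave exclude_guest at zero, which is the combination that the kernel change started rejecting.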

It turns out that one of those existing users is Torvalds, who complained that:

    perf record -e cycles:pp

no longer worked for him. Ahern suggested using "cycles:ppH", but that elicited an annoyed response from Torvalds. Why should he have to add a new flag to deal with virtualization, when he isn't running it? "That whole 'exclude_guest' test is insane when there isn't any virtualization going on."

Ahern countered that it's worse to have VMs explode because someone runs a precise perf. But that's beside the point, as Torvalds pointed out:

You broke the WORKING case for old binaries in order to give an error return in a case that NEVER EVEN WORKED with those binaries. Don't you see how insane that is?

The 'H' flag is totally the wrong way around. Exactly because it only "fixes" a case that was already working, and makes a case that never worked anyway now return an error value. That's not sane. Since the old broken case never worked, nobody can have depended on it. See why I'm saying that it's the people who use virtualization who should be forced to use the new flag, not the other way around?

Forcing existing perf binary users to change their habits is the crux of the matter. Beyond breaking the ABI, which is clearly not allowed, it makes perf break for real users as Ingo Molnar said: "Old, working binaries are actually our _most_ important usecase: it's 99.9% of our current installed base ...". While it is certainly a problem that older kernels can have all their guests crashed with a simple command, the proper solution is not to require either upgrading perf or changing the flags (which could well be buried in scripts or other automation).

Existing perf binaries set the exclude_guest flag to zero, while binaries that have Ahern's change set it to one. That means newer kernels that seek to fix the crashing guest bug cannot rely on a particular value for that flag. The "proper" way to have handled the problem is to use a new include_guest flag (or similar), which defaults to zero. Older binaries cannot change that flag (since they don't know about it), so the kernel code can use it to exclude the precise flag for guests on x86 systems. Other architectures may not suffer from the same restriction.

Beyond that, Torvalds argues that if the user asks for a precise measurement but doesn't specify either the "H" or "G" (include guests) attribute, the code should try to do the right thing. That means it should measure both the host and guests on systems that support it, while backing off to just the host for x86. Meanwhile it could return EOPNOTSUPP if the user explicitly asks for a broken combination (e.g. precise and include guests on x86). Molnar concurred. Ahern seemed a bit unhappy about things, but said that he would start working on a patch; as of this writing, that patch has not yet appeared.
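
The behavior Torvalds describes can be written down as a small decision function. This is only a sketch of the policy as summarized above, not kernel code; every name in it (guest_scope, precise_decision, resolve_precise(), precise_safe_in_guest) is made up for illustration.

```c
#include <errno.h>
#include <stdbool.h>

/* What the user asked for: nothing, "H" (host only) or "G" (include guests). */
enum guest_scope { SCOPE_DEFAULT, SCOPE_HOST_ONLY, SCOPE_INCLUDE_GUEST };

struct precise_decision {
    int err;             /* 0 on success, -EOPNOTSUPP for a broken request */
    bool measure_guest;  /* whether guest execution would be sampled */
};

/*
 * Decide how to handle a precise event. precise_safe_in_guest would be
 * false on the affected x86 systems and true where the combination works.
 */
struct precise_decision resolve_precise(enum guest_scope requested,
                                        bool precise_safe_in_guest)
{
    struct precise_decision d = { 0, false };

    switch (requested) {
    case SCOPE_HOST_ONLY:
        /* Explicit "H": always fine, the guest is never sampled. */
        break;
    case SCOPE_INCLUDE_GUEST:
        /* Explicit "G": refuse if the hardware cannot do it safely. */
        if (precise_safe_in_guest)
            d.measure_guest = true;
        else
            d.err = -EOPNOTSUPP;
        break;
    case SCOPE_DEFAULT:
        /* No preference: measure guests where safe, else back off to host. */
        d.measure_guest = precise_safe_in_guest;
        break;
    }
    return d;
}
```

The point of the default case is that an old binary, which expresses no preference, keeps working everywhere; only an explicit request for the unsafe combination produces an error.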

It is worth noting that Torvalds admitted that he could trivially recompile perf to get around the whole problem; it was a principle that he was standing up for. Even though some tools like perf are distributed with the kernel tree, that does not relax the "no regressions" rule. Some critics of the move to add tools to the kernel tree were concerned that it would facilitate ABI changes that could be glossed over by requiring that the tools and kernel be kept in sync. This discussion clearly shows that not to be the case.

Having a way to crash all the VMs on a system is clearly undesirable, but as Torvalds pointed out, that had been true for quite some time. Undesirable behavior does not rise to the level of allowing ABI breakage, however. In addition, distributions and administrators can always limit access to perf to the root user—though that obviously may still lead to unexplained VM crashes as Ahern noted. Molnar pointed out that the virtualization use case is a much smaller piece of the pie, so making everyone else pay for a problem they may never encounter just doesn't make sense. Either through a patch or a revert, it would seem that the "misbehavior" will disappear before 3.8 is released.

Removing uninitialized_var()

By Jonathan Corbet
December 19, 2012

Compiler warnings can be life savers for kernel developers; often a well-placed warning will help to avert a bug that, otherwise, could have been painful to track down. But developers quickly tire of warnings that appear when the relevant code is, in fact, correct. It does not take too many spurious warnings to cause a developer to tune out compiler warnings altogether. So developers will often try to suppress warnings for correct code — a practice which can have undesirable effects in the longer term.

GCC will, when run with suitable options, emit a warning if it believes that the value of a variable might be used before that variable is set. This warning is based on the compiler's analysis of the paths through a function; if it believes it can find a path where the variable is not initialized, an "uninitialized variable" warning will result. The problem is that the compiler is not always smart enough to know that a specific path will never be taken. As a simple example, consider uhid_hid_get_raw() in drivers/hid/uhid.c:

    size_t len;
    /* ... */
    return ret ? ret : len;

A look at the surrounding code makes it clear that, in the case where ret is set to zero, the value of len has been set accordingly. But the compiler is unable to figure that out and warns that len might be used in an uninitialized state.

The obvious response to such a warning is to simply change the declaration of len so that the variable starts out initialized:

    size_t len = 0;

Over the years, though, this practice has been discouraged on the kernel mailing lists. The unneeded initialization results in larger code and a (slightly) longer run time. And, besides, it is most irritating to be pushed around by a compiler that is not smart enough to figure out that the code is correct; Real Kernel Hackers don't put up with that kind of thing. So, instead, a special macro was added to the kernel:

    /* <linux/compiler-gcc.h> */
    #define uninitialized_var(x) x = x

It is used in declarations in this manner:

    size_t uninitialized_var(len);

This macro has the effect of suppressing the warning, but it doesn't cause any additional code to be generated by the compiler. This macro has proved reasonably popular; a quick grep shows over 280 instances in the 3.7+ mainline repository. That popularity is not surprising: it allows a kernel developer to turn off a spurious warning and to document the fact that the use of the variable is, indeed, correct.

Unfortunately, there are a couple of problems with uninitialized_var(). One is that, at the same time that it is fooling GCC into thinking that the variable is initialized, it is also fooling it into thinking that the variable is used. If the variable is never referenced again, the compiler will still not issue an "unused variable" warning. So, chances are, there are a number of excess variables that have not been removed because nobody has noticed that they are not actually used. That is a minor irritation, but one could easily decide that it is tolerable if it were the only problem.

The other problem, of course, is that the compiler might just be right. During the 3.7 merge window, a patch was merged that moved some extended attribute handling code from the tmpfs filesystem into common code. In the process of moving that code, the developer noticed that one variable initialization could be removed, since, it seemed, it would pick up a value in any actual path through the function. GCC disagreed, issuing a warning, so, when this developer wrote a second patch to remove the initialization, he also suppressed the warning with uninitialized_var(). Unfortunately, GCC knew what it was talking about in this case; that code had just picked up a bug where, in a specific set of circumstances, an uninitialized value would be passed to kfree() with predictably pyrotechnic results. That bug had to be tracked down by other developers; it was fixed by David Rientjes on October 17. At that time, Hugh Dickins commented that it was a good example of how uninitialized_var() can go wrong.

And, of course, this kind of problem need not be there from the outset. The code for a given function might indeed be correct when uninitialized_var() is employed to silence a warning. Future changes could introduce a bug that the compiler would ordinarily warn about, except that the warning will have been suppressed. So, in a sense, every uninitialized_var() instance is a trap for the unwary.

That is why Linus threatened to remove it later in October, calling it "an abomination" and saying:

The thing is moronic. The whole thing is almost entirely due to compiler bugs (*stupid* gcc behavior), and we would have been better off with an explicit (unnecessary) initialization that at least doesn't cause random crashes etc if it turns out to be wrong.

In response, Ingo Molnar put together a patch removing uninitialized_var() outright. Every use is replaced with an actual initialization appropriate to the type of the variable in question. A special comment ("/* GCC */") is added as well to make the purpose of the initialization clear.
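
Applied to the uhid example above, the replacement would look something like the following sketch. fetch_len() and get_raw() are hypothetical stand-ins for the real driver code, and zero is assumed to be the "appropriate" initial value for a size_t.

```c
#include <stddef.h>

/* Hypothetical stand-in for the code that fills in len on success. */
static int fetch_len(int fail, size_t *len)
{
    if (fail)
        return -1;      /* error path: *len is left untouched */
    *len = 42;
    return 0;
}

long get_raw(int fail)
{
    /* Previously: size_t uninitialized_var(len); */
    size_t len = 0; /* GCC */
    int ret;

    ret = fetch_len(fail, &len);

    /* len is only meaningful when ret is zero, as in the original code. */
    return ret ? ret : (long)len;
}
```

If a later change breaks the assumption that len is set whenever ret is zero, this version returns a harmless zero rather than whatever happened to be on the stack.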

The patch was generally well received and appears to be ready to go. In October, Ingo said that he would keep it out of linux-next (to avoid creating countless merge conflicts), but would post it for merging right at the end of the 3.8 merge window. As of this writing, that posting has not occurred, but there have been no signs that the plans have changed. So, most likely, the 3.8 kernel will lack the uninitialized_var() macro and developers will have to silence warnings the old-fashioned (and obviously correct) way.

Patches and updates

- Greg KH: Linux 3.7.1
- Greg KH: Linux 3.6.11
- Greg KH: Linux 3.4.24
- Greg KH: Linux 3.0.57
- Yinghai Lu: x86, boot, 64bit: Add support for loading ramdisk and bzImage above 4G
- stefani@seibold.net: Add 32 bit VDSO time function support
- Varun Sethi: iommu/fsl: Freescale PAMU driver and IOMMU API implementation.
- Marc Zyngier: KVM on arm64
- Catalin Marinas: arm64: ARMv8 RTSM model SoC support
- rob@landley.net: build 3.7 kernel without perl
- Ulf Magnusson: [ANNOUNCE] Kconfiglib: a flexible Python Kconfig parser and library - now on GitHub
- Matthias Kohler: Multiple run-queues for BFS
- Con Kolivas: 3.7-ck1, BFS 426 for linux-3.7
- Sasha Levin: userns: use new hashtable implementation
- Alexander Gordeev: IRQ-bound performance events
- Namhyung Kim: perf report: Add support for event group view (v7)
- Terje Bergstrom: Support for Tegra 2D hardware
- Naveen Krishna Chatradhi: i2c: Implement generic gpio based bus arbitration
- Roland Stigge: gpio: Add block GPIO
- Fabio Baltieri: tx/rx LED trigger support
- Daniel Jeong: regulator: new driver for LP8755
- Sjur Brændeland: remoteproc: Support bi-directional vdev config space
- Toshi Kani: Hot-plug and Online/Offline framework
- Arto Meriläinen: NVIDIA Tegra support
- Rob Clark: drm/lcdc: add TI LCD Controller DRM driver
- Tomi Valkeinen: Common Display Framework-T
- Christopher Heiny: input: Synaptics RMI4 Touchscreen Driver
- Boris BREZILLON: pwm: atmel: add Timer Counter Block PWM driver
- Alan Cox: goldfish: base support
- Hans Verkuil: RFCv2: Second draft of guidelines for submitting patches to linux-media
- Darrick J. Wong: [PATCH v2.3 0/3] mm/fs: Implement faster stable page writes on filesystems
- Maxim V. Patlasov: fuse: process direct IO asynchronously
- Tejun Heo: block: implement blkcg hierarchy support in cfq
- Eric Wong: fadvise: perform WILLNEED readahead in a workqueue
- clinew@linux.vnet.ibm.com: [RFC v2] Btrfs: Subpagesize blocksize (WIP).
- Minchan Kim: Support volatile for anonymous range
- Vlad Yasevich: Add basic VLAN support to bridges
- Mimi Zohar: ima: enforcing appraise type
- Casey Schaufler: LSM: Multiple concurrent LSMs
- Paolo Bonzini: Multiqueue virtio-scsi, and API for piecewise buffer submission
- Luiz Capitulino: auto-ballooning prototype (guest part)
- Karel Zak: util-linux v2.22.2
