Kernel development
Brief items
Kernel release status
The current development kernel is 3.3-rc3, released on February 8. "No big surprises, which is just how I like it. About a third of the patches are in ARM, but the bulk of that is due to the removal of the unused DMA map code from the bcmring support. So no complaints."
Stable updates: 3.0.21, 3.2.6 and 2.6.32.57 were released on February 13 with a long list of important fixes.
For those still using the 2.6.27 kernel, 2.6.27.60 was released on February 12, quickly followed by 2.6.27.61 to fix the inevitable build error. A lot of fixes have gone in since 2.6.27.59 was released last April.
Quotes of the week
-static inline void *kcalloc(size_t n, size_t size, gfp_t flags)
+static inline void *wtf_do_i_call_this(size_t n, size_t size, gfp_t flags)
 {
 	if (size != 0 && n > ULONG_MAX / size)
 		return NULL;
-	return __kmalloc(n * size, flags | __GFP_ZERO);
+	return __kmalloc(n * size, flags);
+}
+
+static inline void *kcalloc(size_t n, size_t size, gfp_t flags)
+{
+	return wtf_do_i_call_this(n, size, flags | __GFP_ZERO);
 }
Btrfs: The Swiss army knife of storage (;login:)
The February 2012 issue of ;login: has a detailed overview of Btrfs [PDF] written by developer Josef Bacik. "Btrfs’s snapshotting is simple to use and understand. The snapshots will show up as normal directories under the snapshotted directory, and you can cd into it and walk around like in a normal directory. By default, all snapshots are writeable in Btrfs, but you can create read-only snapshots if you so choose. Read-only snapshots are great if you are just going to take a snapshot for a backup and then delete it once the backup completes. Writeable snapshots are handy because you can do things such as snapshot your file system before performing a system update; if the update breaks your system, you can reboot into the snapshot and use it like your normal file system."
Lima driver code for the Mali GPU released
The Lima driver project has released the code for its open source graphics driver supporting the Mali-200 and Mali-400 GPUs. "The aim of this driver is to finally bring all the advantages of open source software to ARM SoC graphics drivers. Currently, the sole availability of binary drivers is increasing development and maintenance overhead, while also reducing portability, compatibility and limiting choice. Anyone who has dealt with GPU support on ARM, be it for a linux with a GNU stack, or for an android, knows the pain of dealing with these binaries. Lima is going to solve this for you, but some time is needed still to get there." (Thanks to Paul Wise.)
Kernel development news
ABS: Android and the kernel mainline
On the first day of this year's Android Builders Summit, a panel was held to discuss the Android patches to the Linux kernel, including the progress on getting them upstream. The panel consisted of Zach Pfeffer from Linaro, Tim Bird from Sony, Arnd Bergmann of IBM/Linaro, and Greg Kroah-Hartman of the Linux Foundation (LF), with LWN executive editor Jonathan Corbet moderating. The overall feeling from the panel was that things with the Android kernel patches were proceeding apace—recent kernels can boot into an Android user space—but there is still work to be done. While I could not attend ABS this year, this report comes via the magic of streaming video.
Each of the panelists introduced themselves and connected themselves to Android in various ways. Pfeffer is the Android platform lead for Linaro and is responsible for creating Android builds for each of the member companies, Bird represents Sony as the architecture chair of the LF Consumer Electronics working group, Bergmann has recently been working on the cleanup and consolidation of the ARM subtree, and Kroah-Hartman maintains the staging tree where most of the Android code currently lives.
Corbet kicked off the discussion by noting that he had recently looked at the current Red Hat Enterprise Linux (RHEL) kernel, which includes more than 7600 patches on top of the mainline kernel. He also pointed out that the Fedora kernel carries a number of different patches that aren't likely to go upstream anytime soon (utrace, in particular). Given that, "why are we here?", why is Android (and its patches) treated differently than other distributions' kernels, he asked.
Kroah-Hartman said that the Android patches are only about 7000 lines of code and that some UART drivers are larger. But people care more about the Android patches because device makers and others have to pull down all those patches and apply them to get a mainline kernel working with Android. It is different from enterprise kernels, he said, because there are no real downstream users of the source code of those. Bergmann also pointed out that much of the change in the RHEL or Fedora kernels was for things like drivers which don't change the operating system core as some of the Android patches do.
Pfeffer noted that in the past the kernel developers were seen as "the rebel alliance", but that now the Android developers have assumed that role to some extent. Bird pointed out that part of the problem is that due to the success of Android, the kernels for board support packages (BSPs) are being built with Android kernels, rather than kernel.org kernels. That essentially causes a split in the community.
The Android patches are largely in the mainline at this point (in staging), Kroah-Hartman said, except for the wakelocks code. The 3.3-rc3 kernel can boot an Android user space, but lacks the power management features that wakelocks provide, so battery life is poor. Bergmann said that he had seen a demo of Android running on a mainline kernel, and there is "still a long way to go".
One area that needs attention, Bergmann said, is the user-space interfaces of some of the Android features. Those may not be what the kernel developers want to support long-term, so they need to be addressed before the Android patches make their way out of staging and those interfaces become part of the kernel ABI.
The PMEM patches, which provide a means to reserve contiguous memory and to share buffers between the kernel and user space, were the next topic. Corbet noted that PMEM had been in and out of staging twice, but had never been merged. Since then, Android has moved on and is not using PMEM any longer. So, was it the right move not to merge PMEM, he asked.
The panel seemed in agreement that it was right not to merge those patches, with Pfeffer noting that they were simply an expedient to get products out the door. As ARM matures, he said, common usage models will come about, rather than various quick fixes. Kroah-Hartman pointed out that the memory management kernel developers told the PMEM developers "how to do it right", but that no one ever did that work, which is "a problem that we've had forever". Bird agreed with Pfeffer, saying that he is pushing to get things into the mainline, but that if there is pushback, "that's fine", as there are sometimes "quick and dirty" things done to ship products.
Corbet pursued that topic with a mention of the contiguous memory allocator (CMA) that is being pushed by Samsung and is aimed at Android. But he noted that Android itself has moved from PMEM to the ION memory allocator, which duplicates some of the CMA functionality while adding another user-space interface. What should be done about ION, he asked.
Android is not "pushing" ION, Pfeffer said, and if the kernel community doesn't want it, it shouldn't take it. There was a need for a solution that didn't exist in the mainline, so Android went ahead with ION. It may not necessarily be easy to work with all of the Android kernel patches, he said, because of the time pressures it is operating under. In the end, ION could be handled much like PMEM was, he said.
But Bergmann pointed out that ION could be integrated with CMA and the DMA buffer sharing work that are getting close to the mainline. ION and the soon-to-exist mainline solutions are not mutually exclusive, he said. Pfeffer said that the "entire room" doesn't really care which solution is chosen, they just want something that has been integrated and tested. That may be ION now, and something else down the road.
In a sense, this is a result of the longstanding conflict between the needs of the embedded world vs. those of the rest of kernel, Corbet posited. But Kroah-Hartman disagreed, saying that the enterprise distributions have the same problems, just on a different timescale. Those distributions need to ship, often well before the code is upstream. Embedded isn't really different, they just need to get their code upstream as well, he said.
One place where it is different, though, is that in the embedded world things may move from hardware to software at a relatively rapid pace, Pfeffer said. That means that a driver written for hardware JPEG decoding may really only be needed for one six-month product cycle, so that driver really doesn't need to go upstream. Because of that, Bergmann said, many of those drivers are designed to stay out of the mainline. Bird echoed that, saying that sometimes a "fatalistic approach" to kernel development is the pragmatic choice. Even for long-lived code, if there is no hope that it can go upstream, developers will just try to focus on making it maintainable out of tree.
For a long time, the trajectory of Android code was heading away from the mainline, Pfeffer said, but that has started to correct itself. The kernel needs more maintainers, though, so that code can get upstream faster, he said. Bergmann agreed that more maintainers would help, but the work needs to be done in a more organized fashion with an eye toward getting the user-space interfaces right.
Bird said that he doesn't expect there to be a need for another panel of this sort because the problem is solving itself at this point. Kroah-Hartman more or less agreed, noting that the Android problems are "nothing new" and that the kernel community has been solving these kinds of problems for 20 years.
With luck, the Android developers won't be adding much more to the core, Corbet said, but what about drivers? CyanogenMod is trying to get Ice Cream Sandwich (Android 4.0) in the hands of its users, but is running into problems with getting drivers from some vendors. What can be done to solve that problem?
Kroah-Hartman noted that things are getting better in that department, but that some companies care about getting their drivers into the mainline, while others don't. The latter don't realize that it will save them money in the long run. He is often talking with various companies, so if there are specific instances of problem drivers, he wants to know about them.
But kernel drivers are not the whole problem, Pfeffer said. In the graphics and video realms, the line between the kernel and user space has been "blown away", he said. There is now firmware, kernel interfaces, binary blobs, and user-space interfaces to do graphics which are all released in lockstep, and often all as binary blobs, which is not a good thing. It isn't an engineering issue, but a legal one, he said. Google's engineers do not want binary blobs, he said, and have been pushing back on vendors for things like the Nexus line of phones.
Bergmann also pointed to the open source driver for the Mali GPU as an indicator of the direction where things are headed. If the other graphics vendors don't get their act together with respect to free drivers, they will not survive, he said.
With that, the 45-minute slot had expired. The upshot seems to be that mainline kernel support for Android is moving along reasonably well. It won't be too long before it will be possible to run Android on a mainline kernel while still maintaining some reasonable battery life. Beyond that, though, the process is working more or less as it should. The out-of-tree Android patches were just another in a long line of hurdles that the kernel community has overcome.
Trusting the hardware too much
Anybody who does low-level kernel programming for long enough learns that the hardware is not their friend. Expecting the hardware to be nice is a recipe for disaster; instead, one must treat the hardware as if it were a clever and willful dog. With some effort, it can be made to perform impressive tricks, but, given a moment of inattention, it will snag your dinner from the grill and hide under the couch. The good news is that Linux kernel developers understand the nature of their relationship with the hardware and take great care not to trust it too far. Or, at least, that is what we would like to think.

Consider this snippet of code from drivers/char/hpet.c:
	do {
		m = read_counter(&hpet->hpet_mc);
		write_counter(t + m + hpetp->hp_delta, &timer->hpet_compare);
	} while (i++, (m - start) < count);
Here, read_counter() is a thin macro wrapper around readl(). The driver is writing to the timer compare register in a loop, assuming that the "main counter" value read from the HPET will eventually exceed the threshold value. Almost always, that is exactly what happens. But if the HPET ever goes a little bit weird and stops returning something meaningful when the main counter is read, the above code could easily turn into an infinite loop. The kernel is trusting the hardware to be rational, but the hardware may not choose to live up to that expectation.
"Usbmuxd" is a daemon which facilitates communications with various Apple iDevices. Recently, this patch to usbmuxd was recognized to be a security fix for a bug eventually designated as CVE-2012-0065. In short, this daemon would read a serial number string from the device and copy it into an internal array without checking its length. Exploiting this vulnerability is not easy - it requires the ability to plug in a USB device that has been designed to overflow that particular buffer with something interesting. But it is a vulnerability, and it is worth noting that an increasing number of USB devices are really just Linux systems using the "USB gadget" code; creating that malicious device would not be hard to do. So this vulnerability could be interesting to the "leave a malicious USB stick in the parking lot" school of attacker.
This bug, too, is the result of trusting the hardware. As seen here, the hardware could be overtly evil. In other cases, it is just subject to electrical wear, power spikes, cosmic rays, and the varying skills of those who write the firmware - closed source which is never reviewed by anybody. Even in a world where price pressures didn't mandate that each component must cost as little as possible, hardware bugs would be a problem.
By now, the lesson should be clear: driver developers should always regard their hardware with extreme suspicion and take nothing for granted. The problem is that even highly diligent developers (and reviewers) can easily let this kind of bug slip by. In almost all cases, the driver appears to work just fine without the extra sanity checks; the hardware plays along most of the time, after all, until that especially inopportune moment arrives. Sometimes the developer sees the resulting failure, resulting in that "oh, yeah, I have to make sure that the hardware doesn't flake there" moment that is discouragingly common in driver development. Other times, some far away user sees strange problems and nobody really knows why.
What would be nice would be a way for the computer to tell developers when they are being overly trusting of the hardware; then it might be possible to skip the "tracking down the weird problem" experience. As it happens, such a way exists in the form of a static analysis tool called Carburizer, developed by Asim Kadav, Matthew J. Renzelmann and Michael M. Swift. Those wanting a lot of information about this tool can find it in this one-page poster [PDF], this ACM Symposium on Operating Systems Principles (SOSP) paper [PDF], or in this rather over-the-top web site.
In short: Carburizer analyzes kernel code, looking for insufficiently robust dealings with the hardware. Its key strength at the moment appears to be the identification of possible infinite loops - loops whose exit condition depends solely on a value obtained from the hardware. There are, it seems, over 1000 such loops in the 3.2.1 kernel. The tool also looks for cases where unchecked values from hardware are used to index arrays or are used directly as pointers, though the false-positive rate seems to be higher for these checks. The result is an output file as linked above, from which developers can go and investigate.
Naturally enough, the tool shows some signs of its academic origins. It is written in Ocaml and requires some modifications to the kernel source tree before it can be run. Carburizer also requires that multi-file drivers be merged into one big file, with the result that the line numbers in the resulting diagnostics do not correspond to the source tree everybody else has. That may be part of why, despite a positive response to a posting of the tool on kernel-janitors in January, 2011, little in the way of actual fixes seems to have resulted. Or it may just be that, so far, these results have only been seen by a relatively small group of developers.
Interestingly, Carburizer can propose fixes of its own. These include putting time limits into potentially infinite loops and adding bounds checks to suspect array references. While it is at it, Carburizer fixes up seemingly unnecessary panic() calls and adds logging code to places where, it thinks, the driver neglects to report a hardware failure. With a separate runtime module, it can even deal with stuck interrupts (the driver is forced into a polling mode) and more. The resulting code has not been posted for consideration, which is not entirely surprising; the fixes are, necessarily, of a highly conservative "don't break the driver" nature. Such fixes are almost certain not to be what a human would write after looking at the code. But the tool is open source, so interested developers can run it themselves to see what it would do.
Meanwhile, even without automatic fixes, these results seem worthy of some attention. Computers can be far better than humans at finding many classes of bugs; when computers have been used in that role, some types of bugs have nearly disappeared from the kernel. Maybe someday we'll have a version of Carburizer that can be folded into a tool like checkpatch; for now, though, we'll have to look at its complaints about our code separately and decide what action is needed.
Linux support for ARM big.LITTLE
ARM Ltd recently announced the big.LITTLE architecture, a twist on the SMP systems that we've all grown accustomed to. Instead of putting a set of identical CPU cores together in a system, the big.LITTLE architecture pushes the concept further by pulling two different SMP systems together: one being a set of "big" and fast processors, the other consisting of "little" and power-efficient processors.
In practice this means having a cluster of Cortex-A15 cores, a cluster of Cortex-A7 cores, and ensuring cache coherency between them. The advantage of such an arrangement is that it allows for significant power saving when processes that don't require the full performance of the Cortex-A15 are executed on the Cortex-A7 instead. This way, non-interactive background operation, or streaming multimedia decoding, can be run on the A7 cluster for power efficiency, while sudden screen refreshes and similar bursty operations can be run on the A15 cluster to improve responsiveness and interactivity.
Then, how to support this in Linux? This is not as trivial as it may seem initially. Let's suppose we have a system comprising a cluster of four A15 cores and a cluster of four A7 cores. The naive approach would suggest making the eight cores visible to the kernel and letting the scheduler do its job just like with any other SMP system. But here's the catch: SMP means Symmetric Multi-Processing, and in the big.LITTLE case the cores aren't symmetric between clusters.
The Linux scheduler expects all available CPUs to have the same performance characteristics. For example, there are provisions in the scheduler to deal with things like hyperthreading, but this is still an attribute which is normally available on all CPUs in a given system. Here we're purposely putting together a couple of CPUs with significant performance/power characteristic discrepancies in the same system, and we expect the kernel to make the optimal usage of them at all times, considering that we want to get the best user experience together with the lowest possible battery consumption.
So, what should be done? Many questions come to mind:
- Is it OK to reserve the A15 cluster just for interactive tasks and the A7 cluster for background tasks?
- What if the interactive tasks are sufficiently light to be processed by the small cores at all times?
- What about those background tasks that the user interface is actually waiting for?
- How to determine if a task using 100% CPU on a small core should be migrated to a fast core instead, or left on the small core because it is not critical enough to justify the increased power usage?
- Should the scheduler auto-tune its behavior, or should user-space policies influence it?
- If the latter, what would the interface look like to be useful and sufficiently future-proof?
Linaro started an initiative during the most recent Linaro Connect to investigate this problem. It will require a high degree of collaboration with the upstream scheduler maintainers and a good amount of discussion. And given past history, we know that scheduler changes cannot happen overnight... unless your name is Ingo that is. Therefore, it is safe to assume that this will take a significant amount of time.
Silicon vendors and portable device makers are not going to wait though. Chips implementing the big.LITTLE architecture will appear on the market in one form or another, way before a full heterogeneous multi-processor aware scheduler is available. An interim solution is therefore needed soon. So let's put aside the scheduler for the time being.
ARM Ltd has produced a prototype software solution consisting of a small hypervisor using the virtualization extensions of the Cortex-A15 and Cortex-A7 to make both clusters appear to the underlying operating system as if there was only one Cortex-A15 cluster. Because the cores within a given cluster are still symmetric, all the assumptions built into the current scheduler still hold. With a single call, the hypervisor can atomically suspend execution of the whole system, migrate the CPU states from one cluster to the other, and resume system execution on the other cluster without the underlying operating system being aware of the change; just as if nothing had happened.
Taking the example above, Linux would see only four Cortex-A15 CPUs at all times. When a switch is initiated, the registers for each of the 4 CPUs in cluster A are transferred to corresponding CPUs in cluster B, interrupts are rerouted to the CPUs in cluster B, then CPUs in cluster B are resumed exactly where cluster A was interrupted, and, finally, the CPUs in cluster A are powered off. And vice versa for switching back to the original cluster. Therefore, if there are eight CPU cores in the system, only four of them are visible to the operating system at all times. The only visible difference is the observable execution speed, and of course the corresponding change in power consumption when a cluster switch occurs. Some latency is implied by the actual switch of course, but that should be very small and imperceptible by the user.
This solution has advantages such as providing a mechanism which should work for any operating system targeting a Cortex-A15 without modifications to that operating system. It is therefore OS-independent and easy to integrate. However, it brings a certain level of complexity such as the need to virtualize all the differences between the A15 and the A7. While those CPU cores are functionally equivalent, they may differ in implementation details such as cache topology. That would force every cache maintenance operation to be trapped by the hypervisor and translated into equivalent operations on the actual CPU core when the running core is not the one that the operating system thinks is running.
Another disadvantage is the overhead of saving and restoring the full CPU state because, by virtue of being OS-independent, the hypervisor code may not know what part of the CPU is actually being actively used by the OS. The hypervisor could trap everything to be able to know what is being touched allowing partial context transfers, but that would be yet more complexity for a dubious gain. After all, the kernel already knows what is being used in the CPU, and it can deal with differing cache topologies natively, etc. So why not implement this switcher support directly in the kernel given that we can modify Linux and do better?
In fact, that's exactly what we are doing: taking the ARM Ltd BSD-licensed switcher code and using it as a reference to put the switcher functionality directly in the kernel. This way, we can get away with much less support from the hypervisor code and improve switching performance by not having to trap any cache maintenance instructions, by limiting the CPU context transfer to the minimum set of active registers, and by sharing the same address space with the kernel.
We can implement this switcher by modeling its functionality as a CPU speed change, and therefore expose it via a cpufreq driver. This way, contrary to the reference code from ARM Ltd which is limited to a whole cluster switch, we can easily pair each of the A15 cores with one of the A7 cores, and have each of those CPU pairs appear as a single pseudo CPU with the ability to change its performance level via cpufreq. And because the cpufreq governors are already available and understood by existing distributions, including Android, we therefore have a straightforward solution with a fast time-to-market for the big.LITTLE architecture that shouldn't cause any controversy.
Obviously the "switcher" as we call it is not replacing the ultimate goal of exposing all the cores to the kernel and letting the scheduler make the right decisions. But it is nevertheless a nice self-contained interim solution that will allow pretty good usage of the big.LITTLE architecture while removing the pressure to come up with scheduler changes quickly.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet