
Kernel development

Brief items

Kernel release status

The current development kernel is 3.10-rc6, which was released on June 15. In the announcement, Linus Torvalds noted that the patch rate (226 changes since -rc5) seems to be slowing a little bit. "But even if you're a luddite, and haven't yet learnt the guilty pleasures of a git workflow, you do want to run the latest kernel, I'm sure. So go out and test that you can't find any regressions. Because we have fixes all over..."

Stable updates: The 3.9.6, 3.4.49, and 3.0.82 stable kernels were released by Greg Kroah-Hartman on June 13. The 3.2.47 stable kernel was released by Ben Hutchings on June 19.

The 3.9.7, 3.4.50, and 3.0.83 kernels are in the review process and can be expected on June 20 or shortly thereafter.

Comments (none posted)

Quotes of the week

OK, I haven't found a issue here yet, but youss are being trickssy! We don't like trickssy, and we must find precccciouss!!!

This code is starting to make me look like Gollum.

Steven Rostedt (Your editor will tactfully refrain from comment on how he looked before).

As far as I'm concerned, everything NetWare-related is best dealt by fine folks from Miskatonic University, with all the precautions due when working with spawn of the Old Ones...
Al Viro

Besides, hamsters really are evil creatures.

Sure, you may love your fluffy little Flopsy the dwarf hamster, but behind that cute and unassuming exterior lies a calculating and black little heart.

So hamster-cursing pretty much doesn't need any excuses. They have it coming to them.

Linus Torvalds

Sure, I'll gladly accept "I can do it later" from anyone, as long as you don't mind my, "I will merge it later" as well :)
Greg Kroah-Hartman

Comments (3 posted)

Kernel development news

A power-aware scheduling update

By Jonathan Corbet
June 19, 2013
Earlier this month, LWN reported on the "line in the sand" drawn by Ingo Molnar with regard to power-aware scheduling. The fragmentation of CPU power management responsibilities between the scheduler, CPU frequency governors, and CPUidle subsystem had to be replaced, he said, by an integrated solution that put power management decisions where the most information existed: in the scheduler itself. An energetic conversation followed from that decree, and a possible way forward is beginning to emerge. But the problem remains difficult.

Putting the CPU scheduler in charge of CPU power management decisions has a certain elegance; the scheduler is arguably in the best position to know what the system's needs for processing power will be in the near future. But this idea immediately runs afoul of another trend in the kernel: actual power management decisions are moving away from the scheduler toward low-level hardware driver code. As Arjan van de Ven noted in a May Google+ discussion, power management policies for Intel CPUs are being handled by CPU-specific code in recent kernels:

We also, and I realize this might be controversial, combine the control algorithm with the cpu driver in one. The reality is that such control algorithms are CPU specific, the notion of a generic "for all cpus" governors is just outright flawed; hardware behavior is key to the algorithm in the first place.

Arjan suggests that any discussion that is based on control of CPU frequencies and voltages misses an important point: current processors have a more complex notion of power management, and they vary considerably from one hardware generation to the next. The scheduler is not the right place for all that low-level information; instead, it belongs in low-level, hardware-specific code.

There is, however, fairly widespread agreement that passing more information between the scheduler and the low-level power management code would be helpful. In particular, there is a fair amount of interest in better integration of the scheduler's load-balancing code (which decides how to distribute processes across the available CPUs) and the power management logic. The load balancer knows what the current needs are and can make some guesses about the near future; it makes sense that the same code could take part in deciding which CPU resources should be available to handle that load.

Based on these thoughts and more, Morten Rasmussen has posted a design proposal for a reworked, power-aware scheduler. The current scheduler would be split into two separate modules:

  1. The CPU scheduler, which is charged with making the best use of the CPU resources that are currently available to it.

  2. The "power scheduler," which takes the responsibility of adjusting the currently available CPU resources to match the load seen by the CPU scheduler.

The CPU scheduler will handle scheduling as it is done now. The power scheduler, instead, takes load information from the CPU scheduler and, if necessary, makes changes to the system's power configuration to better suit that load. These changes can include moving CPUs from one power state to another or idling (or waking) CPUs. The power scheduler would talk with the current frequency and idle drivers, but those drivers would remain as separate, hardware-dependent code. In this design, load balancing would remain with the CPU scheduler; it would not move to the power scheduler.

Of course, there are plenty of problems to be solved beyond the simple implementation of the power scheduler and the definition of the interface with the CPU scheduler. The CPU scheduler still needs to learn how to deal with processors with varying computing capacities; the big.LITTLE architecture requires this, but more flexible power state management does too. Currently, processes are charged for the amount of time they spend executing on a CPU; that is clearly unfair to processes that are scheduled onto a slower processor. So charging will eventually have to change to a unit other than time; instructions executed, for example. The CPU scheduler will need to become more aware of the power management policies in force. Scheduling processes to enable the use of "turbo boost" mode (where a single CPU can be overclocked if all other CPUs are idle) remains an open problem. Thermal limits will throw more variables into the equation. And so on.

It is also possible that the separation of CPU and power scheduling will not work out; as Morten put it:

I'm aware that the scheduler and power scheduler decisions may be inextricably linked so we may decide to merge them. However, I think it is worth trying to keep the power scheduling decisions out of the scheduler until we have proven it infeasible.

Even with these uncertainties, the "power scheduler" approach should prove to be a useful starting point; Morten and his colleagues plan to post a preliminary power scheduler implementation in the near future. At that point we may hear how Ingo feels about this design relative to the requirements he put forward; he (along with the other core scheduler developers) has been notably absent from the recent discussion. Regardless, it seems clear that the development community will be working on power-aware scheduling for quite some time.

Comments (1 posted)

Tags and IDs

By Jonathan Corbet
June 19, 2013
Our recent coverage of the multiqueue block layer work touched on a number of the changes needed to enable the kernel to support devices capable of handling millions of I/O operations per second. But, needless to say, there are plenty of additional details that must be handled. One of them, the allocation of integer tags to identify I/O requests, seems like a relatively small issue, but it has led to an extensive discussion that, in many ways, typifies how kernel developers look at proposed additions.

Solid-state storage devices will only achieve their claimed I/O rates if the kernel issues many I/O operations in parallel. That allows the device to execute the requests in an optimal order and to exploit the parallelism inherent in having multiple banks of flash storage. If the kernel is not to get confused, though, there must be a way for the device to report the status of specific operations to the kernel; that is done by assigning a tag (a small integer value) to each request. Once that is done, the device can report that, say, request #42 completed, and the kernel will know which operation is done.

If the device is handling vast numbers of operations per second, the kernel will somehow have to come up with an equal number of tags. That suggests that tag allocation must be a fast operation; even a small amount of overhead starts to really hurt when it is repeated millions of times every second. To that end, Kent Overstreet has proposed the merging of a per-CPU tag allocator, a new module with a simple task: allocate unique integers within a given range as quickly as possible.

The interface is relatively straightforward. A "tag pool," from which tags will be allocated, can be declared this way:

    #include <linux/percpu-tags.h>

    struct percpu_tag_pool pool;

Initialization is then done with:

    int percpu_tag_pool_init(struct percpu_tag_pool *pool, unsigned long nr_tags);

where nr_tags is the number of tags to be contained within the pool. Upon successful initialization, zero will be returned to the caller.

The actual allocation and freeing of tags is managed with:

    unsigned percpu_tag_alloc(struct percpu_tag_pool *pool, gfp_t gfp);
    void percpu_tag_free(struct percpu_tag_pool *pool, unsigned tag);

A call to percpu_tag_alloc() will allocate a tag from the given pool. The gfp argument is checked only for the __GFP_WAIT flag; if (and only if) that flag is present, the function will wait for an available tag if need be. The return value is the allocated tag, or TAG_FAIL if no allocation is possible.

The implementation works by maintaining a set of per-CPU lists of available tags; whenever possible, percpu_tag_alloc() will simply take the first available entry from the local list, avoiding contention with other CPUs. Failing that, it will fall back to a global list of tags, moving a batch of tags to the appropriate per-CPU list. Should the global list be empty, percpu_tag_alloc() will attempt to steal some tags from another CPU or, in the worst case, either wait for an available tag or return TAG_FAIL. Most of the time, with luck, tag allocation and freeing operations can be handled entirely locally, with no contention or cache line bouncing issues.

The attentive reader might well be thinking that the API proposed here looks an awful lot like the IDR subsystem, which also exists to allocate unique integer identifiers. That is where the bulk of the complaints came from; Andrew Morton, in particular, was unhappy that no apparent attempt had been made to adapt IDR before launching into a new implementation:

The worst outcome here is that idr.c remains unimproved and we merge a new allocator which does basically the same thing.

The best outcome is that idr.c gets improved and we don't have to merge duplicative code.

So please, let's put aside the shiny new thing for now and work out how we can use the existing tag allocator for these applications. If we make a genuine effort to do this and decide that it's fundamentally hopeless then this is the time to start looking at new implementations.

The responses from Kent (and from Tejun Heo as well) conveyed their belief that IDR is, indeed, fundamentally hopeless for this use case. The IDR code is designed for the allocation of identifiers, so it works a little differently: the lowest available number is always returned and the number range is expanded as needed. The lowest-number guarantee, in particular, forces a certain amount of cross-CPU data sharing, putting a limit on how scalable the IDR code can be. The IDR API also supports storing (and quickly looking up) a pointer value associated with each ID, a functionality not needed by users of tags. As Tejun put it, even if the two allocators were somehow combined, there would still need to be two distinct ways of using it, one with allocation ordering guarantees, and one for scalability.

Andrew proved hard to convince, though; he suggested that, perhaps, tag allocation could be implemented as some sort of caching layer on top of IDR. His position appeared to soften a bit, though, when Tejun pointed out that the I/O stack already has several tag-allocation implementations, "and most, if not all, suck". The per-CPU tag allocator could replace those implementations with common code, reducing the amount of duplication rather than increasing it. Improvements of that sort can work wonders when it comes to getting patches accepted.

Things then took another twist when Kent posted a rewrite of the IDA module as the basis for a new attempt. "IDA" is a variant of IDR that lacks the ability to store pointers associated with IDs; it uses many of the IDR data structures but does so in a way that is more space-efficient. Kent's rewrite turns IDA into a separate layer, with the eventual plan of rewriting IDR to sit on top. Before doing that, though, he implemented a new per-CPU ID allocator implementing the API described above on top of the new IDA code. The end result should be what Andrew was asking for: a single subsystem for the allocation of integer IDs that accommodates all of the known use cases.

All this may seem like an excessive amount of discussion around the merging of a small bit of clearly-useful code that cannot possibly cause bugs elsewhere in the kernel. But if there is one thing that the community has learned over the years, it's that kernel developers are far less scalable than the kernel itself. Duplicated code leads to inferior APIs, more bugs, and more work for developers. So it's worth putting some effort into avoiding the merging of duplicated functionality; it is work that will pay off in the long term — and the kernel community is expecting to be around and maintaining the code for a long time.

Comments (none posted)

Merging Allwinner support

By Jake Edge
June 19, 2013

Getting support for their ARM system-on-chip (SoC) families into the mainline kernel has generally been a goal for the various SoC vendors, but there are exceptions. One of those, perhaps, is Allwinner Technology, which makes an SoC popular in tablets. Allwinner seems to have been uninterested in the switch to Device Tree (DT) in the mainline ARM kernel (and the requirement to use it for new SoCs added to the kernel tree). But the story becomes a bit murkier because it turns out that developers in the community have been doing the work to get fully DT-ready support for the company's A1X SoCs into the mainline. While Allwinner is not fully participating in that effort, at least yet, a recent call to action with regard to support for the hardware seems to be somewhat off-kilter.

The topic came up in response to a note from Ben Hutchings on the debian-release mailing list (among others) that was not specifically about Allwinner SoCs at all; it was, instead, about his disappointment with the progress in the Debian ARM tree. Luke Leighton, who is acting as a, perhaps self-appointed, "go-between" for the kernel and Allwinner, replied at length, noting that the company would likely not be pushing its code upstream:

well, the point is: the expectation of the linux kernel developers is that Everyone Must Convert To DT. implicitly behind that is, i believe, an expectation that if you *don't* convert to Device Tree, you can kiss upstream submission goodbye. and, in allwinner's case, that's simply not going to happen.

As might be guessed, that didn't sit well with the Linux ARM crowd. ARM maintainer Russell King had a sharply worded response that attributed the problem directly to Allwinner. He suggested that, instead of going off and doing its own thing with "fex" (which serves many of the same roles that DT does in the mainline), the company could have pitched in and helped fix any deficiencies in DT. In addition, he is skeptical of the argument that DT was not ready when Allwinner needed it:

DT has been well defined for many many years before we started using it on ARM. It has been used for years on both PowerPC and Sparc architectures to describe their hardware, and all of the DT infrastructure was already present in the kernel.

Leighton, though, points to the success of the Allwinner SoCs, as well as the ability for less-technical customers to easily reconfigure the kernel using fex as reasons behind the decision. There are, evidently, a lot of tablet vendors who have limited technical know-how, so not having to understand DT or how to transform it for the bootloader is a major plus:

the ODMs can take virtually any device, from any customer, regardless of the design, put *one* [unmodified, precompiled] boot0, boot1, u-boot and kernel onto it, prepare the script.fex easily when the customer has been struggling on how to start that DOS editor he heard about 20 years ago, and boot the device up, put it into a special mode where the SD/MMC card becomes a JTAG+RS232 and see what's up... all without even removing any screws.

The discussion continued in that vein, with ARM kernel developers stating that the way forward was to support DT while Leighton insisted that Allwinner would just continue to carry its patches in its own tree and that Linux (and its users) would ultimately lose out because of it. Except for one small problem: as Thomas Petazzoni pointed out, Maxime Ripard has been working on support for the Allwinner A1X SoCs—merged into the 3.8 kernel in arch/arm/mach-sunxi.

In fact, it turns out that Ripard has been in contact with Allwinner and gotten data sheets and evaluation boards from it. He pointed Leighton to a wiki that is tracking the progress of the effort. That work has evidently been done on a volunteer basis, as Ripard is interested in seeing mainline support for those SoCs.

In the end, Leighton's messages start to degenerate into what might seem like an elaborate troll evidencing a serious misunderstanding of how Linux kernel development happens. In any case, he seems to think he is in a position to influence Allwinner's management to pursue an upstream course, rather than its current development path. But his demands and his suggestion that he apologize on behalf of the Linux kernel community for "not consulting with you (allwinner) on the decision to only accept device tree" elicited both amazement and anger—for obvious reasons.

Leighton appears to start with the assumption that the Linux kernel and its community need to support Allwinner SoCs, and that they need to beg Allwinner to come inside the tent. It is a common starting point for successful silicon vendors, but time and again has been shown to not be the case at all. In fact, Allwinner's customers are probably already putting pressure on the company to get its code upstream so that they aren't tied to whichever devices and peripherals are supported in the Allwinner tree.

As far as fex goes, several in thread suggested that some kind of translator could be written to produce DT from fex input. That way, customers who want to use a Windows editor to configure their device will just need to run the tool, which could put the resulting flattened DT file into the proper place in the firmware. Very little would change for the customers, but they would immediately have access to the latest Linux kernel with its associated drivers and core kernel improvements.

Alternatively, Allwinner could try to make a technical case for the superiority of fex over DT, as Russell King suggested. It seems unlikely to be successful, as several developers in the thread indicated that it was a less-general solution than DT, but it could be tried. Lastly, there is nothing stopping Allwinner from continuing down its current path. If its customers are happy with the kernels it provides, and it is happy to carry its code out of tree, there is no "Linux cabal" that will try to force a change.

Evidently, though, that may not actually be what Allwinner wants. Its efforts to support Ripard's work, along with contacts made by Olof Johansson, Ripard, and others, indicate that Allwinner is interested in heading toward mainline. It essentially started out where many vendors do, but, again like many SoC makers before it, decided that it makes sense to start working with upstream.

We have seen this particular story play out numerous times before—though typically with fewer comedic interludes. In a lot of ways, it is the vendors who benefit most from collaborating with the mainline. It may take a while to actually see that, but most SoC makers end up there eventually—just as with other hardware vendors. There are simply too many benefits to being in the mainline to stay out of tree forever.

Comments (23 posted)

Patches and updates

Kernel trees

Linus Torvalds Linux 3.10-rc6
Greg KH Linux 3.9.6
Greg KH Linux 3.4.49
Greg KH Linux 3.0.82
Sebastian Andrzej Siewior 3.8.13-rt11
Ben Hutchings Linux 3.2.47

Architecture-specific

Core kernel code

Development tools

Device drivers

Documentation

Filesystems and block I/O

Memory management

Networking

Jesse Gross Open vSwitch

Security-related

Virtualization and containers

Miscellaneous

Mathieu Desnoyers Userspace RCU 0.7.7

Page editor: Jake Edge


Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds