
Kernel development

Brief items

Kernel release status

The current development kernel is 3.6-rc4, released on September 1. "Shortlog appended, as you can see it's just fairly random. I'm hoping we're entering the boring/stable part of the -rc windows, and that things won't really pick up speed just because people are getting home."

Stable updates: no stable updates have been released in the last week, and none are in the review process as of this writing.

Comments (none posted)

Quotes of the week

As every parent knows, a tidy bedroom is very different from a messy one. The number of items in the room may be exactly the same, but the difference between orderly and disorderly arrangements is immediately apparent. Now imagine a house with millions of rooms, each of which is either tidy or messy. A robot in the house can inspect each room to see which state it is in. It can also turn a tidy room into a messy one (by throwing things on the floor at random) and a messy room into a tidy one (by tidying it up). This, in essence, is how a new class of memory chip works.
The Economist on phase-change memory

"RFC" always worries me. I read it as "Really Flakey Code"
Andrew Morton

Sorry for the late response, was too busy drinking with other kernel developers in San Diego and laughing at all you that are still doing real work.
Steven Rostedt

Yes I have now read kernel bugzilla, every open bug (and closed over half of them). An interesting read, mysteries that Sherlock Holmes would puzzle over, a length that wanted a good editor urgently, an interesting line in social commentary, the odd bit of unnecessary bad language. As a read it is however overall not well explained or structured.
Alan Cox

Comments (2 posted)

Kernel development news

KS2012: Regression testing

By Michael Kerrisk
August 30, 2012
2012 Kernel Summit

The "regression testing" slot on day 1 of the 2012 Kernel Summit consisted of presentations from Dave Jones and Mel Gorman. Dave's presentation described his new fuzz testing tool, while Mel's was concerned with some steps to improve benchmarking for detecting regressions.

Trinity: intelligent fuzz testing

Dave Jones talked about a testing tool that he has been working on for the last 18 months. That tool, Trinity, is a type of system call fuzz tester. Dave noted that fuzz testing is nothing new, and that the Linux community has had fuzz testing projects for around a decade. The problem is that past fuzz testers take a fairly simplistic approach, passing random bit patterns in the system call arguments. This suffices to find the really simple bugs, for example, detecting that a numeric value passed to a file descriptor argument does not correspond to a valid open file descriptor. However, once these simple bugs are fixed, fuzz testers tend to simply encounter the error codes (EINVAL, EBADF, and so on) that system calls (correctly) return when they are given bad arguments.

What distinguishes Trinity is the addition of some domain-specific intelligence. The tool includes annotations that describe the arguments expected by each system call. For example, if a system call expects a file descriptor argument, then rather than passing a random number, Trinity opens a range of different types of files, and passes the resulting descriptors to the system call. This allows fuzz testing to get past the simplest checks performed on system call arguments, and find deeper bugs. Annotations are available to indicate a range of argument types, including memory addresses, pathnames, PIDs, lengths, and so on. Using these annotations, Trinity can generate tests that are better targeted at the argument type (for example, the Trinity web site notes that powers of two plus or minus one are often effective for triggering bugs associated with "length" arguments). The resulting tests performed by Trinity are consequently more sophisticated than traditional fuzz testers, and find new types of errors in system calls.
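As a rough illustration of the approach (this is not Trinity's own code; the file, names, and annotation table below are invented), an annotation-driven fuzzer might describe each system call's argument types and then substitute a real file descriptor, a valid buffer address, and an "interesting" length rather than raw random bits:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <time.h>
    #include <unistd.h>

    enum arg_type { ARG_FD, ARG_ADDRESS, ARG_LEN };

    /* Lengths near powers of two are good at tripping off-by-one and
       overflow bugs; cap them so they stay within the test buffer. */
    static unsigned long fuzz_len(void)
    {
        unsigned long v = 1UL << (rand() % 16);

        switch (rand() % 3) {
        case 0:  return v - 1;
        case 1:  return v + 1;
        default: return v;
        }
    }

    static long fuzz_arg(enum arg_type type, int fd, void *buf)
    {
        switch (type) {
        case ARG_FD:      return fd;          /* a real, open descriptor */
        case ARG_ADDRESS: return (long)buf;   /* a valid buffer address */
        case ARG_LEN:     return fuzz_len();  /* an "interesting" length */
        }
        return 0;
    }

    int main(void)
    {
        static char buf[1 << 16];
        /* annotation: pread64(fd, buf, count, offset) */
        enum arg_type pread_args[3] = { ARG_FD, ARG_ADDRESS, ARG_LEN };
        int fd = open("/etc/passwd", O_RDONLY);

        if (fd < 0)
            return 1;
        srand(time(NULL));

        for (int i = 0; i < 10; i++) {
            long ret = syscall(SYS_pread64,
                               fuzz_arg(pread_args[0], fd, buf),
                               fuzz_arg(pread_args[1], fd, buf),
                               fuzz_arg(pread_args[2], fd, buf),
                               0L);
            printf("pread64 -> %ld\n", ret);
        }
        return 0;
    }

A real tool, of course, covers hundreds of system calls and many more argument types; the point is simply that argument-aware inputs get past the kernel's first-line validity checks where purely random ones do not.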

Ted Ts'o asked whether it's possible to bias the tests performed by Trinity in favor of particular kernel subsystems. In response, Dave noted that Trinity can be directed to open the file descriptors that it uses for testing from a particular filesystem (for example, an ext4 partition).

Dave stated that Trinity is run regularly against the linux-next tree as well as against Linus's tree. He noted that Trinity has found bugs in the networking code, filesystem code, and many other parts of the kernel. One of the goals of his talk was simply to encourage other developers to start employing Trinity to test their subsystems and architectures. Trinity currently supports the x86, ia64, powerpc, and sparc architectures.

Benchmarking for regressions

Mel Gorman's talk slot was mainly concerned with improving the discovery of performance regressions. He noted that, in the past, "we talked about benchmarking for patches when they get merged. But there's been much inconsistency over time." In particular, he called out the practice of writing commit changelog entries that simply give benchmark statistics from running a particular benchmarking tool as being nearly useless for detecting regressions.

Mel would like to see more commit changelogs that provide enough information to perform reproducible benchmarks. Leading by example, Mel uses his own benchmarking framework, MMTests, and he has posted historical results from kernels 2.6.32 through to 3.4. What he would like to see is changelog entries that, in addition to giving benchmark results, identify the benchmark framework they use and include (pointers to) the specific configuration used with the framework. (The configuration could be in the changelog, or if too large, it could be stored in some reasonably stable location such as the kernel Bugzilla.)

H. Peter Anvin responded that "I hope you know how hard it is for submitters to give us real numbers at all." But this didn't deter Mel from reiterating his desire for sufficient information to reproduce benchmarking tests; he noted that many regressions take a long time to be discovered, which increases the importance of being able to reproduce past tests.

Ted Ts'o observed that there seemed to be a need for a per-subsystem approach to benchmarking. He then asked whether individual subsystems would even be able to come to consensus on what would be a reasonable set of metrics, and noted that those metrics should not take too long to run (since metrics that take a long time to execute are unlikely to be executed in practice). Mel offered that, if necessary, he would volunteer to help write configuration scripts for kernel subsystems. From there, discussion moved into a few other related topics, without reaching any firm resolutions. However, performance regressions are a subject of great concern to kernel developers, and the topic of reproducible benchmarking is one that will likely be revisited soon.

Comments (none posted)

KS2012: Distributions and upstream

By Michael Kerrisk
September 5, 2012
2012 Kernel Summit

The "distributions and upstream" session of day 1 of the 2012 Kernel Summit focused on a question enunciated by Ted Ts'o: "From an upstream perspective, how can we better help distros?" Responding to that question were two distributor representatives: Ben Hutchings for Debian and Dave Jones for Fedora.

Ben Hutchings asked that, when considering merging a new feature, kernel developers not accept the argument that "this feature is expensive, but that's okay because we'll make it an option". He pointed out that this argument is based on a logical fallacy, since in nearly every case distributions will enable the option, because some users will need it. As an example, Ben mentioned memory cgroups (memcg), which, in their initial implementation, carried a significant performance cost.

A second point that Ben made was that there are still features that distributions are adding that are not being merged upstream. As an example from last year, he mentioned Android. As a current example, he noted the union mounts feature, which is still not upstream. Inasmuch as keeping features such as these outside of the mainline kernel creates more work for distributions, he would like to see such features more actively merged.

Dave Jones made three points. The first of these was that a lot of Kconfig help texts are "really awful". As a consequence, distribution maintainers have to read the code in order to work out if a feature should be enabled.

Dave's second point was that it would be useful to have an explicit list of regressions at around the -rc3 or -rc4 point in the release cycle. His problem is that regressions often become visible only much later. Finally, Dave noted that Fedora sees a lot of reports from lockdep that no other distributions seem to see. The basic problem underlying both of these points is, of course, a lack of early testing, and at this point Ted Ts'o mused: "can we make it easier for users to run the kernel-of-the-day [in particular, -rc1 and -rc2 kernels] and allow them to easily fall back to a stable kernel if it doesn't work out?" There was, however, no conclusive response in the ensuing discussion.

Returning to the general subject of Kconfig, Matthew Garrett echoed and elaborated on one of the points made by Ben Hutchings, noting that Kconfig is important for kernel developers (so that they can strip down a kernel for fast builds). However, because distributors will nearly always enable configuration options (as described above), kernel developers need to ask themselves, "If you don't expect an option to be enabled [by distributors], then why is the option even present?". In passing, Andrea Arcangeli noted one of his pet irritations, one with which most people who have ever built a kernel will be familiar. When running make oldconfig, it is very easy to overstep as one types Enter to accept the default "no" for most options; one suddenly realizes that the answer to an earlier question should have been "yes". At that point, of course, there is no way to go back, and one must instead restart from the beginning. (Your editor observes that improving this small problem could be a nice way for a budding kernel hacker to get their hands dirty.)

Comments (19 posted)

KS2012: Lightning talks

By Michael Kerrisk
September 5, 2012
2012 Kernel Summit

The lightning talks on day 1 of the 2012 Kernel Summit were over in, one could say, a flash. There were just two very brief discussions.

Paul McKenney noted that a small number of read-copy update (RCU) users have for some time requested the ability to offload RCU callbacks. Normally, RCU callbacks are invoked on the CPU that registered them. This works well in most cases, but it can result in unwelcome variations in the execution times of user processes running on the same CPU. This kind of variation (also known as operating system jitter) can be reduced by offloading the callbacks—arranging for that CPU's RCU callbacks to be invoked on some other CPU. Paul asked if the ability to offload RCU callbacks was of interest to others in the room. A number of developers responded in the affirmative.
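The usual callback pattern looks like the following minimal, kernel-style sketch (the structure and helper names are invented); offloading would not change how callbacks are queued, only where the queued callback eventually runs:

    #include <linux/kernel.h>
    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct item {
        int value;
        struct rcu_head rcu;
    };

    /* Invoked after a grace period has elapsed.  By default this runs on
       the CPU that queued it; callback offloading would move the
       invocation to a kernel thread on some other CPU, reducing jitter
       for whatever else is running on the queueing CPU. */
    static void item_free_rcu(struct rcu_head *head)
    {
        kfree(container_of(head, struct item, rcu));
    }

    static void item_release(struct item *item)
    {
        /* Queue the callback on the current CPU; the free is deferred
           until all pre-existing RCU readers have finished. */
        call_rcu(&item->rcu, item_free_rcu);
    }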

Dan Carpenter noted the existence of Smatch, his static analysis tool that detects various kinds of errors in C source code, pointing out that by now "many of you have received emails from me". (The emails that he referred to contained kernel patches and lists of bugs or potential bugs in kernel code. In the summary of his LPC 2011 presentation, Dan noted that Smatch has resulted in hundreds of kernel patches.) Dan's main point was simply to request other ideas from kernel developers on what checks to add to Smatch; he noted that there is a mailing list, smatch@vger.kernel.org, to which suggestions can be sent.

Comments (none posted)

KS2012: Kernel build/boot testing

By Michael Kerrisk
September 5, 2012
2012 Kernel Summit

The presentation given by Fengguang Wu on day 1 of the 2012 Kernel Summit was about testing for build and boot regressions in the Linux kernel. In the presentation, Fengguang described the test framework that he has established to detect and report these regressions in a more timely fashion.

To summarize the problem that Fengguang is trying to resolve, it's simplest to look at things from the perspective of a maintainer making periodic kernel releases. The most obvious example is of course the mainline tree maintained by Linus, which goes through a series of release candidates on the way to the release of a stable kernel. The linux-next tree maintained by Stephen Rothwell is another example. Many other developers depend on these releases. If, for some reason, those kernel releases don't successfully build and boot, then the daily work of other kernel developers is impaired while they resolve the problem.

Of course, Linus and Stephen strive to ensure that these kinds of build and boot errors don't occur: before making kernel releases, they do local testing on their development systems, and ensure that the kernel builds, boots, and runs for them. The problem comes in when one considers the variety of hardware architectures and configuration options that Linux provides. No single developer can test all combinations of architectures and options, which means that, for some combinations, there are inevitably build and boot errors in the mainline -rc and linux-next releases. These sorts of regressions appear even in the final releases performed by Linus; Fengguang noted the results found by Geert Uytterhoeven, who reported that (for example) in the Linux 3.4 release, his testing found around 100 build error messages resulting from regressions. (Those figures are exaggerated because some errors occur on obscure platforms that see less maintainer attention. But they include a number of regressions on mainstream platforms that have the potential to disrupt the work of many kernel developers.) Furthermore, even when a build problem appears in a series of kernel commits but is later fixed before a mainline -rc release, this still creates a problem: developers performing bisects to discover the causes of other kernel bugs will encounter the build failures during the bisection process.

As Fengguang noted, the problem is that it takes some time for these regressions to be detected. By that time, it may be difficult to determine what kernel change caused the problem and who it should be reported to. Many such reports on the kernel mailing list get no response, since it can be hard to diagnose user-reported problems. Furthermore, the developer responsible for the problem may have moved on to other activities and may no longer be "hot" on the details of work that they did quite some time ago. As a result, there is duplicated effort and lost time as the affected developers resolve the problems themselves.

According to Fengguang, these sorts of regressions are an inevitable part of the development process. Even the best of kernel developers may sometimes fail to test for regressions. When such regressions occur, the best way to ensure that they are resolved is to quickly and accurately determine the cause and promptly notify the developer responsible.

Fengguang's answer to this problem is an automated system that detects these regressions and then informs kernel developers by email that their commit X triggered bug Y. Crucially, the email reports are generated nearly immediately (with a one-hour response time) after commits are merged into the tested repositories. (For this reason, Fengguang calls his system a "0-day kernel test" system.) Since the relevant developer is informed quickly, it's more likely that they'll still be "hot" on the technical details, and able to fix the problem quickly.

Fengguang's test framework at the Intel Open Source Technology Center consists of a server farm that includes five build servers (three Sandy Bridge and two Itanium systems). On these systems, kernels are built inside chroot jails. The built kernel images are then boot tested inside over 100 KVM instances on another eight test boxes. The system builds and boots each tested kernel configuration, on a commit-by-commit basis for a range of kernel configurations. (The system reuses build outputs from previous commits so as to expedite the build testing. Thus, the build time for the first commit of an allmodconfig build is typically ten minutes, but subsequent commits require two minutes to build on average.)

Tests are currently run against Linus's tree, linux-next, and more than 180 trees owned by individual kernel maintainers and developers. (Running tests against individual maintainers' trees helps ensure that problems are fixed before they taint Linus's tree and linux-next.) Together, these trees produce 40 new branch heads and 400 new commits on an average working day. Each day, the system build tests 200 of the new commits. (The system allows trees to be categorized as "rebasable" or "non-rebasable". The latter are usually big subsystem trees whose maintainers take responsibility for doing bisectability tests before publishing commits. Rebasable trees are tested on a commit-by-commit basis. For non-rebasable trees, only the branch head is built; only if that fails does the system go through the intervening commits to locate the source of the error. This is why not all 400 of the daily commits are tested.)

The current machine power allows the build test system to test 140 kernel configurations (as well as running sparse and coccinelle) for each commit. Around half of these configurations are randconfig builds, which are regenerated each day in order to increase test coverage over time. (randconfig builds the kernel with randomized configuration options, so as to test unusual kernel configurations.) Most of the built kernels are boot tested, including the randconfig ones. Boot tests for the head commits are repeated multiple times to increase the chance of catching less-reproducible regressions. In the end, 30,000 kernels are boot tested each day. In the process, the system catches 4 new static errors or warnings per day, and 1 boot error every second day.

The response from the kernel developers in the room to this new system was extremely positive. Andrew Morton noted that he'd received a number of useful reports from the tool. "All contained good information, and all corresponded to issues I felt should be fixed." Others echoed Andrew's comments.

One developer in the room asked what he should do if he has a scratch branch that is simply too broken to be tested. Fengguang replied that his build system maintains a blacklist, and specific branches can be added to that blacklist on request. In addition, a developer can include a line containing the string Dont-Auto-Build in a commit message; this causes the build system to skip testing of the whole branch.
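For example, a throwaway commit might carry a message along these lines (a made-up message; only the Dont-Auto-Build line matters to the test system):

    scratch: experiment with alternative locking

    This branch is known not to build yet; skip automated testing.

    Dont-Auto-Build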

Many problems in the system have already been fixed as a consequence of developer feedback: the build test system is fairly mature; the boot test system is already reasonably usable, but has room for further improvement. Fengguang is seeking further input from kernel developers on how his system could be improved. In particular, he is asking kernel developers for runtime stress and functional test scripts for their subsystems. (Currently the boot test system runs a limited set of tools—trinity, xfstests, and a handful of memory management tests—for catching runtime regressions.)

Fengguang's system has already clearly had a strong positive impact on the day-to-day life of kernel developers. With further feedback, the system is likely to provide even more benefit.

Comments (5 posted)

KS2012: Status of Android upstreaming

By Michael Kerrisk
September 5, 2012
2012 Kernel Summit

Anyone who has paid even slight attention to the progress of the mainlining of the Android modifications to the Linux kernel will be aware that the process has had its ups and downs. An initial attempt to mainline the changes via the staging tree ended in failure when the code was removed in kernel 2.6.33 in late 2010. Nevertheless, at the 2011 Kernel Summit, kernel developers indicated a willingness to mainline code from Android, and starting with Linux 3.3, various Android pieces were brought back into the staging tree. (On the Android side this was guided by the Android Mainlining Project.) The purpose of John Stultz's presentation on day 1 of the 2012 Kernel Summit was to review the current status of upstreaming of the Android code and outline the work yet to be done.

John began by reviewing the progress in recent kernel releases. Linux 3.3 reintroduced a number of pieces to staging, including ashmem, binder, logger, and the low-memory killer. With the Linux 3.3 release, it became possible to boot Android on a vanilla kernel. Linux 3.4 added some further pieces to the staging tree and also saw a lot of cleanup of the previously merged code. Subsequent kernels have seen further Android code move to the staging tree, including the wakeup_source feature and the Android Gadget driver. In addition, some code in the staging tree has been converted to use upstream kernel features; for example, Android's alarm-dev feature was converted to use the alarm timers feature added to Linux in kernel 3.0.

As of now (i.e., after the closure of the 3.6 merge window), there still remain some major features to merge, including the ION memory allocator. In addition, various Android pieces still remain in the staging tree (for example, the low-memory killer, ashmem, binder, and logger), and these need to be reworked (or replaced), so that the equivalent functionality is provided in the mainline kernel. However, one has the impression that these technical issues will all be solved, since there's been a general improvement in relations on both sides of the Android/upstream fence; John noted that these days there is much less friction between the two sides, more Android developers are participating in the Linux community, and the Linux community seems more accepting of Android as a project. Nevertheless, John noted a few things that could still be improved on the Android side. In particular, for many releases, the Android developers provided updated code branches for each kernel release, but in more recent times they have skipped doing this for some kernel releases.

Following John's presentation, there was relatively little discussion, which is perhaps an indication that kernel developers are reasonably satisfied with the current status and momentum of Android upstreaming. Matthew Garrett asked whether John had any feeling about whether other projects are making use of the upstreamed Android code. In response, John noted that Android code is being used as the default Board Support Package for some projects, such as Firefox OS. He also mentioned that the volatile ranges code that he is currently developing has a number of potential uses outside of Android.

Matthew was also curious to know whether there is anything that the Linux kernel developers could do to help make the design process for features that are going into Android more open. Right now, most Android features are developed in-house, but perhaps a more openly developed solution might have satisfied other users' requirements. There was some back and forth as to how practical any other kind of model would be, especially given the focus of vendors on product deadlines; the implicit conclusion was that anything other than the status quo was unlikely.

Overall, the current status of Android upstreaming is very positive, and certainly rather different from the situation a couple of years ago.

Comments (2 posted)

KS2012: Module signing

By Jake Edge
September 6, 2012
2012 Kernel Summit

From several accounts, day one of this year's Kernel Summit was largely argument-free. There were plenty of discussions, even minor disagreements, but nothing approaching some of the battles of yore. Day three looked like it might provide an exception to that pattern with a discussion of two different patch sets that are both targeted at cryptographically signing kernel modules. In the end, though, the pattern continued, with an interesting, but tame, session.

Kernel modules are inserted into the running kernel, so a rogue module could be used to compromise the kernel in ways that are hard to detect. One way to prevent that from happening is to require that kernel modules be cryptographically signed using keys that are explicitly allowed by the administrator. Before loading the module, the kernel can check the signature and refuse to load any that can't be verified. Those modules could come from a distribution or be built with a custom kernel. Since modules can be loaded based on a user action (e.g. attaching a device or using a new network protocol) or come from a third-party (e.g. binary kernel modules), ensuring that only approved modules can be loaded is a commonly requested feature.

Rusty Russell, who maintains the kernel module subsystem, called the meeting to try to determine how to proceed on module signing. David Howells has one patch set that is based on what has been in RHEL for some time, while Dmitry Kasatkin posted another that uses the digital signature support added to the kernel for integrity management. Howells's patches have been around, in various forms, since 2004, while Kasatkin's are relatively new.

Russell prefaced the discussion with an admonishment that he was not interested in discussing the "politics, ethics, or morality" of module signing. He invited anyone who did want to debate those topics to a meeting at 8pm, which was shortly after he had to leave for his plane. The reason we will be signing modules, he said, is because Linus Torvalds wants to be able to sign his modules.

Kasatkin's approach would put the module signature in the extended attributes (xattrs) of the module file, Russell began, but Kasatkin said that choice was only a convenience. His patches are now independent of the integrity measurement architecture (IMA) and the extended verification module (EVM), both of which use xattrs. He originally used xattrs because of the IMA/EVM origin of the signature code he is using, and he did not want to change the module contents. Since then, he noted a response from Russell to Howells's approach and has changed his patches to add the module signature to the end of the file.

That led Russell into a bit of a historical journey. The original patches from Howells put the signature into an ELF section in the module file. But, because there was interest in having the same signature on both stripped and unstripped module files, there was a need to skip over some parts of the module file when calculating the hash that goes into the signature.

The amount of code needed to parse ELF was "concerning", Russell said. Currently, there are some simple sanity checks in the module-loading code, without any checks for malicious code because the belief was that you had to be root to load a module. While that is still true, the advent of things like secure boot and IMA/EVM has made checking for malicious code a priority. But Russell wants to ensure that the code doing that checking is as simple as possible to verify, which was not true when putting module signatures into ELF sections.

Greg Kroah-Hartman pointed out that you have to do ELF parsing to load the module anyway. There is a difference, though. If the module is being checked for maliciousness, that parsing happens after the signature is checked. Any parsing that is done before that verification is potentially handling untrusted input.

Russell would rather see the signature appended to the module file in some form. It could be a fixed-length signature block, as suggested by Torvalds, or there could be some kind of "magic string" followed by a signature. That would allow for multiple signatures on a module. Another suggestion was to change the load_module() system call so that the signature was passed in, which would "punt" the problem to user space "that I don't maintain anymore", Russell said.

Russell's suggestion was to just do a simple backward search from the end of the module file to find the magic string, but Howells was not happy with that approach for performance reasons. Instead, Howells added a 5-digit ASCII number for the length of the signature, which Russell found a bit inelegant. Looking for the magic string "doesn't take that long", he said, and module loading is not that performance-critical.

There were murmurs of discontent in the room about that last statement. There are those who are very sensitive about module loading times because it impacts boot speed. But, Russell said that he could live with ASCII numbers, as long as there was no need to parse ELF sections in the verification code. He does like the fact that modules can be signed in the shell, which is the reason behind the ASCII length value.
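To make the scheme concrete, here is a rough user-space sketch of the appended-signature layout and the backward search that Russell described; the marker string and helper function are invented for illustration, and the actual patches may well differ:

    #include <stddef.h>
    #include <string.h>

    #define SIG_MAGIC "~Module signature appended~\n"

    /* A signed module is laid out as: [module data][SIG_MAGIC][signature].
       Search backward from the end for the marker, then report how many
       bytes are covered by the hash and how long the trailing signature is. */
    static int split_signed_module(const char *buf, size_t len,
                                   size_t *mod_len, size_t *sig_len)
    {
        size_t magic_len = strlen(SIG_MAGIC);
        size_t i;

        if (len < magic_len)
            return -1;

        for (i = len - magic_len; ; i--) {
            if (memcmp(buf + i, SIG_MAGIC, magic_len) == 0) {
                *mod_len = i;                    /* bytes to hash and verify */
                *sig_len = len - i - magic_len;  /* signature that follows */
                return 0;
            }
            if (i == 0)
                break;
        }
        return -1;    /* no marker: treat as an unsigned module */
    }

Because the marker sits just before the signature, the backward search normally terminates after scanning only the signature bytes, which is why Russell expected its cost to be negligible next to the RSA verification itself.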

There are Red Hat customers asking for SHA-512 digests signed with 4K RSA keys, Howells said, but that may change down the road. That could make picking a size for a fixed-length signature block difficult. But, as Ted Ts'o pointed out, doing a search for the magic string is in the noise in comparison to doing RSA with 4K keys. The kernel crypto subsystem can use hardware acceleration to make that faster, Howells said. But, Russell was not convinced that the performance impact of searching for the magic string was significant and would like to see some numbers.

James Bottomley asked where the keys for signing would come from. Howells responded that the kernel build process can create a key. The public part would go into the kernel for verification purposes, while the private part would be used for signing. After the signing is done, that ephemeral private key could be discarded. There is also the option to specify a key pair to use.

Torvalds said that it was "stupid" to have stripped modules with the same signature as the unstripped versions. The build process should just generate signatures for both. Having logic to skip over various pieces of the module just adds a new attack point. Another alternative is to generate signatures only for the stripped modules; the others are used only for debugging and aren't loaded anyway, so they can remain unsigned, he said. Russell agreed, suggesting that the build process could just call out to something to do the signing.

For binary modules, such as the NVIDIA graphics drivers, users would have to add the NVIDIA public key to the kernel ring, Peter Jones said.

Kees Cook brought up an issue that is, currently at least, specific to Chrome OS. In Chrome OS, there is a trusted root partition, so knowing the origin of a module would allow those systems to make decisions about whether or not to load them. Right now, the interface doesn't provide that information, so Cook suggested changing the load_module() system call (or adding a new one) that passed a file descriptor for the module file. Russell agreed that an additional interface was probably in order to solve that problem.

In the end, Russell concluded that there was a reasonable amount of agreement about how to approach module signing. He planned to look at the two patch sets, try to find the commonality between the two, and "apply something". In fact, he made a proposal, based partly on Howells's approach, on September 4. It appends the signature to the module file after a magic string as Russell has been advocating. As he said when wrapping up the discussion, his patch can provide a starting point to solving this longstanding problem.

Comments (11 posted)

KS2012: ARM: AArch64

By Jake Edge
September 5, 2012
2012 Kernel Summit

Catalin Marinas led a discussion of kernel support for 64-bit ARM processors as part of day two of the ARM minisummit. He concentrated on the status of the in-flight patches to add that support, while pointing to his LinuxCon talk later in the week for more details about the architecture itself.

A second round of the ARM-64 patches was posted to the linux-kernel mailing list in mid-August. After some complaints about the "aarch64" name for the architecture, it was changed to "arm64", at least for the kernel source directory. That name will really only be seen by kernel developers as uname will still report "aarch64", in keeping with the ELF triplet used by the binaries built with GCC.

Some of the lessons learned from the ARM 32-bit support have been reflected in arm64. It will target a single kernel image by default, for example. That means that device tree support is mandatory for AArch64 platforms. Since there are not, as yet, any AArch64 platforms, the patches contain simplified platform code based on that of the Versatile Express.

There are two targets for AArch64 devices: embedded and server. It is possible that ACPI support will be required for the servers. As far as Marinas knows, there is no ACPI implementation out there, but it is not clear what Microsoft is doing in that area.

The code for generic timers and the generic interrupt controller (GIC) lives under the drivers directory. That code could be shared with arch/arm, but there is a need to #ifdef the inline assembly code.

There is an intent to push back on the system-on-a-chip (SoC) vendors regarding things like firmware initialization, boot protocol, and a standardized secure mode API. SoC vendors (and thus, their ARM sub-trees) should be providing the standard interfaces, rather than heading out on their own. The ARM maintainers can choose not to accept ports that do not conform.

That may work for devices targeted at Linux, but there may be SoC vendors who initially target another operating system, as Olof Johansson noted. There will likely need to be some give and take for things such as the boot protocol when Windows, iOS, or OS X targeted devices are submitted. Marinas said that the aim would be for standardization, but they "may have to cope" with other choices at times.

The first code from SoC vendors is not expected before the end of the year, Marinas said. Arnd Bergmann half-jokingly suggested that he would be happy to get a leaked version of that code at any time. The first SoCs might well just be existing 32-bit ARMv7 SoCs with an AArch64 CPU (aka ARMv8) dropped in. That may be the path for embedded applications, though the vendors targeting the server market are likely to be starting from scratch.

That led to a discussion of how to push the arm64 patches forward. Marinas would like to push the core architecture code forward, while working to clean up the example SoC code. He would like to target the 3.8 kernel for the core. Bergmann was strongly in favor of getting it all into linux-next soon, and targeting a merge for the 3.7 development cycle.

Marinas is concerned that including the SoC code will delay inclusion as it will require more review. He also wants to make sure that there is a clean base for those who want to use it as a basis for their own SoC code. That should take two weeks or so, Marinas said. He hopes to get it into linux-next sometime after 3.7-rc1, but Bergmann encouraged a faster approach. There is nothing very risky about doing so, Johansson pointed out, as a new architecture cannot break any existing code.

There is some concern about the 2MB limit on device tree binary (dtb) files because some network controllers (and other devices) may have firmware blobs larger than that. Bergmann noted that those blobs may not be able to be shipped in the kernel, but could be put into firmware and loaded from there. It turns out that the flattened device tree format already has a length entry in its header that can be used to support multiple dtbs, which will allow the 2MB limit to be worked around.
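The header in question is small; below is a sketch of how a loader might use its total-size field to step over concatenated blobs. The helper function is invented, but the field layout follows the published flattened device tree format, with all fields stored big-endian:

    #include <stddef.h>
    #include <stdint.h>
    #include <arpa/inet.h>          /* ntohl() */

    #define FDT_MAGIC 0xd00dfeed

    struct fdt_header {
        uint32_t magic;             /* FDT_MAGIC */
        uint32_t totalsize;         /* size of this blob, including header */
        uint32_t off_dt_struct;
        uint32_t off_dt_strings;
        uint32_t off_mem_rsvmap;
        uint32_t version;
        uint32_t last_comp_version;
        uint32_t boot_cpuid_phys;
        uint32_t size_dt_strings;
        uint32_t size_dt_struct;
    };

    /* Given one blob in a region ending at 'end', return the start of the
       next concatenated blob, or NULL when there are no more. */
    static const void *fdt_next_blob(const void *blob, const void *end)
    {
        const struct fdt_header *h = blob;
        const char *next;

        if (ntohl(h->magic) != FDT_MAGIC)
            return NULL;
        next = (const char *)blob + ntohl(h->totalsize);
        return next < (const char *)end ? next : NULL;
    }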

The existing arm64 emulation does not have any DMA, so support for that feature is currently untested. In addition, some SoCs are likely to only support 32-bit DMA. Bergmann suggested an architecture-independent implementation that used dma_ops pointers to provide both coherent and non-coherent versions, but Marinas would like to do something simpler (i.e. coherent only) to start with. Since the "hardware" currently lacks DMA, "all DMA is coherent" seems like a reasonable model, Bergmann said. Since no one will be affected by any bugs in the code, he suggested getting it into linux-next as soon as possible.

Tony Lindgren asked if ARM maintainer Russell King had any comments on the patches. Marinas said that there were not many, at least so far. Bergmann said that he didn't think King was convinced that having a separate arm64 directory (as opposed to adding 64-bit support to the existing arm directory) was the right approach.

Many of the decisions were made for ARM 15 years ago, Marinas said, and some of those make it messy to drop arm64 on top of arm. Some day, when the arm tree only supports ARMv7, it may make sense to merge with arm64. The assembly code cannot be shared, because they are two different architectures, Bergmann said. In addition, the system calls cannot be shared and the platform code is going to be done very differently for arm64, he said.

But, there is room for sharing some things between the two trees, Marinas said. That includes some of the device tree files, perf, the generic timer, the GIC driver code, as well as KVM and Xen if and when they are merged. In theory, the ptrace() and signal-handling code could be shared as well.

Progress is clearly being made for arm64, and we will have to wait and see how quickly it can make its way into the mainline.

Comments (none posted)

KS2012: ARM: A big.LITTLE update

By Jake Edge
September 5, 2012
2012 Kernel Summit

The ARM big.LITTLE architecture is an asymmetric multi-processor platform, with powerful and power-hungry processors coupled with less-powerful (in both senses) CPUs using the same instruction set. Big.LITTLE presents some challenges for the Linux scheduler. Paul McKenney gave a readout of the status of big.LITTLE support at the ARM minisummit, one that he intended mostly as an "advertisement" for the scheduling micro-conference at the Linux Plumbers Conference that started the next day.

The idea behind big.LITTLE is to do frequency and voltage scaling by other means, he said. Because of limitations imposed by physics, there is a floor to frequency and voltage scaling on any given processor, but that can be worked around by adding another processor with fewer transistors. That's what has been done with big.LITTLE.

There are basically two ways to expose the big.LITTLE system to Linux. The first is to treat each pair as a single CPU, switching between them "almost transparently". That has the advantage that it requires almost no changes to the kernel and applications don't know that anything has changed. But, there is a delay involved in making the switch, which isn't taken into account by the power management code, so the power savings aren't as large as they could be. In addition, that approach requires paired CPUs (i.e. one of each size), but some vendors are interested in having one little and many big CPUs in their big.LITTLE systems.

The other way to handle big.LITTLE is to expose all of the processors to Linux, so that the scheduler can choose where to run its tasks. That requires more knowledge of the behavior of processes, so Paul Turner has a patch set that gathers that kind of information. Turner said that the scheduler currently takes averages on a per-CPU basis, but when processes move between CPUs, some information is lost. His changes cause the load average to move with the processes, which will allow the scheduler to make better decisions.

Turner's patches are on their third revision, and have been "baking on our systems at Google" for a few months. There are no real to-dos outstanding, he said. Peter Zijlstra said that he had wanted to merge the previous revision, but that there was "some funky math" in the patches, which has since been changed. Turner said that he measured a 3-4% performance increase using the patches, which means we get "more accurate tracking at lower cost". It seems likely that the patches will be merged soon.
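In very rough terms, the tracked quantity is a geometrically decaying sum of the task's own recent runnable time, so its history travels with the task across migrations. The following is an illustration of that idea only, not the scheduler's actual code, and the decay constants are made up:

    #include <stdint.h>

    #define DECAY_NUM 46    /* roughly 0.98 of the history survives ...   */
    #define DECAY_DEN 47    /* ... each accounting period                 */

    struct task_load {
        uint64_t runnable_sum;   /* decayed sum of recent runnable time */
    };

    /* Called once per accounting period with the runnable time (in
       microseconds, say) that the task accumulated during that period.
       Old history decays geometrically, so recent behavior dominates. */
    static void update_task_load(struct task_load *tl, uint64_t runnable_us)
    {
        tl->runnable_sum = tl->runnable_sum * DECAY_NUM / DECAY_DEN
                           + runnable_us;
    }

    /* On migration the structure simply moves with the task: the source
       CPU's runqueue subtracts tl->runnable_sum from its total and the
       destination CPU's runqueue adds it, so no history is lost. */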

McKenney said that Turner's patches have been adapted by Morten Rasmussen to be used on big.LITTLE systems. The measurements are used to try to determine where a task should be run. Over time, though, the task's behavior can change, so the scheduler checks to see if that has happened and if the placement still makes sense. There are still questions about when "race to idle" versus spreading tasks around makes the most sense, and there have been some related discussions of that recently on the linux-kernel mailing list.

Currently, the CPU hotplug support is less than ideal for removing CPUs that have gone idle. But Thomas Gleixner is reworking things to "make hotplug suck less", McKenney said. For heavy workloads, the process of offlining a processor can take multiple seconds. After Gleixner's rework, that drops to 300ms for an order of magnitude decrease. Part of the solution is to remove stop_machine() calls from the offlining path. There are multiple reasons for making hotplug work better, McKenney said, including improving read-copy update (RCU), reducing realtime disruption, and providing a low-cost way to clear things off of a CPU for a short time. He also noted that it is not an ARM-only problem that is being solved here, as x86 suffers from significant hotplug delays too.

The session finished up with a brief discussion of how to describe the architecture of a big.LITTLE system to the kernel. Currently, each platform has its own way of describing the processors and caches in its header files, but a more general way, perhaps using device tree or some kind of runtime detection mechanism, is desired.

Comments (none posted)

KS2012: ARM: DMA issues

By Jake Edge
September 6, 2012
2012 Kernel Summit

Generic DMA engines are present in many ARM platforms to enable devices to move data between main memory and device-specific regions. Arnd Bergmann led a discussion about the DMA engine APIs as part of the last day of the ARM minisummit. DMA is the last ARM subsystem that does not have generic device tree bindings, he said, so he hoped the assembled developers could agree on some. Without those bindings, the code that uses DMA is forced to be platform-specific, which impedes progress toward the goal of building a single kernel image for multiple ARM platforms.

Bergmann said that there are many things currently blocked by the lack of device tree bindings for DMA. Those bindings need to describe the kinds of DMA channels available in the hardware, along with their attributes. Two proposals have been made to add support for the generic DMA engines. Jon Hunter has a patch set that implements a particular set of bindings, but he couldn't attend the meeting, so Bergmann presented them. The other patches were from DMA engine maintainer Vinod Koul.

The differences between the two are a bit hard to decipher. Both approaches attempt to keep any information about how to set up DMA channels from both the device driver using them and from the DMA engine driver that provides them. That knowledge would reside in the DMA engine core. With Koul's patches, there would be a global lookup table that would be populated by the platform-specific code from various sources (device tree, ACPI, etc.). That table would list the connections between devices and DMA engine drivers. Hunter's patches solve the problem simply for the device tree case, without requiring interaction with the platform-specific code.
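As a sketch of what such a lookup table might look like (all names and types here are invented for illustration; they are not the dmaengine API), the platform code would register mappings like these, and the DMA core would resolve a driver's request for a named channel against them:

    #include <stddef.h>
    #include <string.h>

    /* One routing entry: which DMA controller and request line serve a
       given device's named channel. */
    struct dma_route {
        const char  *dev_name;       /* consumer device, e.g. "mmc0" */
        const char  *chan_name;      /* channel name, e.g. "tx" or "rx" */
        const char  *ctrl_name;      /* providing DMA controller */
        unsigned int request_line;
    };

    /* Populated by platform-specific code from device tree, ACPI, or
       board files; device drivers never see these details. */
    static const struct dma_route routes[] = {
        { "mmc0",  "rx", "dma0", 5 },
        { "mmc0",  "tx", "dma0", 6 },
        { "uart1", "rx", "dma1", 2 },
    };

    static const struct dma_route *dma_route_find(const char *dev,
                                                  const char *chan)
    {
        size_t i;

        for (i = 0; i < sizeof(routes) / sizeof(routes[0]); i++)
            if (strcmp(routes[i].dev_name, dev) == 0 &&
                strcmp(routes[i].chan_name, chan) == 0)
                return &routes[i];
        return NULL;    /* no mapping registered for this channel */
    }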

The discussion got technically quite deep, as Bergmann admitted with a grin after the session, but the upshot is that the two approaches are not completely at odds. At the end of the session, it was agreed that both patches could be merged ("more or less", Koul said). The DMA engine core would be able to find the connection in either the device tree or via the lookup table, but will use the same device driver interfaces either way. Bergmann said that he hoped to see something in the 3.7 kernel. In between those two discussions, some things about the device tree bindings were hammered out as well.

One of the first problems noted with the bindings described in Hunter's patch was the use of numerical values (derived from flag bits) to describe attributes of DMA channels. "These magic numbers are not a readability triumph", Mark Brown said. He went on to suggest adding some kind of preprocessor support to the device tree compiler (dtc), which turns the text representation into a flattened device tree binary (dtb). That would make the flags readable, Tony Lindgren said, but he wondered if such a preprocessor was "years off".

One way around the magic number problem is to use names instead, though dealing with strings in device tree is difficult, Bergmann said. Some platforms have complicated arrangements of controllers and DMA engines, he said, using an example of an MMC (memory card) controller with two channels, one of which is connected to three different DMA engines. In order to make the request API for a DMA channel relatively simple, it would make sense to name each channel, someone suggested. One problem there is that most devices (80% perhaps) either have a single channel or just one for each direction, Bergmann said. Forcing those devices to explicitly name them adds complexity.

But most were in favor of using the names. In addition to naming the channels, standardizing the property names would make it easier to scan the whole device tree for properties of interest. Allowing devices to come up with their own property names will make that impossible. Also, when new functional units that implement DMA get added to a platform, standardized names will make it easier to incorporate them into existing device trees. So, names for each of a device's channels, along with a standard set of property names, would seem to be in the cards.

This was the last non-hacking session in the ARM minisummit, which seemed to be a great success overall. Some issues that had been lingering were discussed and resolved—or at least plans to do so were made. In addition, the status of some newer features (e.g. big.LITTLE and AArch64) was presented, so that questions could be raised and answered in real time, rather than over a sometimes slow mailing list or IRC channel pipe. Beyond the discussions, both afternoons featured hacking sessions where it sounds like some real work got done.

[ I would like to thank Will Deacon and Arnd Bergmann for reviewing parts of the ARM minisummit coverage, though any remaining errors are mine, of course. ]

Comments (none posted)

Patches and updates

Kernel trees

Core kernel code

Development tools

Device drivers

Filesystems and block I/O

Memory management

Networking

Architecture-specific

Security-related

Virtualization and containers

Miscellaneous

Page editor: Jonathan Corbet

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds