Brief items
The current development kernel is 3.6-rc4, released on September 1. "Shortlog appended, as you can see it's just fairly random. I'm hoping we're entering the boring/stable part of the -rc windows, and that things won't really pick up speed just because people are getting home."
Stable updates: no stable updates have been released in the last
week, and none are in the review process as of this writing.
As every parent knows, a tidy bedroom is very different from a
messy one. The number of items in the room may be exactly the same,
but the difference between orderly and disorderly arrangements is
immediately apparent. Now imagine a house with millions of rooms,
each of which is either tidy or messy. A robot in the house can
inspect each room to see which state it is in. It can also turn a
tidy room into a messy one (by throwing things on the floor at
random) and a messy room into a tidy one (by tidying it up). This,
in essence, is how a new class of memory chip works.
— The Economist on phase-change memory
"RFC" always worries me. I read it as "Really Flakey Code"
— Andrew Morton
Sorry for the late response, was too busy drinking with other
kernel developers in San Diego and laughing at all you that are
still doing real work.
— Steven Rostedt
Yes I have now read kernel bugzilla, every open bug (and closed
over half of them). An interesting read, mysteries that Sherlock
Holmes would puzzle over, a length that wanted a good editor
urgently, an interesting line in social commentary, the odd bit of
unnecessary bad language. As a read it is however overall not well
explained or structured.
— Alan Cox
Kernel development news
The "regression testing" slot on day 1 of the 2012 Kernel Summit consisted
of presentations from Dave Jones and Mel Gorman. Dave's presentation
described his new fuzz testing tool, while Mel's was concerned with some
steps to improve benchmarking for detecting regressions.
Trinity: intelligent fuzz testing
Dave Jones talked about a testing tool that he has been working on for
the last 18 months. That tool, Trinity, is a type of
system call fuzz
tester. Dave noted that fuzz testing is nothing new, and that the Linux
community has had fuzz testing projects for around a decade. The problem is
that past fuzz testers take a fairly simplistic approach, passing random
bit patterns in the system call arguments. This suffices to find the really
simple bugs, for example, detecting that a numeric value passed to a file
descriptor argument does not correspond to a valid open file descriptor.
However, once these simple bugs are fixed, fuzz testers tend to simply
encounter the error codes (EINVAL, EBADF, and so on) that
system calls (correctly) return when they are given bad arguments.
What distinguishes Trinity is the addition of some domain-specific
intelligence. The tool includes annotations that describe the
arguments expected by each system call. For example, if a system call
expects a file descriptor argument, then rather than passing a random
number, Trinity opens a range of different types of files, and passes the
resulting descriptors to the system call. This allows fuzz testing to get
past the simplest checks performed on system call arguments, and find
deeper bugs. Annotations are available to indicate a range of argument
types, including memory addresses, pathnames, PIDs,
lengths, and so on. Using these annotations, Trinity can generate tests
that are better targeted at the argument type (for example, the Trinity web
site notes that powers of two plus or minus one are often effective for
triggering bugs associated with "length" arguments). The resulting tests
performed by Trinity are consequently more sophisticated than traditional
fuzz testers, and find new types of errors in system calls.
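The length heuristic can be illustrated with a short sketch. This is not Trinity's actual code (Trinity is written in C); it simply contrasts a naive fuzzer's random bit patterns with the boundary-value approach described above:

```python
import random

def naive_length() -> int:
    # A naive fuzzer just passes a random bit pattern, which tends
    # to be rejected early with EINVAL and the like.
    return random.getrandbits(32)

def boundary_length(max_bits: int = 32) -> int:
    # Powers of two, plus or minus one, are disproportionately
    # effective at triggering off-by-one and overflow bugs in
    # "length" argument handling.
    n = 1 << random.randrange(max_bits)
    return random.choice((n - 1, n, n + 1))
```

A generator like this reaches past the kernel's initial sanity checks far more often than uniformly random values do.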
Ted Ts'o asked whether it's possible to bias the tests performed by Trinity in favor of particular kernel subsystems. In response, Dave noted that Trinity can be directed to open the file descriptors that it uses for testing from a particular filesystem (for example, an ext4 partition).
Dave stated that Trinity is run regularly against the
linux-next tree as well as against Linus's tree. He noted that
Trinity has found
bugs in the networking code, filesystem code, and many other parts of
the kernel. One of the goals of his talk was simply to encourage other
developers to start employing Trinity to test their subsystems and
architectures. Trinity currently supports the x86, ia64, powerpc, and sparc
architectures.
Benchmarking for regressions
Mel Gorman's talk slot was mainly concerned with improving the
discovery of performance regressions. He
noted that, in the past, "we talked
about benchmarking for patches when they get merged. But there's been much
inconsistency over time." In particular, he called out the practice of writing commit changelogs that simply quote statistics produced by a particular benchmarking tool; such numbers are nearly useless for detecting regressions.
Mel would like to see more commit changelogs that provide enough
information to perform reproducible benchmarks. Leading by example, Mel
uses his own benchmarking framework, MMTests, and he has posted
historical results from kernels 2.6.32 through to 3.4. What he would like to
see is changelog entries that, in addition to giving benchmark results,
identify the benchmark framework they use and include (pointers to) the
specific configuration used with the framework. (The configuration could be
in the changelog, or if too large, it could be stored in some reasonably
stable location such as the kernel Bugzilla.)
H. Peter Anvin responded that "I hope you know how hard it
is for submitters to give us real numbers at all." But this didn't
deter Mel from reiterating his desire for sufficient information to
reproduce benchmarking tests; he noted that many regressions take a long
time to be discovered, which increases the importance of being able to
reproduce past tests.
Ted Ts'o observed that there seemed to be a need for a per-subsystem
approach to benchmarking. He then asked whether individual subsystems would
even be able to come to consensus on what would be a reasonable set of metrics, and noted that those metrics should not take too long to run (since metrics that take a long time to execute are unlikely to be executed in practice). Mel offered that, if necessary, he would volunteer to help
write configuration scripts for kernel subsystems. From there, discussion
moved into a few other related topics, without reaching any firm
resolutions. However, performance regressions are a subject of great
concern to kernel developers, and the topic of reproducible benchmarking is
one that will likely be revisited soon.
The "distributions and upstream" session of day 1 of the 2012 Kernel
Summit focused on a question enunciated by Ted Ts'o: "From an
upstream perspective, how can we better help distros?" Responding to
that question were two distributor representatives: Ben Hutchings for
Debian and Dave Jones for Fedora.
Ben Hutchings asked that, when considering merging a new feature,
kernel developers not accept the argument that "this feature
is expensive, but that's okay because we'll make it an option". He
pointed out that this argument is based on a logical fallacy, since in nearly
every case distributions will enable the option, because some
users will need it. As an example, Ben mentioned memory cgroups (memcg), which, in their initial release, imposed a significant performance cost.
A second point that Ben made was that there are still features that
distributions are adding that are not being merged upstream. As an example
from last year, he mentioned Android. As a current example, he noted the union mounts feature, which is still not
upstream. Inasmuch as keeping features such as these outside of the
mainline kernel creates more work for distributions, he would like to see
such features more actively merged.
Dave Jones made three points. The first of these was that a lot of
Kconfig help texts are "really awful". As a
consequence, distribution maintainers have to read the code in order to
work out if a feature should be enabled.
Dave's second point was that it would be useful to have an explicit list
of regressions at around the -rc3 or -rc4 point in the release
cycle. His problem is that regressions often become visible only much
later. Finally, Dave noted that Fedora sees a lot of reports from
lockdep that no other distributions seem to
see. The basic problem underlying both of these points is of course lack of
early testing, and at this point Ted Ts'o mused: "can we make it
easier for users to run the kernel-of-the-day [in particular, -rc1 and rc2
kernels] and allow them to easily fall back to a stable kernel if it
doesn't work out?" There was, however, no conclusive response in the ensuing discussion.
Returning to the general subject of Kconfig, Matthew Garrett
echoed and elaborated on one of the points made by Ben Hutchings, noting that
Kconfig is important for kernel developers (so that they can strip
down a kernel for fast builds). However, because distributors will nearly
always enable configuration options (as described above), kernel developers
need to ask themselves, "If you don't expect an option to be enabled
[by distributors], then why is the option even present?". In
passing, Andrea Arcangeli noted one of his pet irritations—one with
which most people who have ever built a kernel will be familiar. When
running make oldconfig, it is very easy to overstep as one types
Enter to accept the default "no" for most options; one suddenly
realizes that the answer to an earlier question should have been "yes". At
that point of course, there is no way to go back, and one must instead
restart from the beginning. (Your editor observes that improving this small
problem could be a nice way for a budding kernel hacker to get their hands
dirty.)
The lightning talks on day 1 of the 2012 Kernel Summit were over in, one
could say, a flash. There were just two very brief discussions.
Paul McKenney noted that a small number of read-copy update (RCU) users
have for some time requested the ability to offload RCU callbacks. Normally, RCU callbacks are
invoked on the CPU that registered them. This works well in most cases,
but it can result in unwelcome variations in the execution times of user
processes running on the same CPU. This kind of variation (also known as
operating system jitter) can be reduced by offloading the callbacks—arranging for
that CPU's RCU callbacks to be invoked on some other CPU. Paul asked if
the ability to offload RCU callbacks was of interest to others in the room. A number of developers responded in the affirmative.
Dan Carpenter noted the existence of Smatch, his static analysis tool
that detects various kinds of errors in C source code, pointing out that by
now "many of you have received emails from me". (The emails
that he referred to contained kernel patches and lists of bugs or potential
bugs in kernel code. In the summary of
his LPC 2011 presentation, Dan noted that Smatch has resulted in
hundreds of kernel patches.) Dan's main point was simply to request other
ideas from kernel developers on what checks to add to Smatch; he noted that
there is a mailing list, smatch@vger.kernel.org, to which suggestions
can be sent.
The presentation given by Fengguang Wu on day 1 of the 2012 Kernel
Summit was about testing for build and boot regressions in the Linux
kernel. In the presentation, Fengguang described the test framework that he
has established to detect and report these regressions in a more timely
fashion.
To summarize the problem that Fengguang is trying to resolve, it's
simplest to look at things from the perspective of a maintainer making
periodic kernel releases. The most obvious example is of course the
mainline tree maintained by Linus, which goes through a series of release
candidates on the way to the release of a stable kernel. The
linux-next tree maintained by Stephen Rothwell is another
example. Many other developers depend on these releases. If for some
reason, those kernel releases don't successfully build and boot, then the
daily work of other kernel developers is impaired while they resolve the
problem.
Of course, Linus and Stephen strive to ensure that these kinds of build
and boot errors don't occur: before making kernel releases, they do local
testing on their development systems, and ensure that the kernel builds,
boots, and runs for them. The problem comes in when one considers the
variety of hardware architectures and configuration options that Linux
provides. No single developer can test all combinations of architectures
and options, which means that, for some combinations, there are inevitably
build and boot errors in the mainline -rc and linux-next
releases. These sorts of regressions appear even in the final releases
performed by Linus; Fengguang noted the results found by Geert
Uytterhoeven, who reported that
(for example) in the Linux 3.4 release, his testing found around 100 build
error messages resulting from regressions. (Those figures are exaggerated
because some errors occur on obscure platforms that see less maintainer
attention. But they include a number of regressions on mainstream platforms
that have the potential to disrupt the work of many kernel developers.)
Furthermore, even when a build problem appears in a series of kernel
commits but is later fixed before a mainline -rc release, this
still creates a problem: developers performing bisects to discover the
causes of other kernel bugs will encounter the build failures during the
bisection process.
As Fengguang noted, the problem is that it takes some time for these
regressions to be detected. By that time, it may be difficult to determine
what kernel change caused the problem and who it should be reported
to. Many such reports on the kernel mailing list get no response, since it
can be hard to diagnose user-reported problems. Furthermore, the developer
responsible for the problem may have moved on to other activities and may
no longer be "hot" on the details of work that they did quite some time
ago. As a result, there is duplicated effort and lost time as the affected
developers resolve the problems themselves.
According to Fengguang, these sorts of regressions are an inevitable
part of the development process. Even the best of kernel developers may
sometimes fail to test for regressions. When such regressions occur,
the best way to ensure they are resolved is to quickly and accurately
determine the cause of the regression and promptly notify the developer who
caused the regression.
Fengguang's response to this problem is an automated system that detects these regressions and then informs kernel developers by email that
their commit X triggered bug Y. Crucially, the email reports are generated
nearly immediately (1-hour response time) after commits are merged into the
tested repositories. (For this reason, Fengguang calls his system a "0-day
kernel test" system.) Since the relevant developer is informed quickly,
it's more likely they'll be "hot" on the technical details, and able to fix
the problem quickly.
Fengguang's test framework at the Intel Open Source Technology Center
consists of a server farm that includes five build servers (three Sandy
Bridge and two Itanium systems). On these systems, kernels are built inside
chroot jails. The built kernel images are then boot tested inside over 100
KVM instances on another eight test boxes. The system builds and boots
each tested kernel configuration, on a commit-by-commit basis for a range
of kernel configurations. (The system reuses build outputs from previous
commits so as to expedite the build testing. Thus, the build time for the
first commit of an allmodconfig build is typically ten minutes,
but subsequent commits require two minutes to build on average.)
Tests are currently run against Linus's tree, linux-next, and
more than 180 trees owned by individual kernel maintainers and
developers. (Running tests against individual maintainers' trees helps
ensure that problems are fixed before they taint Linus's tree and
linux-next.) Together, these trees produce 40 new branch heads and
400 new commits on an average working day. Each day, the system build
tests 200 of the new commits. (The system allows trees to be categorized as "rebasable" or "non-rebasable". The latter are usually big subsystem trees for which the maintainers take responsibility for doing bisectability tests before publishing commits. Rebasable trees are tested on a commit-by-commit basis. For non-rebasable trees, only the branch head is built; only if that fails does the system go through the intervening commits to locate the source of the error. This is why not all 400 of the daily commits are tested.)
The current machine power allows the build test system
to test 140 kernel configurations (as well as running sparse and coccinelle) for each commit. Around
half of these configurations are randconfig, which are regenerated
each day in order to increase test coverage over time.
(randconfig builds the kernel with randomized configuration
options, so as to test unusual kernel configurations.) Most of the
built kernels are boot tested, including the randconfig ones.
Boot tests for the head commits are repeated multiple times to increase the
chance of catching less-reproducible regressions. In the end, 30,000
kernels are boot tested each day. In the process, the system catches 4
new static errors or warnings per day, and 1 boot error every second day.
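The randconfig idea itself is simple enough to sketch in a few lines; the helper below is an illustration only (real randconfig runs inside kbuild and must also respect Kconfig dependencies, which this sketch ignores):

```python
import random

def randconfig(options, seed=None):
    # Assign each boolean config option a random y/n value. Using a
    # fresh seed for each daily run grows coverage of the enormous
    # configuration space over time.
    rng = random.Random(seed)
    return {opt: rng.choice("yn") for opt in options}
```

Regenerating half of the 140 tested configurations this way each day is what lets the system keep finding build breakage in configurations no developer tests by hand.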
The response from the kernel developers in the room to this new system was extremely positive. Andrew Morton noted he'd received a number of
useful reports from the tool. "All contained good information, and
all corresponded to issues I felt should be fixed." Others echoed
Andrew's comments.
One developer in the room asked what he should do if he has a scratch
branch that is simply too broken to be tested. Fengguang replied that his
build system maintains a blacklist, and specific branches can be added to
that blacklist on request. In addition, a developer can include a line
containing the string Dont-Auto-Build in a commit message; this
causes the build system to skip testing of the whole branch.
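The skip logic amounts to a simple check against the blacklist and the commit messages on a branch. The marker string below is the one given in the session; the surrounding helper is hypothetical:

```python
SKIP_MARKER = "Dont-Auto-Build"

def branch_should_be_skipped(branch, blacklist, commit_messages):
    # A branch is skipped if it was blacklisted on request, or if any
    # commit message on it carries the skip marker.
    if branch in blacklist:
        return True
    return any(SKIP_MARKER in msg for msg in commit_messages)
```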
Many problems in the system have already been fixed as a consequence of
developer feedback: the build test system is fairly mature; the boot test
system is already reasonably usable, but has room for further
improvement. Fengguang is seeking further input from kernel developers on
how his system could be improved. In particular, he is asking kernel
developers for runtime stress and functional test scripts for their
subsystems. (Currently the boot test system runs a limited set of
tools—trinity, xfstests,
and a handful of memory management tests—for catching runtime
regressions.)
Fengguang's system has already clearly had a strong positive impact on
the day-to-day life of kernel developers. With further feedback, the system
is likely to provide even more benefit.
Anyone who has paid even slight attention to the progress of the
mainlining of the Android modifications to the Linux kernel will be aware
that the process has had its ups and downs. An initial attempt to mainline
the changes via the staging tree ended in
failure when the code was removed in kernel 2.6.33 in early
2010. Nevertheless, at the 2011 Kernel Summit, kernel developers indicated a willingness to
mainline code from Android, and starting with
Linux 3.3, various Android pieces were brought back into the staging
tree. (On the Android side this was guided by the Android Mainlining
Project.) The purpose of John Stultz's presentation on day 1 of the
2012 Kernel Summit was to review the current status of upstreaming of the
Android code and outline the work yet to be done.
John began by reviewing the progress in recent kernel releases. Linux
3.3 reintroduced a number of pieces to staging, including ashmem, binder, logger, and the low-memory killer. With the Linux 3.3 release,
it became possible to boot Android on a vanilla kernel. Linux 3.4 added
some further pieces to the staging tree and also saw a lot of cleanup of
the previously merged code. Subsequent kernels have seen further Android
code move to the staging tree, including the wakeup_source feature and the Android Gadget
driver. In addition, some code in the staging tree has been converted
to use upstream kernel features; for example, Android's alarm-dev
feature was converted to use the alarm timers
feature added to Linux in kernel 3.0.
As of now (i.e., after the closure of the 3.6 merge window), there
still remain some major features to merge, including the ION memory allocator. In addition, various
Android pieces still remain in the staging tree (for example, the
low-memory killer, ashmem, binder, and logger), and these need to be
reworked (or replaced), so that the equivalent functionality is provided
in the mainline kernel. However, one has the impression that these
technical issues will all be solved, since there's been a general
improvement in relations on both sides of the Android/upstream fence; John
noted that these days there is much less friction between the two sides,
more Android developers are participating in the Linux community, and the
Linux community seems more accepting of Android as a
project. Nevertheless, John noted a few things that could still be
improved on the Android side. In particular, while the Android developers long provided updated code branches for each kernel release, more recently they have skipped doing so for some releases.
Following John's presentation, there was relatively little discussion,
which perhaps indicates that kernel developers are
reasonably satisfied with the current status and momentum of Android
upstreaming. Matthew Garrett asked if John has any feeling about whether
other projects are making use of the upstreamed Android code. In response,
John noted that Android code is being used as the default Board Support
Package for some projects, such as Firefox OS. He also
mentioned that the volatile ranges code
that he is currently developing has a number of potential uses outside of
Android.
Matthew was also curious to know whether there is anything that the Linux
kernel developers could do to help make the design process for features
that are going into Android more open. Right now, most Android features are
developed in-house, but perhaps a more openly developed solution might have
satisfied other users' requirements. There was some back and forth as to
how practical any other kind of model would be, especially given the focus
of vendors on product deadlines; the implicit conclusion was that
anything other than the status quo was unlikely.
Overall, the current status of Android upstreaming is very
positive, and certainly rather different from the situation a couple of
years ago.
From several accounts, day one of this year's Kernel Summit was largely
argument-free. There were plenty of discussions, even minor disagreements, but
nothing approaching some of the battles of yore. Day three looked like it
might provide an exception to that pattern with a discussion of two
different patch sets that are both targeted at cryptographically signing
kernel modules. In the end, though, the pattern continued, with an
interesting, but tame, session.
Kernel modules are inserted into the running kernel, so a rogue module
could be used to compromise the kernel in ways that are hard to detect.
One way to prevent that from happening is to require that kernel modules be
cryptographically signed using keys that are explicitly allowed by the
administrator. Before loading the module, the kernel can check the
signature and refuse to load any that can't be verified. Those modules
could come from a distribution or be built
with a custom kernel. Since modules can be loaded based on a user action
(e.g. attaching a device or using a new network protocol) or come from a
third-party (e.g. binary kernel modules), ensuring that
only approved modules can be loaded is a commonly requested feature.
Rusty Russell, who maintains the kernel module subsystem, called the
meeting to try to determine how to proceed on
module signing. David Howells has one patch
set that is based on what has
been in RHEL for some time, while Dmitry Kasatkin posted another that uses the digital signature
support added to the kernel for integrity management. Howells's patches
have been
around, in various forms, since 2004, while Kasatkin's are relatively new.
Russell prefaced the discussion with an admonishment that he was not
interested in discussing the "politics, ethics, or morality" of module
signing. He
invited anyone who did want to debate those topics to a meeting at 8pm,
which was shortly after he had to leave for his plane. The reason we will
be signing modules, he said, is because Linus Torvalds wants to be able to
sign his modules.
Kasatkin's approach would put the module signature in the extended attributes
(xattrs) of the module file, Russell began, but Kasatkin said that choice
was only a convenience. His patches are now independent of the integrity
measurement architecture (IMA) and the extended verification module (EVM),
both of which use xattrs. He originally used xattrs because of the IMA/EVM
origin of the signature code he is using, and he did not want to
change the module contents. Since then, having noted a response from Russell to Howells's approach, he has changed his patches to append the signature to the end of the file.
That led Russell into a bit of a historical journey. The original patches
from Howells put the signature into an ELF section in the module file.
But, because there was interest in having the same signature on both
stripped and unstripped module files, there was a need to skip over some
parts of the module file when calculating the hash that goes into the
signature.
The amount of code needed to parse ELF was "concerning", Russell said.
Currently, there are some simple sanity checks in the module-loading code,
without any checks for malicious code because the belief was that you had
to be root to load a module. While that is still true, the advent of
things like secure boot and IMA/EVM has made checking for malicious code
a priority. But Russell wants to ensure that the code doing that checking
is as simple as possible to verify, which was not true when putting module
signatures into ELF sections.
Greg Kroah-Hartman pointed out that you have to do ELF parsing to load the
module anyway. There is a difference, though. If the module is being
checked for maliciousness, that parsing happens after the signature
is checked. Any parsing that is done before that verification is
potentially handling
untrusted input.
Russell would rather see the signature appended to the module file in some
form. It could be a fixed-length signature block, as suggested by
Torvalds, or there could be some kind of "magic string" followed by a
signature. That would allow for multiple signatures on a module. Another
suggestion was to change the load_module() system call so that the
signature was passed in, which would "punt" the problem to user space "that
I don't maintain anymore", Russell said.
Russell's suggestion was to just do a simple backward search from the end
of the module file to find the magic string, but Howells was not happy with
that approach for performance reasons. Instead, Howells added a 5-digit
ASCII number for the length of the signature, which Russell found a bit
inelegant. Looking for the magic string "doesn't take that long", he said,
and module loading is not that performance-critical.
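The two formats under discussion can be sketched in a few lines. The marker string, field width, and helper names below are illustrative, not the actual kernel code:

```python
MAGIC = b"~Module signature appended~\n"

def sign_with_magic(module: bytes, sig: bytes) -> bytes:
    # Russell's preference: append the signature after a magic string,
    # found later by searching backward from the end of the file.
    return module + MAGIC + sig

def extract_with_magic(blob: bytes):
    pos = blob.rfind(MAGIC)
    if pos < 0:
        return blob, None          # unsigned module
    return blob[:pos], blob[pos + len(MAGIC):]

def sign_with_length(module: bytes, sig: bytes) -> bytes:
    # Howells's variant: a fixed-width ASCII length field avoids the
    # backward search, at some cost in elegance.
    return module + sig + b"%05d" % len(sig)

def extract_with_length(blob: bytes):
    siglen = int(blob[-5:])
    return blob[:-5 - siglen], blob[-5 - siglen:-5]
```

Either way, the verification code never has to parse ELF sections before checking the signature, which was Russell's central requirement.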
There were murmurs of discontent in the room about that last statement.
There are those who are very sensitive about module loading times because
it impacts boot speed. But, Russell said that he could live with ASCII
numbers, as long as there was no need to parse ELF sections in the
verification code. He does like the fact that modules can be signed in the
shell, which is the reason behind the ASCII length value.
There are Red Hat customers asking for SHA-512 digests signed with 4K RSA
keys, Howells said, but that may change down the road. That could
make picking a size for a fixed-length signature block difficult. But, as
Ted Ts'o pointed out, doing a search for the magic string is in the noise
in comparison to doing RSA with 4K keys. The kernel crypto subsystem can
use hardware acceleration to make that faster, Howells said. But, Russell
was not convinced that the performance impact of searching for the magic
string was
significant and would like to see some numbers.
James Bottomley asked where the keys for signing would come from. Howells
responded that the kernel build process can create a key. The public part
would go into the kernel for verification purposes, while the private part
would be used for signing. After the signing is done, that ephemeral
private key
could be discarded. There is also the option to specify a key pair to use.
Torvalds said that it was "stupid" to have stripped modules with the same
signature as the unstripped versions. The build process should just
generate signatures for both. Having logic to skip over various pieces of
the module just adds a new attack point. Another alternative is to only
generate signatures for the stripped modules as the others are only used
for debugging and aren't loaded anyway, so they can be unsigned, he said.
Russell agreed, suggesting that the build process could just call out to
something to do the signing.
For binary modules, such as the NVIDIA graphics drivers, users would have
to add the NVIDIA public key to the kernel keyring, Peter Jones said.
Kees
Cook brought up an issue that is, currently at least, specific to Chrome
OS. In Chrome OS, there is a trusted root partition, so knowing the origin
of a module would allow those systems to make decisions about whether or
not to load them. Right now, the interface doesn't provide that
information, so Cook suggested changing the load_module() system call (or adding a new
one) that passed a file descriptor for the module file. Russell agreed
that an additional
interface was probably in order to solve that problem.
In the end, Russell concluded that there was a reasonable amount of
agreement about how to approach module signing. He planned to look at the
two patch sets, try to find the commonality
between the two, and "apply something". In fact, he made a proposal, based partly on Howells's approach, on
September 4. It appends the signature to the module file after a magic
string as Russell has been advocating. As he said when wrapping up the
discussion, his patch can
provide a starting point to solving this longstanding problem.
Catalin Marinas led a discussion of kernel support for 64-bit ARM
processors as part of day two of the ARM minisummit. He concentrated
on
the status of the in-flight patches to add that support, while pointing to
his LinuxCon
talk later in the week for more details about the architecture
itself.
A second round of the ARM-64 patches was posted to the linux-kernel
mailing list in mid-August. After some complaints about the "aarch64" name
for the architecture, it was changed to "arm64", at least for the kernel
source directory. That name will really
only be seen by kernel developers as uname will still report
"aarch64", in keeping with the ELF triplet used by the binaries built with GCC.
Some of the lessons learned from the ARM 32-bit support have been reflected
in arm64. It will target a single kernel image by default, for
example. That means that device tree support is mandatory for AArch64
platforms. Since there are not, as yet, any AArch64 platforms, the patches
contain simplified platform code based on that of the Versatile Express.
There are two targets for AArch64 devices: embedded and server. It is
possible that ACPI support will be required for the servers. As far as
Marinas knows, there is no ACPI implementation out there, but it is not
clear what Microsoft is doing in that area.
The code for generic timers and the generic interrupt controller (GIC)
lives under the drivers directory. That code could be shared with
arch/arm, but there is a need to #ifdef the inline assembly
code.
There is an intent to push back on the system-on-a-chip (SoC) vendors
regarding things like firmware initialization, boot protocol, and a
standardized secure mode API. SoC vendors (and thus, their
ARM sub-trees) should be providing the standard interfaces, rather than
heading out on their own. The ARM maintainers can choose not to accept
ports that do not conform.
That may work for devices targeted at Linux, but there may be SoC vendors
who initially target another operating system, as Olof Johansson noted.
There will
likely need to be some give and take for things such as the boot protocol when
devices targeted at Windows, iOS, or OS X are submitted. Marinas said
that the aim would be for standardization, but they "may have to cope" with
other choices at times.
The first code from SoC vendors is not expected before the end of the year,
Marinas said. Arnd Bergmann half-jokingly suggested that he would be happy
to get a
leaked version of that code at any time. The first SoCs might well just be
existing 32-bit ARMv7 SoCs with an AArch64 CPU (aka ARMv8) dropped in.
That may be the path for embedded applications, though the vendors
targeting the server market are likely to be starting from scratch.
That led to a discussion of how to push the arm64 patches
forward. Marinas would like to push the core architecture code forward,
while working to clean up the example SoC code. He would like to target
the 3.8 kernel for the core.
Bergmann was strongly in favor of getting it all into linux-next
soon, and targeting a merge for the 3.7 development cycle.
Marinas is concerned that including the SoC code will delay inclusion as it
will require more review. He also wants to make sure that there is a clean
base for those who want to use it
as a basis for their own SoC code. That should take two weeks or so,
Marinas said. He hopes to get it into linux-next sometime after 3.7-rc1,
but Bergmann encouraged a faster approach. There is nothing very risky
about doing so, Johansson pointed out, as a new architecture cannot break
any existing code.
There is some concern about the 2MB limit on device tree binary (dtb)
files, because some network controllers (and other devices) may have
firmware blobs larger than that. Bergmann noted that those blobs might not
be shippable in the kernel itself, but could be placed into firmware and
loaded from there. It turns out that the flattened device tree format
already has a length entry in its header that can be used to support
multiple dtbs, which will allow the 2MB limit to be worked around.
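The length entry in question is the totalsize field of the standard flattened device tree header. A sketch of how that field lets multiple concatenated dtbs share one image follows; the struct mirrors the first two fields of the published FDT format, while the walking helper is an illustration rather than kernel code:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define FDT_MAGIC 0xd00dfeedU   /* stored big-endian in the blob */

/* First two fields of the standard flattened device tree header. */
struct fdt_header {
    uint32_t magic;      /* big-endian 0xd00dfeed */
    uint32_t totalsize;  /* big-endian total blob length in bytes */
};

static uint32_t be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | p[3];
}

/*
 * Given a buffer holding one or more concatenated dtbs, return the
 * offset of the blob following the one at 'off', or -1 if the magic
 * number does not match.  No extra framing is needed: each header's
 * totalsize says where the next blob begins.
 */
static long next_dtb(const uint8_t *buf, size_t len, size_t off)
{
    if (off + 8 > len || be32(buf + off) != FDT_MAGIC)
        return -1;
    return (long)(off + be32(buf + off + 4));
}
```

A loader can thus iterate over an arbitrarily long chain of dtbs, sidestepping any fixed per-file size limit.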
The existing arm64 emulation does not have any DMA, so support for
that feature is currently untested. In addition, some SoCs are likely to
only support 32-bit DMA. Bergmann suggested an architecture-independent
implementation that used dma_ops pointers to provide both coherent
and non-coherent versions, but Marinas would like to do something simpler
(i.e. coherent only) to start with. Since the "hardware" currently lacks
DMA, "all DMA is coherent" seems like a reasonable model, Bergmann said.
Since no one will be affected by any bugs in the code, he suggested getting
it into linux-next as soon as possible.
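The dma_ops idea amounts to a per-device table of function pointers, so generic code never tests a coherency flag itself. A hypothetical sketch of that dispatch is below; the names are invented for illustration and are not the kernel's actual dma_map_ops interface:

```c
#include <assert.h>
#include <stddef.h>

/* Per-device DMA operations; each device points at one table. */
struct dma_ops {
    /* Returns nonzero if CPU caches must be synchronized before
     * the device may safely see the buffer. */
    int (*needs_sync)(void);
};

static int coherent_needs_sync(void)    { return 0; }
static int noncoherent_needs_sync(void) { return 1; }

static const struct dma_ops coherent_ops    = { coherent_needs_sync };
static const struct dma_ops noncoherent_ops = { noncoherent_needs_sync };

struct device {
    const struct dma_ops *ops;
};

/* Generic code calls through the table rather than branching on a flag. */
static int dma_needs_sync(const struct device *dev)
{
    return dev->ops->needs_sync();
}
```

Starting with only the coherent table, as Marinas preferred, costs nothing later: a non-coherent implementation can be added without touching any callers.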
Tony Lindgren asked if ARM maintainer Russell King had any comments on the
patches. Marinas said that there were not many, at least so far. Bergmann
said that he didn't think King was convinced that having a separate
arm64 directory (as opposed to adding 64-bit support to the
existing arm directory) was the right approach.
Many of the decisions were made for ARM 15 years ago, Marinas said, and
some of those make it messy to drop arm64 on top of arm.
Some day, when the arm tree only supports ARMv7, it may make sense
to merge with arm64. The assembly code cannot be shared, because
they are two different architectures, Bergmann said. In addition, the
system calls cannot be shared and the platform code is going to be done
very differently for arm64, he said.
But, there is room for sharing some things between the two trees, Marinas
said. That includes some of the device tree files, perf, the generic
timer, the GIC driver code, as well as KVM and Xen if and when they are
merged. In theory, the ptrace() and signal-handling code could be
shared as well.
Progress is clearly being made for arm64, and we will have to wait
and see how quickly it can make its way into the mainline.
The ARM
big.LITTLE architecture
is an asymmetric multi-processor platform, with powerful and
power-hungry processors coupled with less-powerful (in both senses) CPUs using
the same instruction set. Big.LITTLE presents some challenges for the Linux
scheduler. Paul McKenney gave a status report on
big.LITTLE support at the ARM minisummit, one that was really meant to serve as an
"advertisement" for
the scheduling micro-conference at the Linux Plumbers Conference that
started the next day.
The idea behind big.LITTLE is to do frequency and voltage scaling by other
means, he said. Because of limitations imposed by physics, there is a floor to
frequency and voltage scaling on any given processor, but that can be
worked around by adding another
processor with fewer transistors. That's what has been done with big.LITTLE.
There are basically two ways to expose the big.LITTLE system to Linux. The
first is to treat each pair as a single CPU, switching between them "almost
transparently". That has the advantage that
it requires almost no changes to the kernel and
applications don't know that anything has changed. But, there is a delay
involved in making the switch, which isn't taken into account by the power
management code, so the power savings aren't as large as they could be. In
addition, that approach requires paired CPUs (i.e. one of each size), but some
vendors are interested in having one little and many big CPUs in their
big.LITTLE systems.
The other way to handle big.LITTLE is to expose all of the processors to
Linux, so that the scheduler can choose where to run its tasks. That
requires more knowledge of the behavior of processes, so Paul Turner has a
patch set that gathers that kind of
information. Turner said
that the scheduler currently takes averages on a per-CPU basis, but when
processes move between CPUs, some information is lost. His changes cause
the load average to move with the processes, which will allow the scheduler
to make better decisions.
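The core of this per-entity load tracking is a decaying average stored with each task rather than with each run queue, so the history survives a migration. A simplified sketch of the idea follows; the decay constant mirrors the kernel's choice of y with y^32 = 1/2, but the plain floating-point arithmetic here is a stand-in for the patch set's fixed-point code:

```c
#include <assert.h>

/* Decay factor chosen so a period's contribution halves every
 * 32 periods: decay ~= 2^(-1/32). */
static const double decay = 0.978572;

struct task_load {
    double avg;  /* decayed count of recently-runnable periods */
};

/* Account one scheduling period: age the history, then add the
 * period if the task was runnable during it. */
static void update_load(struct task_load *t, int runnable)
{
    t->avg = t->avg * decay + (runnable ? 1.0 : 0.0);
}
```

Because the average lives in the task structure, moving the task to another CPU moves its history along with it; the destination run queue simply adds the task's contribution to its own sum.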
Turner's patches are on their third revision, and have been "baking on our
systems at Google" for a few months. There are no real to-dos outstanding,
he said. Peter Zijlstra said that he had wanted to merge the previous
revision, but that there was "some funky math" in the patches, which has
since been changed. Turner said that he measured a 3-4% performance
increase using the patches, which means we get "more accurate tracking at
lower cost". It seems likely that the patches will be merged soon.
McKenney said that Turner's patches have been adapted by Morten Rasmussen
to be used on
big.LITTLE systems. The measurements are used to try to determine where a
task should be run. Over time, though, the task's behavior can change, so
the scheduler checks to see if that has happened and if the placement still
makes sense. There are still questions about when "race to idle" versus
spreading tasks around makes the most sense, and there have been some
related discussions of that recently on the linux-kernel mailing list.
Currently, the CPU hotplug support is less than ideal for removing CPUs
that have gone idle. But Thomas Gleixner is reworking things to "make
hotplug suck less", McKenney said. For heavy workloads, the process of
offlining a processor can take multiple seconds. After Gleixner's rework,
that drops to 300ms, an order of magnitude decrease. Part of the
solution is to remove stop_machine() calls from the offlining
path. There are multiple reasons for making hotplug work better, McKenney
said, including improving read-copy update (RCU), reducing realtime
disruption, and providing a low-cost
way to clear things off of a CPU for a short time. He also noted that it
is not an ARM-only problem that is being solved here, as x86 suffers from
significant hotplug delays too.
The session finished up with a brief discussion of how to describe the
architecture of a big.LITTLE system to the kernel. Currently, each
platform has its own way of describing the processors and caches in its
header files, but a more general way, perhaps using device tree or some
kind of runtime detection mechanism, is desired.
Generic DMA engines are present in many ARM platforms to enable devices to
move data between main memory and device-specific regions. Arnd Bergmann led
a discussion about the DMA engine APIs as part of the last day of the ARM
minisummit. DMA is the last ARM subsystem that
does not have generic device tree bindings, he said, so he hoped the
assembled developers could agree on some. Without those bindings, the code
that uses DMA is forced to be platform-specific, which impedes progress
toward the goal of building a single kernel image for multiple ARM platforms.
Bergmann said that there are many things currently blocked by the lack of
device tree bindings for DMA. Those bindings need to describe the kinds of
DMA channels available in the hardware, along with their attributes. Two
proposals have been made to add support for the generic DMA engines. Jon
Hunter has a patch set that implements a particular set of bindings, but he
couldn't attend the meeting, so Bergmann presented them. The
other patches were from DMA engine maintainer Vinod Koul.
The differences between the two are a bit hard to decipher.
Both approaches attempt to keep any information about how to set up DMA
channels from both the device driver using them and from the DMA engine
driver that provides them. That knowledge would reside in the DMA engine
core. With Koul's patches, there would be a global lookup table that would
be populated by the platform-specific code from various sources (device
tree, ACPI, etc.). That table would list the
connections between devices and DMA engine drivers. Hunter's patches solve
the problem simply for the device tree case, without requiring interaction
with the platform-specific code.
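Koul's global lookup table can be pictured as a mapping from a (device, channel name) pair to the DMA engine providing that channel, filled in by platform code from whatever source it has. A hypothetical sketch follows; all names here are invented for illustration:

```c
#include <assert.h>
#include <string.h>
#include <stddef.h>

/* One entry in the platform-populated lookup table. */
struct dma_map_entry {
    const char *dev_name;   /* requesting device */
    const char *chan_name;  /* e.g. "rx" or "tx" */
    const char *engine;     /* DMA engine driver providing it */
};

/* Populated by platform code from device tree, ACPI, board files... */
static const struct dma_map_entry dma_map[] = {
    { "mmc0",  "rx", "sdma"  },
    { "mmc0",  "tx", "sdma"  },
    { "uart1", "rx", "pl330" },
};

/* Core-side lookup: a driver asks by name and never sees the table. */
static const char *dma_find_engine(const char *dev, const char *chan)
{
    for (size_t i = 0; i < sizeof(dma_map) / sizeof(dma_map[0]); i++)
        if (!strcmp(dma_map[i].dev_name, dev) &&
            !strcmp(dma_map[i].chan_name, chan))
            return dma_map[i].engine;
    return NULL;
}
```

The device driver interface stays the same no matter how the table was filled, which is what makes the two proposals compatible.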
The discussion got technically quite deep, as Bergmann admitted with a grin
after the session, but the upshot is that the two approaches are not
completely at odds. At the end of the session, it was agreed that both
patches could be merged ("more or less", Koul said). The DMA engine core
would be able to find the connection in either the device tree or via
the lookup table, but will use the same device driver interfaces either way.
Bergmann said that he
hoped to see something in the 3.7 kernel. In the course of those discussions, some
details of the device tree bindings were hammered out as well.
One of the first problems noted with the bindings described in Hunter's
patch was the use of numerical values (derived from flag bits) to describe
attributes of DMA channels. "These magic numbers are not a readability
triumph", Mark Brown said. He went on to suggest adding some kind of
preprocessor support to the device tree compiler (dtc), which turns the
text representation into a flattened device tree binary (dtb). That
would make the flags readable, Tony Lindgren said, but he wondered if such
a preprocessor was "years off".
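Brown's complaint is easy to see in a device tree fragment. The binding below is hypothetical (the property layout, request numbers, and constant names are all invented for illustration), contrasting raw flag bits with what preprocessor support in dtc could allow:

```dts
/* Raw numbers, as in the proposed bindings: the reader must consult
 * the binding document to learn what 11 and 0x43 mean. */
mmc0: mmc@12340000 {
        dmas = <&sdma 11 0x43>;
        dma-names = "rx";
};

/* With preprocessor support in dtc, the same property could use
 * named constants instead: */
mmc1: mmc@12350000 {
        dmas = <&sdma SDMA_REQ_MMC_RX SDMA_FLAGS_DEFAULT>;
        dma-names = "rx";
};
```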
One way around the magic number problem is to use names instead, though
dealing with strings in device tree is difficult, Bergmann said. Some
platforms have complicated arrangements of controllers and DMA engines,
he said, using an example of an MMC (memory card) controller with two
channels, one of
which is connected to three different DMA engines. In order to make the
request API for a DMA channel
relatively simple, it would make sense to name each channel, someone
suggested. One problem there is that most devices (80% perhaps) either
have a single channel or just one for each direction, Bergmann said.
Forcing those devices to explicitly name them adds complexity.
But most were in favor of using the names. In addition to naming the
channels, standardizing the property names would make it easier to scan the
whole device tree for
properties of interest. Allowing devices to come up with their own
property names will make that impossible. Also, when new functional units that
implement DMA get
added to a platform, standardized names will make it easier to incorporate
them into
existing device trees. So, names for each of a device's channels, along
with a standard set of property names, would seem to be in the cards.
This was the last non-hacking session in the ARM minisummit, which seemed
to be a great success overall. Some issues that
had been lingering were discussed and resolved—or at least plans
to do so were made. In addition, the status of some newer features
(e.g. big.LITTLE and
AArch64) was presented, so that questions could be raised and answered in
real time, rather than over a sometimes slow mailing list or IRC
channel. Beyond the discussions, both afternoons featured hacking sessions
where it sounds like some real work got done.
[ I would like to thank Will Deacon and Arnd Bergmann for reviewing parts
of the ARM minisummit coverage, though any remaining errors are mine, of
course. ]
Page editor: Jonathan Corbet