Brief items
The current development kernel remains 3.2-rc2; no 3.2 prepatches
have been released in the last week.
Stable updates: the 3.0.10 and 3.1.2 stable kernel updates were released on
November 21.
They both include another set of important fixes, though this particular
update is smaller than many.
The 2.6.32.49, 3.0.11, and 3.1.3 stable updates are in the review
process as of this writing; they can be expected on or after
November 25.
Merging code in-flight, just because I can. What timezone should I
use?
-- Linus Torvalds (thanks to Cesar Eduardo Barros)
Fedora is now getting so weird and dependant on magic initrds it's
become pretty unusable for kernel development work nowdays IMHO
because it's undebuggable.
-- Alan Cox
...our goal remains to provide a desktop which by default works for
the 99%, but that we also think that those we do like to tweak
their desktops should have an easy method to do so.
Note my wording: 99% excludes kernel hackers. Not sure if we should
say something about that explicitly or not.
-- Olav Vitters; prepare for "Occupy LKML" in the near future
Kernel development news
By Jonathan Corbet
November 22, 2011
As a community, we are highly concerned with the quality of our code.
Kernel code is reviewed for functionality, long-term maintainability,
documentation, and more. Driver code is not always reviewed to the same
degree, but it can be just as important - if our drivers do not work, our
kernel does not work. There is an aspect to the long-term maintainability
of drivers that could use more attention: the degree to which a driver
documents how its hardware works.
One might argue that the job of documenting the hardware falls on whoever
writes the associated datasheet. There is some truth to that claim, but,
in many cases, only the original author of the driver has access to that
datasheet. Those who come after can try to
extract documentation from the vendor or to search for clandestine copies
hosted on the net. But often the only option is to figure out
the hardware from the one source of
information that is actually available: the existing driver. If the driver
source does not help that new developer, one can argue that the original
author has fallen down on the job.
So, if a driver contains code like:
writel(0xf4ee0815, devp->regs[42]);
it is missing something important. In the absence of the datasheet, there
is no way for any other developer to have any clue of what that operation
is actually doing.
The problem is worse than that, though; datasheets often omit useful
information, obscure the truth, and lie through their teeth. The hardest
part of getting a driver to work is often the process of figuring out what
the hardware's features and special needs really are. It often seems, for
example, that the datasheet is written before the process of designing the
hardware begins. As time passes, the understanding of the problem grows,
and deadlines loom, hardware engineers start to jettison features that
cannot be made to work in time or that, in their sole and not-subject-to-appeal
opinion, can be painlessly fixed in software. Updating the datasheet to
match the actual hardware never happens.
Thoughtful driver developers
will, on discovery of the imaginary nature of a specific hardware feature,
add a comment to the driver; that way, no future maintainer has to figure
out (the hard way, involving keyboard imprints on the forehead) why the
driver does not use a specific, helpful-looking hardware capability.
Then there is the matter of "reserved" bits. There has not yet been a
datasheet written that did not contain entries like:
Somewhere, deep within the company, there will be a maximum of two
engineers who know that the document is incomplete, but that nobody had
ever gotten around to updating it. If you can corner one of those people,
you can usually get them to admit that this bit should be documented as:
A developer who cannot get their hands within range of the neck of at least
one of those hardware engineers will likely spend a lot of time figuring
out that they need to set the "make it work" bit. This effort can involve
reverse-engineering proprietary drivers or, in cases of pure desperation,
playing with random bits to see what changes. Once that bit has been
located, it is natural for the tired and frustrated developer to quietly
set the bit before heading off in a determined effort to eliminate the
memory of the entire process through the application of large amounts of
beer. A particularly forward-thinking developer might make a note on a
printed version of the datasheet for future reference.
But handwritten notes are not usually helpful to the next developer who has
to work on that driver. A moment spent documenting that bit:
#define WTF_PRETTY_PLEASE 0x00020000 /* Always set this or it locks up */
may save somebody else hours of unnecessary pain.
It is tempting to think of a completed driver as being done. But driver
code, like other kernel code, is subject to ongoing change. Kernel API
changes must be dealt with, problems need to be fixed, and newer versions
of the hardware must be supported. Depending on how much beer was
involved, the original author may remember that device's peculiarities, but
those who follow will not. Everybody would be better served if the driver
not only made the hardware work but also helped the reader
understand how the hardware works.
Doing so is not usually hard. Define descriptive names for registers,
bits, and fields rather than putting in hard-coded constants. Note
features that are incompletely described, incorrectly described, or entirely
science-fictional. Comment operations that have non-obvious ordering
requirements or that do not play well together. And, in general, code with
a great deal of sympathy for the people who will have to make changes to
your work in the future. Some hardware can never be properly documented
because the relevant information is simply not available; see this 2006 article for an example. But what
information is available should be made available to others.
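As a rough illustration of that advice, here is a minimal user-space sketch in C. Everything about the "frob" device is invented for the example: the register index, the bit names other than the WTF_PRETTY_PLEASE value quoted above, and the frob_write() helper, which stands in for writel() on a real memory-mapped region. The point is only that the named constants and the comment on bit 17 tell the next developer what the magic number never could.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register map for an imaginary "frob" device. */
#define FROB_NREGS		64
#define FROB_REG_CTRL		42	/* control register (word index) */

/* Bits in FROB_REG_CTRL; names and positions are made up. */
#define FROB_CTRL_ENABLE	0x00000001u	/* start the engine */
#define FROB_CTRL_IRQ_EN	0x00000010u	/* interrupt on completion */
/*
 * The datasheet lists bit 17 as "reserved". In practice the device
 * locks up unless it is set, so always set it.
 */
#define FROB_CTRL_WTF_PRETTY_PLEASE	0x00020000u

struct frob_dev {
	uint32_t regs[FROB_NREGS];	/* stands in for the MMIO window */
};

/* Stand-in for writel(); a real driver would write to ioremapped memory. */
static void frob_write(struct frob_dev *dev, unsigned int reg, uint32_t val)
{
	assert(reg < FROB_NREGS);
	dev->regs[reg] = val;
}

/* Enable the device with completion interrupts. */
static void frob_enable(struct frob_dev *dev)
{
	frob_write(dev, FROB_REG_CTRL,
		   FROB_CTRL_ENABLE | FROB_CTRL_IRQ_EN |
		   FROB_CTRL_WTF_PRETTY_PLEASE);
}
```

Compare the body of frob_enable() with a bare write of 0x00020011 to register 42: the generated code is the same, but only one of them explains itself.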
Core kernel hackers are occasionally heard to make dismissive remarks
about driver developers and the work they do. But driver writers are often
given a difficult task involving a fair amount of detective work; they get
this task done and make our hardware work for us. Writing drivers that
adequately document the hardware is not an unreasonable thing to ask of
these developers; they have the hardware knowledge and the skills to do
it. The harder problem may be asking driver reviewers to insist
that this extra effort be made. Without pressure from reviewers, many
drivers will never enable readers to really understand what is going on.
By Jonathan Corbet
November 22, 2011
Classic x86-style processors are designed to fit into a mostly standardized
system architecture, so they all tend, in a general sense, to look alike.
One of the reasons why it is hard to make a general-purpose kernel for
embedded processors is the absence of this standardized architecture.
Embedded processors must be extensively configured, at boot time, to be able to
run the system they are connected to at all. The 3.1 kernel saw the
addition of the "pin controller" subsystem which is intended to help with
that task; enhancements are
on the way for (presumably) 3.2 as well. This article will provide a
superficial overview of how the pin controller works.
A typical system-on-chip (SOC) will have hundreds of pins (electrical
connectors) on it. Many of those pins have a well-defined purpose:
supplying power or clocks to the processor, video output, memory control,
and so on. But many of these pins - again, possibly hundreds of them -
will have no single defined purpose. Most of them can be used as
general-purpose I/O (GPIO) pins that can drive an LED, read the state of a
pushbutton, perform serial input or output, or activate an integrated pepper
spray dispenser. Some subsets of those pins can be organized into groups
to serve as an I2C port or an I2S port, or to perform any of a number of
other types of multi-signal communication. Many of the pins can be configured with a
number of different electrical characteristics.
Without a proper configuration of its pins, an SOC will not function
properly - if at all. But the right pin configuration is entirely
dependent on the board the SOC is a part of; a processor running in one
vendor's handset will
be wired quite differently than the same processor in another vendor's
cow-milking machine. Pin configuration is typically done as part of the
board-specific startup code; the system-specific nature of that code
prevents a kernel built for one device from running on another even if the
same processor is in use. Pin configuration also tends to involve a lot of
cut-and-pasted, duplicated code; that, of course, is the type of code that
the embedded developers (and the ARM developers in particular) are trying to
get rid of.
The idea behind the pin control subsystem is to create a centralized
mechanism for the management and configuration of multi-function pins,
replacing a lot of board-specific code. This subsystem is quite thoroughly
documented in Documentation/pinctrl.txt.
A core developer would use the pin control code to describe a processor's
multi-function pins and the uses to which each can be put. Developers
enabling a specific board can then use that configuration to set up the
pins as needed for their deployment.
The first step is to tell the subsystem which pins the processor provides;
that is a simple matter of enumerating their names and associating each
with an integer pin number. A call to pinctrl_register() will
make those pins known to the system as a whole. The mapping of numbers to
pins is up to the developer, but it makes sense to, for example, keep a
bank of GPIO pins together to simplify coding later on.
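The enumeration step can be pictured with a small user-space model in C. This is a sketch only: the structure, the pin names, and the helper functions are invented for illustration and are not the kernel's own definitions, and nothing here calls pinctrl_register(). It does show why keeping a GPIO bank contiguous pays off: translating a GPIO offset into a pin number becomes a single addition.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One entry per pin: a developer-chosen number plus a name. */
struct pin_desc {
	unsigned int number;
	const char *name;
};

/* The GPIO bank is kept contiguous so offsets map by simple addition. */
#define GPIO_BASE	32
#define GPIO_COUNT	4

/* Hypothetical pin table for an imaginary SOC. */
static const struct pin_desc example_pins[] = {
	{ GPIO_BASE + 0, "GPIO0" },
	{ GPIO_BASE + 1, "GPIO1" },
	{ GPIO_BASE + 2, "GPIO2" },
	{ GPIO_BASE + 3, "GPIO3" },
	{ 122, "I2C0_SDA" },
	{ 123, "I2C0_SCL" },
};
#define EXAMPLE_NPINS (sizeof(example_pins) / sizeof(example_pins[0]))

/* Translate an offset within the GPIO bank to a global pin number. */
static unsigned int gpio_to_pin(unsigned int offset)
{
	assert(offset < GPIO_COUNT);
	return GPIO_BASE + offset;
}

/* Look up a pin's name by number; returns NULL for unknown pins. */
static const char *pin_name(unsigned int number)
{
	size_t i;

	for (i = 0; i < EXAMPLE_NPINS; i++)
		if (example_pins[i].number == number)
			return example_pins[i].name;
	return NULL;
}

/* Look up a pin's number by name; returns -1 for unknown names. */
static int pin_number(const char *name)
{
	size_t i;

	for (i = 0; i < EXAMPLE_NPINS; i++)
		if (strcmp(example_pins[i].name, name) == 0)
			return (int)example_pins[i].number;
	return -1;
}
```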
One of the interesting things about multi-function pins is that many of
them can be assigned as a group to an internal functional unit. As a
simple example, one could imagine that pins 122 and 123 can be routed to an
internal I2C controller. Other types of ports may take more pins; an I2S
port to talk to a codec needs at least three, while SPI ports need four.
It is not generally possible to connect an arbitrary set of pins to any
controller; usually an internal controller has a very small number of
possible routings. These routings can also conflict with each other;
pin 77, say, could be either an I2C SCL line or an SPI SCLK line, but
it cannot serve both purposes at the same time.
The pin controller allows the developer to define "pin groups," essentially
named arrays of pins that can be assigned as a group to a controller.
Groups can (and often will) overlap each other; the pin controller will
ensure that overlapping groups cannot be selected at the same time. Groups
can be associated with "functions" describing the controllers to which they
can be attached. Some functions may have a single pin group that can be
used; others will have multiple groups.
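A pin group, in this description, is little more than a named array, and the overlap rule reduces to asking whether two arrays share an element. The following C sketch models that idea in user space; the structure layout, group names, and pin numbers (apart from the pin 77 conflict used as an example above) are assumptions made for illustration, not the subsystem's actual data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A named set of pins that can be routed to a controller as a unit. */
struct pin_group {
	const char *name;
	const unsigned int *pins;
	size_t npins;
};

/* Hypothetical routings; pin 77 is shared between I2C and SPI. */
static const unsigned int i2c0_pins[] = { 76, 77 };		/* SDA, SCL */
static const unsigned int spi0_pins[] = { 77, 78, 79, 80 };	/* SCLK, ... */
static const unsigned int i2s0_pins[] = { 90, 91, 92 };

#define GROUP(n, p) { n, p, sizeof(p) / sizeof((p)[0]) }

static const struct pin_group groups[] = {
	GROUP("i2c0_grp", i2c0_pins),
	GROUP("spi0_grp", spi0_pins),
	GROUP("i2s0_grp", i2s0_pins),
};

/* Two groups conflict if they have at least one pin in common. */
static bool groups_overlap(const struct pin_group *a,
			   const struct pin_group *b)
{
	size_t i, j;

	assert(a != NULL && b != NULL);
	for (i = 0; i < a->npins; i++)
		for (j = 0; j < b->npins; j++)
			if (a->pins[i] == b->pins[j])
				return true;
	return false;
}
```

With these tables, groups_overlap() reports a conflict between the I2C and SPI groups but not between I2C and I2S, which is the check that keeps overlapping groups from being selected together.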
There are some other bits and pieces (some glue to make the pin controller
work easily with the GPIO subsystem, for example), but the above describes
most of the functionality found in the 3.1 version of the pin controller.
Using this structure, board developers can register one or more
pinmux_map structures describing how the pins are actually wired
on the target system. That work can be done in a board file, or,
presumably, be generated from a device tree file. The pin controller will
use the mapping to ensure that no pins have been assigned to more than one
function; it will then instruct the low-level pinmux driver to configure
the pins as described. All of that work is now done in common code.
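The bookkeeping behind that guarantee can be sketched as an ownership table: each pin is either free or held by exactly one function, and a map entry is applied only if every pin it needs is free. The C model below runs in user space and uses invented names throughout (mux_map_entry, pin_state, apply_map()); it is not the pinmux_map structure or the kernel's code, just one plausible way to picture the check.

```c
#include <assert.h>
#include <stddef.h>

#define NPINS 128	/* size of the imaginary pin space */

struct pin_group {
	const char *name;
	const unsigned int *pins;
	size_t npins;
};

/* One line of a board's wiring description: function -> pin group. */
struct mux_map_entry {
	const char *function;
	const struct pin_group *group;
};

/* owner[pin] names the function holding the pin, or is NULL if free. */
struct pin_state {
	const char *owner[NPINS];
};

/* Hypothetical groups; I2C and SPI both want pin 77. */
static const unsigned int i2c0_pins[] = { 76, 77 };
static const unsigned int spi0_pins[] = { 77, 78, 79, 80 };
static const unsigned int i2s0_pins[] = { 90, 91, 92 };

static const struct pin_group i2c0_grp = { "i2c0_grp", i2c0_pins, 2 };
static const struct pin_group spi0_grp = { "spi0_grp", spi0_pins, 4 };
static const struct pin_group i2s0_grp = { "i2s0_grp", i2s0_pins, 3 };

/* Claim every pin of one entry, or none of them. Returns 0 or -1. */
static int apply_entry(struct pin_state *st, const struct mux_map_entry *e)
{
	size_t i;

	for (i = 0; i < e->group->npins; i++) {
		assert(e->group->pins[i] < NPINS);
		if (st->owner[e->group->pins[i]] != NULL)
			return -1;	/* pin already assigned elsewhere */
	}
	for (i = 0; i < e->group->npins; i++)
		st->owner[e->group->pins[i]] = e->function;
	return 0;
}

/* Apply a whole board map, stopping at the first conflicting entry. */
static int apply_map(struct pin_state *st,
		     const struct mux_map_entry *map, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (apply_entry(st, &map[i]) != 0)
			return -1;
	return 0;
}
```

A board that maps both the I2C and SPI functions onto these groups is rejected, while one that uses I2C and I2S goes through; a real implementation would then hand the accepted mapping to the low-level driver.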
The pin multiplexer on a typical SOC can do a lot more than just assign a
pin to a specific function, though. There is typically a wealth of options
for each pin. Different pins can be driven to different voltages, for
example; they can also be connected to pull-up or pull-down resistors to
bias a line to a specific value. Some pins can be configured to detect
input signal changes and generate an interrupt or a wakeup event. Others
may be able to perform debouncing. It adds up to a fair amount of
complexity which is often reflected in the board-specific setup code.
The generic pin configuration interface,
currently in its third revision, attempts to bring the details of pin
configuration into the pin controller core. To that end, it defines 17 (at
last count) parameters that might be settable on a given pin; they vary
from the value of the pullup resistor to be used through slew rates for
rising or falling signals and whether the pin can be a source of wakeup
events. With this code in place, it should become possible to describe the
complete configuration of complex pin multiplexers entirely within the pin
controller.
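One way to picture a per-pin configuration setting is as a (parameter, argument) pair. The C sketch below packs such a pair into a single 32-bit value. The parameter names, the five-entry enum, and the 8/24-bit packing are all assumptions chosen for the example; they are not taken from the patch, which defines its own list of parameters.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of per-pin parameters. */
enum pin_param {
	PARAM_BIAS_PULL_UP,		/* argument: resistor value in ohms */
	PARAM_BIAS_PULL_DOWN,		/* argument: resistor value in ohms */
	PARAM_SLEW_RATE_RISING,		/* argument: device-specific rate */
	PARAM_SLEW_RATE_FALLING,	/* argument: device-specific rate */
	PARAM_WAKEUP_ENABLE,		/* argument: 0 or 1 */
};

/* Pack a setting: parameter in the low 8 bits, argument in the rest. */
static uint32_t cfg_pack(enum pin_param param, uint32_t arg)
{
	assert(arg <= 0x00ffffffu);	/* argument must fit in 24 bits */
	return (arg << 8) | (uint32_t)param;
}

/* Extract the parameter from a packed setting. */
static enum pin_param cfg_param(uint32_t cfg)
{
	return (enum pin_param)(cfg & 0xffu);
}

/* Extract the argument from a packed setting. */
static uint32_t cfg_arg(uint32_t cfg)
{
	return cfg >> 8;
}
```

A board description could then carry an array of such values per pin, for example cfg_pack(PARAM_BIAS_PULL_UP, 4700) for a 4.7K pull-up, and a low-level driver would translate each one into register writes.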
The number of pin controller users in the 3.1 kernel is relatively small,
but there are a number of patches circulating to expand its usage. With
the addition of the configuration interface (in the 3.2 kernel, probably),
there will be even more reason to make use of it. One of the more
complicated bits of board-level configuration will be supported almost
entirely in common code, with all of the usual code quality and
maintainability benefits. It is hard to stick a pin into an improvement
like that.
By Jonathan Corbet
November 22, 2011
Caching plays an important role at almost all levels of a contemporary
operating system. Without the ability to cache frequently-used objects in
faster memory, performance suffers; the same idea holds whether one is
talking about individual cache lines in the processor's memory cache or
image data cached by a web browser. But caching requires resources; those
needs must be balanced with other demands on the same resources. In other
words, sometimes cached data must be dropped; often, overall performance
can be improved if the program doing the caching has a say in what gets
removed from the cache. A recent patch from John Stultz attempts to make
it easier for applications to offer up caches for reclamation when memory
gets tight.
John's patch takes a lot of inspiration
from the ashmem device
implemented for Android by Robert Love. But ashmem functions like a device
and performs its own memory management, which makes it hard to merge upstream.
John's patch, instead, tries to integrate things more deeply into the
kernel's own memory management subsystem. So it takes the form of a new
set of options to the posix_fadvise() system call. In particular,
an application can mark a range of pages in an open file as "volatile" with
the POSIX_FADV_VOLATILE operation. Pages that are so
marked can be discarded by the kernel if memory gets tight. Crucially,
even dirty pages can be discarded - without writeback - if they have been
marked volatile. This operation differs from POSIX_FADV_DONTNEED
in that the given pages will not (normally) be discarded right away - the
application might want the contents of volatile pages in the future,
but it will be able to recover if they disappear.
If a particular range of pages becomes useful later on, the application
should use the POSIX_FADV_NONVOLATILE operation to remove the
"volatile" marking. The return value from this operation is important: a
non-zero return from
posix_fadvise() indicates that the kernel has removed
one or more pages from the indicated range while it was marked volatile.
That is the only indication the application will get that the kernel has
accepted its offer and cleaned out some volatile pages. If those pages
have not been removed, posix_fadvise() will return zero and the
cached data will be available to the application.
There is also a POSIX_FADV_ISVOLATILE operation to query whether a
given range has been marked volatile or not.
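The calling pattern described above can be modeled in a few lines of user-space C. Since the POSIX_FADV_VOLATILE family exists only in the proposed patch, this sketch does not call posix_fadvise() at all; the vrange structure and every function here are stand-ins invented to mimic the described semantics, with reclaim() playing the part of the kernel under memory pressure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space model of one cached range; NOT the kernel interface. */
struct vrange {
	bool is_volatile;	/* application has offered the range */
	bool purged;		/* "kernel" dropped it while volatile */
};

/* Models POSIX_FADV_VOLATILE: offer the range for reclaim. */
static void mark_volatile(struct vrange *r)
{
	r->is_volatile = true;
}

/* Models POSIX_FADV_ISVOLATILE: query the marking. */
static bool is_volatile(const struct vrange *r)
{
	return r->is_volatile;
}

/* Models memory pressure: only volatile ranges may be discarded. */
static bool reclaim(struct vrange *r)
{
	if (!r->is_volatile)
		return false;
	r->purged = true;	/* contents dropped without writeback */
	return true;
}

/* Models POSIX_FADV_NONVOLATILE: nonzero means contents were lost. */
static int mark_nonvolatile(struct vrange *r)
{
	int lost = r->purged ? 1 : 0;

	r->is_volatile = false;
	r->purged = false;
	return lost;
}

/* A cache built on the model, counting how often it had to rebuild. */
struct cache {
	struct vrange range;
	int rebuilds;
};

/* Typical use: pin the range, rebuild if needed, use it, offer it back. */
static void cache_use(struct cache *c)
{
	assert(c != NULL);
	if (mark_nonvolatile(&c->range) != 0)
		c->rebuilds++;	/* data is gone; regenerate it here */
	/* ... read the cached data ... */
	mark_volatile(&c->range);
}
```

The important detail is that the return value of the "nonvolatile" step is the application's only notification; a caller that ignores it would go on to read data that is no longer there.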
Rik van Riel raised a couple of questions
about this functionality. He expressed concern that the kernel might
remove a single page of a multi-page cached object, thus wrecking the
caching while failing to reclaim all of the memory used to cache that
object. Ashmem apparently does its own memory management partially to
avoid this very situation; when an object's memory is reclaimed, all of it
will be taken back. John would apparently rather avoid adding another
least-recently-used list to the kernel, but he did respond that it might be
possible to add logic to reclaim an entire volatile range once a single
page is taken from that range.
Rik also worried about the overhead of this mechanism and proposed an
alternative that he has apparently been thinking about for a while. In
this scheme, applications would be able to open (and pass to
poll()) a special file descriptor that would receive a message
whenever the kernel finds itself short of memory. Applications would be
expected to respond by freeing whatever memory they can do without. The
mechanism has a certain kind of simplicity, but could also prove difficult
in real-world use. When an application gets a "free up some memory"
message, the first thing it will probably need to do is to fault in its
code for handling that message - an action which will require the
allocation of more memory. Marking the memory ahead of time
and freeing it directly from the kernel may turn out to be a more reliable
approach.
After the recent frontswap discussions, it
is perhaps unsurprising that nobody has dared to observe that volatile
memory ranges bear a more than passing resemblance to transcendent memory.
In particular, it looks a lot like "cleancache," which was merged in the
3.0 development cycle. There are differences: putting a page into
cleancache removes it from normal memory while volatile memory can remain
in place, and cleancache lacks a user-space interface. But the core idea
is the same: asking the system to hold some memory, but allowing that memory
to be dropped if the need arises. It may be that the two mechanisms
could be made to work together.
But, as noted above, nobody has mentioned this idea, and your editor would
certainly not be so daring.
One other question that has not been discussed is whether this code could
eventually replace ashmem, reducing the differences between the mainline
and the Android kernel. Any such replacement would not happen anytime
soon; ashmem has its own ABI that will need to be supported by Android
kernels for a long time. Over a period of years, a transition to
posix_fadvise() could possibly be made if the Android developers
were willing to do so. But first the posix_fadvise() patch will
need to get into the mainline. It is a very new patch, so it is hard to
say if or when that might happen. Its relatively non-intrusive nature and
the clear need for this capability would tend to argue in its favor,
though.
Page editor: Jonathan Corbet