Brief items
The current development kernel is 3.0-rc5, released on June 27. "Nothing
terribly exciting here. The most noteworthy thing may be that only about a
quarter of the changes are in drivers, filesystem changes actually account
for more (40%): btrfs, cifs, ext4, jbd2, nfs are all present and accounted
for." Details can be found in the full changelog.
Stable updates: the 2.6.32.42, 2.6.33.15, and
2.6.39.2 updates were released on
June 23 with a long list of fixes. The 2.6.34.10 update - the first since April -
came out on June 26 with a very long list of fixes.
As I have scratched my head for some times with rcu dynticks for
the nohz cpuset things, I guess that in the meantime my unconscious
mind developed the idea that rcu extended quiescent states were a
worldwide topic that every people talk about in dinner with their
family, that these even became the core stories of some
lullabies. I can still hear its chorus, that makes one switching to
idle peacefully...
--
Frederic Weisbecker
The V4L2 spec needs to be fixed with respect to error codes. Driver
authors are much more creative than DocBook authors.
--
Mauro Carvalho Chehab
The various POSIX_FADV_foo's are so ill-defined that it was a
mistake to ever use them. We should have done something overtly
linux-specific and given userspace more explicit and direct
pagecache control.
--
Andrew Morton
The kickoff message for the 2011 Kernel Summit planning process has been
sent out. The event will be held October 24-26 in Prague, with some tweaks
to the longstanding formula. "This year, the biggest change is that the
conference will be running three days, where the first day will be
dedicated to some kernel subsystem workshops. The second day will be
focused on development process issues and be more discussion oriented;
for this reason, it will be limited to core kernel developers picked
through a nomination and selection process which as in previous years.
The third day will be more presentation oriented (although hopefully we
will have some discussion); and all kernel summit workshop attendees
will be welcome attend the 3rd day." Now is the time for interested
folks to help shape the agenda.
By Jonathan Corbet
June 29, 2011
Users of the Video4Linux2 API know that it is a rather complicated one,
involving some 91 different ioctl() commands. The error-reporting
side of the API is much simpler, though; if something goes wrong, the
application is almost certain to get EINVAL back. That error can
be trying to tell user space that the device is in the wrong state, that
some parameter was out of range, or, simply, that the requested command has
not been implemented. Needless to say, it can be hard for developers to
figure out what is really going on.
V4L2 maintainer Mauro Carvalho Chehab recently posted a patch to change the return code to
ENOIOCTLCMD in cases where the underlying driver has not actually
implemented the requested command. That change would at least distinguish
one set of problems - except that the VFS code silently translates
ENOIOCTLCMD to EINVAL before returning to user space.
So, from the point of view of the application, nothing changes.
Interestingly, the rules for what is supposed to happen in this situation
are relatively clear: if an ioctl() command has not been
implemented, the kernel should return ENOTTY. Some parts of the
kernel follow that convention, while others don't. This is not a new or
Linux-specific problem; as Linus put it: "The EINVAL thing goes way back,
and is a disaster. It predates Linux itself, as far as I can tell."
He has suggested simply changing ENOIOCTLCMD to ENOTTY
across the kernel and seeing what happens.
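The translation step described above can be modeled in a few lines of C. This is only a user-space sketch, not the kernel's code: sketch_ioctl_result() and the use_enotty switch are hypothetical names invented for illustration, and ENOIOCTLCMD is defined locally (as 515, the value in the kernel's internal errno.h) because it is not exported to user space.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Kernel-internal error code; defined locally for this sketch only. */
#define SKETCH_ENOIOCTLCMD 515

/*
 * Hypothetical model of the final step of ioctl() dispatch: convert the
 * handler's return value into what user space will see.
 *   use_enotty == false: the existing ENOIOCTLCMD -> EINVAL translation
 *   use_enotty == true:  the suggested ENOIOCTLCMD -> ENOTTY behavior
 */
static int sketch_ioctl_result(int handler_ret, bool use_enotty)
{
	int ret = handler_ret;

	if (ret == -SKETCH_ENOIOCTLCMD)
		ret = use_enotty ? -ENOTTY : -EINVAL;

	/* The internal code must never leak out to user space. */
	assert(ret != -SKETCH_ENOIOCTLCMD);
	return ret;
}
```

With use_enotty set to false, a handler returning -ENOIOCTLCMD and one returning -EINVAL produce the same result, which is why the application cannot tell the two cases apart.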
What happens, of course, is that the user-space ABI changes. It is
entirely possible that, somewhere out there, some program depends on
getting EINVAL for a missing ioctl() function and will
break if the return code changes. There is only one way to find out for
sure: make the change and see what happens. Mauro reports that making that change within V4L2
does not seem to break things, so chances are good that the change will find
its way into 3.1. A tree-wide change could have much wider implications;
whether somebody will find the courage to try that remains to be seen.
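To make the ABI concern concrete, here is a sketch of the kind of user-space check that could break, next to a more tolerant one. Both helper names are hypothetical; they are meant to be called with the value of errno after a failed ioctl() call.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Fragile: assumes that a missing ioctl() command is always reported
 * as EINVAL. This stops working if the kernel starts returning ENOTTY.
 */
static bool fragile_is_unimplemented(int err)
{
	assert(err >= 0);	/* expects errno, not a kernel-style -errno */
	return err == EINVAL;
}

/*
 * More tolerant: accepts both the historical EINVAL and ENOTTY.
 * EINVAL remains ambiguous, since it may also mean a bad parameter
 * or a device in the wrong state.
 */
static bool tolerant_is_unimplemented(int err)
{
	assert(err >= 0);
	return err == EINVAL || err == ENOTTY;
}
```

An application written along the lines of the second helper would behave the same before and after the change; one written like the first would silently start treating unimplemented commands as some other kind of failure.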
Kernel development news
By Jonathan Corbet
June 29, 2011
Back in April, Phoronix announced with some fanfare that the 2.6.38 kernel -
and those following - had a "major" power management regression which
significantly reduced battery life on some systems. This problem has
generated a fair amount of discussion, including this Launchpad thread, but
little in the way of solutions. Phoronix now claims to have located the
change that caused the problem and has provided a workaround which will make
things better for some users. But a true fix may be a while in coming.
As a result of the high clock rates used,
PCI-Express devices can take a lot of power even when they are idle.
"Active state power management" (ASPM) was developed as a means for putting those
peripherals into a lower power state when it seems that there may be little
need for them. ASPM can save power, but the usual tradeoff applies: a
device which is in a reduced power state will not be immediately available
for use. So, on systems where ASPM is in use, access to devices can
sometimes take noticeably longer if those devices have been powered down.
In some situations (usually those involving batteries) this tradeoff may be
acceptable; in others it is not. So, like most power management
mechanisms, ASPM can be turned on or off.
It is a bit more complicated than that, though; on x86 systems, the BIOS
also gets into the act. The BIOS
is charged with the initial configuration of the system and with telling
the kernel about the functionality that is present. One bit of
information passed to the kernel by the BIOS is whether ASPM is supported
on the system. The kernel developers, reasonably, concluded that, if the
BIOS says that ASPM is not supported, they should not mess with the
associated registers. It turns out, though, that this approach didn't
quite work; thus, in December, Matthew Garrett committed a
patch described as:
We currently refuse to touch the ASPM registers if the BIOS tells
us that ASPM isn't supported. This can cause problems if the BIOS
has (for any reason) enabled ASPM on some devices anyway. Change
the code such that we explicitly clear ASPM if the FADT indicates
that ASPM isn't supported, and make sure we tidy up appropriately
on device removal in order to deal with the hotplug case.
In other words, sometimes the BIOS will tell the system that ASPM is not
supported even though ASPM support is present; for added fun, the BIOS may
enable ASPM on some devices (even though it says ASPM is not supported)
before passing control to the kernel. There
are reasons why operating system developers tend to hold BIOS developers in
low esteem.
Had Andrew Morton read the above changelog, he certainly would have
complained that "can cause problems" is a rather vague justification for a
change to the kernel. Your editor asked Matthew about the problem and got
an informative response that is worth
reading in its entirety:
If this bit is set, the platform is indicating to the OS that it
doesn't support ASPM. In the past we took that to mean that we
simply shouldn't touch the ASPM bits. However, it turns out that
there's some systems where the BIOS has enabled ASPM itself, set
the "ASPM unsupported" bit and then the hardware falls over when an
ASPM transition occurs. The most straightforward thing to assume
was that the BIOS was stupid (which is, to be fair, my default
assumption) and shouldn't have enabled ASPM. So, since that patch,
we clear the ASPM state when the BIOS indicates that the platform
doesn't support ASPM.
It's not hard to imagine that putting devices into a
state that the kernel was told should not exist might create confusion
somewhere. Some research turns up, for example, this bug
report about system hangs which are fixed by Matthew's patch. If the
BIOS says that ASPM is not supported, it would seem that ensuring that no
devices think otherwise would make sense.
That said, this patch is the one that the bisection effort at Phoronix has
fingered as the cause of the power regression.
Apparently, the notion that disabling low-power states in hardware
may lead to increased power consumption also makes sense. The workaround
suggested in the article is to boot with the pcie_aspm=force
option; that forces the system to turn on ASPM regardless of whether the
BIOS claims to support it. This workaround will undoubtedly yield better
battery life on some affected systems; others may well not work at all.
In the latter case, the system may simply lock up - a state with even worse
latency characteristics combined with surprisingly bad power use. So this
workaround may be welcomed by users who have seen their battery life decline
significantly, but it is not a proper solution to the problem.
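The behavior discussed so far can be summarized as a small decision function. This is a deliberately simplified model and not the kernel's ASPM code: the enum, the function name, and the three boolean inputs are all hypothetical, standing in for the BIOS "ASPM unsupported" bit, whether Matthew's patch is applied, and whether pcie_aspm=force was given.

```c
#include <assert.h>
#include <stdbool.h>

enum aspm_action {
	ASPM_LEAVE_ALONE,	/* do not touch the ASPM registers */
	ASPM_CLEAR,		/* explicitly disable ASPM on the device */
	ASPM_CONFIGURE,		/* program ASPM as on a supporting system */
};

/* Hypothetical model of what happens to a device's ASPM state. */
static enum aspm_action sketch_aspm_action(bool bios_says_unsupported,
					   bool after_patch, bool force)
{
	enum aspm_action action;

	if (force || !bios_says_unsupported)
		action = ASPM_CONFIGURE;	/* BIOS claim ignored or absent */
	else if (after_patch)
		action = ASPM_CLEAR;		/* 2.6.38 and later */
	else
		action = ASPM_LEAVE_ALONE;	/* older behavior */

	/* Without the patch, this model never clears ASPM. */
	assert(after_patch || action != ASPM_CLEAR);
	return action;
}
```

The regression corresponds to the move from ASPM_LEAVE_ALONE to ASPM_CLEAR on systems whose BIOS enabled ASPM while claiming not to support it; the workaround takes the first branch instead.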
Finding that proper solution - preferably one which Just Works without any need
for special boot parameters - could be tricky. Quoting Matthew again:
What alternatives are there? We could keep the status quo and add
driver whitelisting for hardware setups that are known to work. The
problem is that even where we have specifications for the hardware,
we often don't have the errata lists. We don't know for sure
whether it works or not. We could revert this patch and add more
driver blacklisting. But then we need to track down every device
that doesn't work. Or, it's possible that the original code was
correct and Linux simply programs the hardware differently,
triggering ASPM issues that aren't seen elsewhere.
Given the uncertainty in the situation, the kernel developers have reached
the conclusion that "waste a bit of power" is a lesser evil than "lock up
on some systems." In the absence of a better understanding of the problem,
any other approach would be hard to justify. So some users may have to use
the pcie_aspm=force workaround for a while yet.
Meanwhile, the power usage problem has, as far as your editor can tell,
never been raised on any kernel development mailing list. It never
appeared in the 2.6.38 regression list. So this issue was invisible to much
of the development community; it's not entirely surprising that it has not
received much in the way of attention from developers. For better or for
worse, the development community has its way of dealing with issues.
Reporting a bug to linux-kernel certainly does not guarantee that it will
get fixed, but it does improve the odds considerably. Had this issue been
brought directly to the developers involved, we might have learned about
the root cause some time ago.
By Jonathan Corbet
June 29, 2011
When one thinks of embedded systems, it may be natural to think of
extremely simple processors which are just barely able to perform the tasks
which are asked of them. That view is somewhat dated, though.
Contemporary processors are often put into settings where they are expected
to manage a variety of peripherals - cameras, signal processors, radios,
etc. - while using a minimum of power. Indeed, a reasonably capable
system-on-chip (SoC) processor likely has controllers for these peripherals
built in. The result is a processor which presents a high level of
complexity to the operating system. This article will look at a couple of
patch sets which show how the kernel is changing to deal with these
processors.
Power domains
On a desktop (or laptop) system, power management is usually a matter of
putting the entire CPU into a low-power state when the load on the system
allows. Embedded processors are a little different: as noted above, they
tend to contain a wide variety of functional units. Each of these units
can be powered down (and have its clocks turned off) when it is not needed,
while the rest of the processor continues to function. The kernel can handle the
powering down of individual subsystems now; what makes things harder is the
power dependencies between devices.
Power management was one of the motivations behind the addition of the kernel's
device model in the 2.5 development series. It does not make sense, for
example, to power down a USB controller if devices attached to that
controller remain in operation. The device model captures the connection
topology of the system; this information can be used to power devices up
and down in a reasonable order. The result was much improved power
management in the 2.6 kernel.
On newer systems, though, there are likely to be dependencies between
subsystems that are not
visible in the bus topology. A set of otherwise unrelated devices may
share the same clock or power lines, meaning that they can only be powered
up or down as a group. Different SoC designs may feature combinations of
the same controllers with different power connections. As a result,
drivers for specific controllers often cannot know whether it is safe to
power down their devices - or even how to do it. This information must be
maintained at a level separate from the device hierarchy if the system is
to be able to make reasonable power management decisions.
The answer to this problem would appear to be Rafael Wysocki's generic I/O power domains patch set. A power
domain looks like this:
    struct generic_pm_domain {
	struct dev_pm_domain domain;
	struct list_head sd_node;
	struct generic_pm_domain *parent;
	struct list_head sd_list;
	struct list_head dev_list;
	bool power_is_off;
	int (*power_off)(struct generic_pm_domain *domain);
	int (*power_on)(struct generic_pm_domain *domain);
	int (*start_device)(struct device *dev);
	int (*stop_device)(struct device *dev);
	/* Others omitted */
    };
Power domains are hierarchical, though the hierarchy may differ from the
bus hierarchy. So each power domain has a parent domain (parent),
an entry linking it into its parent's list of subdomains (sd_node), and
a list of its own child domains (sd_list); there is also, naturally, a
list of devices contained within the domain (dev_list). When the kernel is changing a
domain's power state, it can use start_device() and
stop_device() to operate on specific devices, or
power_on() and power_off() to power the entire domain up
and down.
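To illustrate how such a hierarchy constrains power transitions, here is a self-contained user-space sketch. It is not the patch's implementation: struct sketch_domain, its fixed-size child array, and the rule that a domain may be powered off only when its devices are stopped and its subdomains are off are assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Simplified stand-in for a power domain (hypothetical layout). */
struct sketch_domain {
	const char *name;
	struct sketch_domain *parent;
	struct sketch_domain *children[MAX_CHILDREN];
	size_t nr_children;
	unsigned int active_devices;	/* devices not yet stopped */
	bool power_is_off;
};

static int sketch_add_subdomain(struct sketch_domain *parent,
				struct sketch_domain *child)
{
	assert(parent && child);
	if (parent->nr_children == MAX_CHILDREN)
		return -ENOSPC;
	parent->children[parent->nr_children++] = child;
	child->parent = parent;
	return 0;
}

/* Refuse to cut power while anything inside the domain still needs it. */
static int sketch_power_off(struct sketch_domain *d)
{
	size_t i;

	if (d->active_devices > 0)
		return -EBUSY;
	for (i = 0; i < d->nr_children; i++)
		if (!d->children[i]->power_is_off)
			return -EBUSY;
	d->power_is_off = true;
	return 0;
}

/* A domain can only run if its parent is powered, so go up first. */
static int sketch_power_on(struct sketch_domain *d)
{
	if (d->parent && d->parent->power_is_off) {
		int ret = sketch_power_on(d->parent);

		if (ret)
			return ret;
	}
	d->power_is_off = false;
	return 0;
}
```

A driver in this model never needs to know which other devices share its power lines; it only reports that its device is stopped, and the domain code decides whether power can actually be removed.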
That is the core of the patch, though; naturally, there is a lot of
supporting infrastructure to manage domains, let them participate in
suspend and resume, etc. The one other piece is the construction of the
domain hierarchy itself. The patch set includes one example
implementation which is added to the ARM "shmobile" subarchitecture board
file. In the longer term, there will need to be a way to represent power
domains within device trees since board files are intended to go away.
This patch set has been through several revisions and seems likely to be
merged during the 3.1 development cycle.
Asymmetric multiprocessing
When one speaks of multiprocessor systems, the context is almost always
symmetric multiprocessing - SMP - where all of the processors are equal.
An embedded SoC may not be organized that way, though. Consider, for
example, this description from the introduction to a patch set from Ohad Ben-Cohen:
OMAP4, for example, has dual Cortex-A9, dual Cortex-M3 and a C64x+
DSP. Typically, the dual cortex-A9 is running Linux in a SMP
configuration, and each of the other three cores (two M3 cores and
a DSP) is running its own instance of RTOS in an AMP configuration.
Asymmetric multiprocessing (AMP) is what you get when a system consists of
unequal processors running different operating systems. It could be
thought of as a form of (very) local-area networking, but all of those
cores sit on the same die and share access to memory, I/O controllers, and
more. This type of processor isn't simply "running Linux"; instead, it has
Linux running on some processors trying to shepherd a mixed collection of
operating systems on a variety of CPUs.
Ohad's patch is an attempt to create a structure within which Linux can
direct a processor of this type. It starts with a framework called
"remoteproc" that allows the registration of "remote" processors. Through
this framework, the kernel can power those processors up and down and
manage the loading of firmware for them to run. Much of this code is
necessarily processor-specific, but the framework abstracts away the
details and allows the kernel to deal with remote processors in a more
generic fashion.
Once the remote processor is running, the kernel needs to be able to
communicate with it. To that end, the patch set creates the concept of
"channels" which can be used to pass messages between processors. These
messages go through a ring buffer stored in memory visible to both
processors; virtio is used to implement the rings. A small piece of
processor-specific code is needed to implement a doorbell to inform
processors when a message arrives; the rest should be mostly independent
of the actual system that it is running on.
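The channel idea (a ring of message slots in shared memory plus a doorbell) can be sketched in plain C. This is not the virtio ring format or the patch set's API; the structure, function names, and sizes are hypothetical, and the memory barriers that a real shared-memory implementation between processors would need are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define RING_SLOTS 4	/* must be a power of two */
#define MSG_BYTES  32

/* Would live in memory visible to both processors. */
struct sketch_ring {
	unsigned int head;	/* next slot the sender writes */
	unsigned int tail;	/* next slot the receiver reads */
	char slots[RING_SLOTS][MSG_BYTES];
};

/* The processor-specific part: notify the other side. */
typedef void (*doorbell_fn)(void *ctx);

static bool sketch_send(struct sketch_ring *r, const char *msg,
			doorbell_fn ring_doorbell, void *ctx)
{
	assert(strlen(msg) < MSG_BYTES);
	if (r->head - r->tail == RING_SLOTS)
		return false;			/* ring is full */
	strcpy(r->slots[r->head % RING_SLOTS], msg);
	r->head++;
	if (ring_doorbell)
		ring_doorbell(ctx);		/* tell the receiver */
	return true;
}

static bool sketch_recv(struct sketch_ring *r, char *out)
{
	if (r->head == r->tail)
		return false;			/* nothing pending */
	strcpy(out, r->slots[r->tail % RING_SLOTS]);
	r->tail++;
	return true;
}

/* Example doorbell that just counts how often it was rung. */
static void count_doorbell(void *ctx)
{
	++*(int *)ctx;
}
```

Only the doorbell callback depends on the hardware in this sketch, which mirrors the split described above between generic ring handling and a small processor-specific notification hook.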
This patch set has been reasonably well received as a good start toward the
goal of regularizing the management of AMP systems. A complete solution is
likely to require quite a bit more work, including implementations for a
wider variety of architectures. But, then, one could say that, after
twenty years, Linux as a whole is still working toward a complete
solution. The hardware continues to evolve toward more complexity; the
operating system will have to keep evolving in response. These two patch
sets give some hints of the direction that evolution is likely to take in
the near future.
By Jake Edge
June 29, 2011
Handling user-controlled data properly is one of the basic principles of
computer security. Various kernel log messages allow
user-controlled strings to be placed into the messages via the "%s"
format specifier, which could be used by an attacker to potentially
confuse administrators by inserting control characters into the strings. So
Vasiliy Kulikov has proposed a patch that
would escape certain characters that appear in those strings. There is
some question as to which characters should be escaped, but the
bigger question is an age-old one in security circles: whitelisting
vs. blacklisting.
The problem stems from the idea that administrators will often use tools
like tail and more to view log files on a TTY. If a user
can insert control characters (and, in particular, escape sequences) into
the log file, they could potentially cause important information to be
overlooked—or cause other kinds of confusion. In the worst case,
escape sequences could potentially exploit some hole in the terminal
emulator program to execute code or cause other misbehavior. In the patch, Kulikov gives the following example: "Control characters
might fool root viewing the logs via tty, e.g. using ^[1A to suppress
the previous log line." For characters that are filtered, the patch
simply replaces them with "#xx", where xx is the hex value of the character.
It's a fairly minor issue, at some level, but it's not at all clear that
there is any legitimate use of control characters in those user-supplied
strings. The strings could come from various places; two that were
mentioned in the discussion were filenames or USB product ID strings. The
first version of the patch clearly went too
far by escaping characters above 0x7e (in addition to control characters),
which would exclude Unicode and other non-ASCII
characters. But after complaints about that, Kulikov's second version just
excludes control characters (i.e. < 0x20) with the exception of newline and
tab.
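The escaping rule of the second version, as described above, can be sketched as follows. This is a user-space illustration and not the code from the patch: the function names and the truncation behavior are assumptions; only the "#xx" replacement format and the set of filtered bytes come from the description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Second-version rule: control characters below 0x20, except \n and \t. */
static int needs_escape(unsigned char c)
{
	return c < 0x20 && c != '\n' && c != '\t';
}

/*
 * Copy src into dst, replacing each filtered byte with "#xx" (two hex
 * digits). Output is truncated to fit dst_size; returns the number of
 * bytes written, not counting the terminating NUL.
 */
static size_t sketch_escape(char *dst, size_t dst_size, const char *src)
{
	size_t out = 0;

	assert(dst_size > 0);
	for (; *src; src++) {
		unsigned char c = (unsigned char)*src;
		size_t need = needs_escape(c) ? 3 : 1;

		if (out + need >= dst_size)
			break;			/* keep room for the NUL */
		if (need == 3)
			snprintf(dst + out, 4, "#%02x", c);
		else
			dst[out] = (char)c;
		out += need;
	}
	dst[out] = '\0';
	return out;
}
```

With this rule, the escape sequence from Kulikov's example would reach the log as the visible text "#1b[1A" rather than moving the cursor. Note that newlines pass through untouched.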
That didn't sit well with Ingo Molnar, however, who thought that rather than whitelisting the
known-good characters, blacklisting those known to be potentially harmful
should be done instead:
Also, i think it would be better to make this opt-out, i.e. exclude
the handful of control characters that are harmful (such as backline
and console escape), instead of trying to include the known-useful
ones.
[...]
It's also the better approach for the kernel: we handle known harmful
things and are permissive otherwise.
But, in order to create a blacklist, one must carefully determine the
effects of the various control characters on all the different terminal
emulators, whereas the whitelist approach
has the advantage of being simpler by casting a much wider net. As Kulikov
notes, figuring out which characters are
problematic is not necessarily simple:
Could you instantly answer without reading the previous discussion what
control characters are harmful, what are sometimes harmful (on some
ttys), and what are always safe and why (or even answer why it is
harmful at all)? I'm not a tty guy and I have to read console_codes(4)
or similar docs to answer this question, the majority of kernel devs
might have to read the docs too.
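The difference between the two approaches can be shown with a pair of predicates. The whitelist follows the second version of the patch; the blacklist is purely hypothetical, since nobody in the discussion settled on an actual list, and the three bytes chosen here are assumptions for the sake of the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Whitelist: allow printable and high bytes, plus newline and tab. */
static bool whitelist_allows(unsigned char c)
{
	return c >= 0x20 || c == '\n' || c == '\t';
}

/* Blacklist: block only bytes believed to be harmful (assumed set). */
static bool blacklist_allows(unsigned char c)
{
	switch (c) {
	case 0x08:	/* backspace */
	case 0x0d:	/* carriage return */
	case 0x1b:	/* escape */
		return false;
	default:
		return true;
	}
}

/* Count the byte values on which the two policies disagree. */
static unsigned int sketch_count_disagreements(void)
{
	unsigned int c, n = 0;

	for (c = 0; c < 256; c++) {
		bool w = whitelist_allows((unsigned char)c);
		bool b = blacklist_allows((unsigned char)c);

		/* This blacklist is never stricter than the whitelist. */
		assert(!w || b);
		if (w != b)
			n++;
	}
	return n;
}
```

Every byte counted by sketch_count_disagreements() is a control character that the whitelist escapes and this blacklist lets through, form feed (0x0c) being one example. Whether any of those bytes matter on a given terminal is exactly the question Kulikov raises.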
The disagreement between Molnar and Kulikov is one that has gone on in the
security world for many years. There is no right answer as to which
is better. As with most things in security (and software development for
that matter), there are tradeoffs between whitelists and blacklists. In
general, for user-supplied data (in web applications for example), the
consensus has been to whitelist known-good input, rather than attempting to
determine all of the "bad" input to exclude. At least in this case,
though, Molnar does not see whitelists as the right approach:
A black list is well-defined: it disables the display of certain
characters because they are *known to be dangerous*.
A white list on the other hand does it the wrong way around: it tries
to put the 'burden of proof' on the useful, good guys - and that's
counter-productive really.
It won't come as a surprise that Kulikov disagreed with that analysis: "What do you do with dangerous characters that are *not yet known* to be
dangerous?" While there is little question that whitelisting the
known-good characters is more secure, it is less flexible if there is
a legitimate use for other control characters in the user-supplied
strings. In addition, Molnar is skeptical
that there are hidden dangers lurking in the ASCII control characters: "This claim is silly - do you claim some 'unknown bug' in the ASCII
printout space?"
In this particular case, either solution should be just fine, as there
aren't any good reasons to include those characters, but Molnar is probably
right that there aren't hidden dangers in ASCII. There is
a question as to whether this change is needed at all, however. The
concern that spawned the patch is that
administrators might miss important messages or get fooled by carefully
crafted input (Willy Tarreau provides an interesting
example of the latter). Linus Torvalds is not convinced that it is really
a problem that needs addressing:
I really think that user space should do its own filtering - nobody
does a plain 'cat' on dmesg. Or if they do, they really have
themselves to blame.
And afaik, we don't do any escape sequence handling at the console
level either, so you cannot mess up the console with control
characters.
And the most dangerous character seems to be one that you don't
filter: the one we really do react to is '\n', and you could possibly
make confusing log messages by embedding a newline in your string and
then trying to make the rest look like something bad (say, an oops).
Given Torvalds's skepticism, it doesn't seem all that likely this patch will go
anywhere even if it were changed to a blacklisting approach as advocated
by Molnar. It is, or should be, a fairly minor concern, but the question
about blacklisting vs. whitelisting is one we will likely hear again.
There are plenty of examples of both techniques being used in security (and
other) contexts. It often comes down to a choice between more security
(whitelisting typically) or more usability (blacklisting). This case is no
different, really, and
others are sure to crop up.
Patches and updates
Security-related
- Mimi Zohar: EVM (June 29, 2011)
Page editor: Jonathan Corbet