Kernel development

Brief items

Kernel release status

The current development kernel is 3.0-rc5, released on June 27. "Nothing terribly exciting here. The most noteworthy thing may be that only about a quarter of the changes are in drivers, filesystem changes actually account for more (40%): btrfs, cifs, ext4, jbd2, nfs are all present and accounted for." Details can be found in the full changelog.

Stable updates: the 2.6.32.42, 2.6.33.15, and 2.6.39.2 updates were released on June 23 with a long list of fixes. The 2.6.34.10 update - the first since April - came out on June 26 with a very long list of fixes.


Quotes of the week

As I have scratched my head for some times with rcu dynticks for the nohz cpuset things, I guess that in the meantime my unconscious mind developed the idea that rcu extended quiescent states were a worldwide topic that every people talk about in dinner with their family, that these even became the core stories of some lullabies. I can still hear its chorus, that makes one switching to idle peacefully...

-- Frederic Weisbecker

The V4L2 spec needs to be fixed with respect to error codes. Driver authors are much more creative than DocBook authors.

-- Mauro Carvalho Chehab

The various POSIX_FADV_foo's are so ill-defined that it was a mistake to ever use them. We should have done something overtly linux-specific and given userspace more explicit and direct pagecache control.

-- Andrew Morton


2011 Kernel Summit Planning starts

The kickoff message for the 2011 Kernel Summit planning process has been sent out. The event will be held October 24-26 in Prague, with some tweaks to the longstanding formula. "This year, the biggest change is that the conference will be running three days, where the first day will be dedicated to some kernel subsystem workshops. The second day will be focused on development process issues and be more discussion oriented; for this reason, it will be limited to core kernel developers picked through a nomination and selection process which as in previous years. The third day will be more presentation oriented (although hopefully we will have some discussion); and all kernel summit workshop attendees will be welcome attend the 3rd day." Now is the time for interested folks to help shape the agenda.


-EWHICHERROR?

By Jonathan Corbet
June 29, 2011

Users of the Video4Linux2 API know that it is a rather complicated one, involving some 91 different ioctl() commands. The error-reporting side of the API is much simpler, though; if something goes wrong, the application is almost certain to get EINVAL back. That error can be trying to tell user space that the device is in the wrong state, that some parameter was out of range, or, simply, that the requested command has not been implemented. Needless to say, it can be hard for developers to figure out what is really going on.

V4L2 maintainer Mauro Carvalho Chehab recently posted a patch to change the return code to ENOIOCTLCMD in cases where the underlying driver has not actually implemented the requested command. That change would at least distinguish one set of problems - except that the VFS code silently translates ENOIOCTLCMD to EINVAL before returning to user space. So, from the point of view of the application, nothing changes.

Interestingly, the rules for what is supposed to happen in this situation are relatively clear: if an ioctl() command has not been implemented, the kernel should return ENOTTY. Some parts of the kernel follow that convention, while others don't. This is not a new or Linux-specific problem; as Linus put it: "The EINVAL thing goes way back, and is a disaster. It predates Linux itself, as far as I can tell." He has suggested simply changing ENOIOCTLCMD to ENOTTY across the kernel and seeing what happens.

What happens, of course, is that the user-space ABI changes. It is entirely possible that, somewhere out there, some program depends on getting EINVAL for a missing ioctl() function and will break if the return code changes. There is only one way to find out for sure: make the change and see what happens. Mauro reports that making that change within V4L2 does not seem to break things, so chances are good that the change will find its way into 3.1. A tree-wide change could have much wider implications; whether somebody will find the courage to try that remains to be seen.
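The difference between the two translations can be modeled in a few lines of user-space C. This is only a sketch: ENOIOCTLCMD is a kernel-internal code that is not exported to user space, so its value is defined by hand here, and the function names are hypothetical rather than anything found in the VFS.

```c
#include <assert.h>
#include <errno.h>

/* Kernel-internal "no such ioctl command" code. It is never meant to
 * reach user space, so it is defined by hand for this sketch. */
#ifndef ENOIOCTLCMD
#define ENOIOCTLCMD 515
#endif

/* Behavior described in the article: an unimplemented command is
 * reported to user space as EINVAL. */
static int translate_current(int err)
{
	return (err == -ENOIOCTLCMD) ? -EINVAL : err;
}

/* Suggested behavior: an unimplemented command becomes ENOTTY. */
static int translate_proposed(int err)
{
	return (err == -ENOIOCTLCMD) ? -ENOTTY : err;
}

/* Hypothetical application-side test: can the caller tell that the
 * command is simply missing, as opposed to being given bad input? */
static int command_is_missing(int user_visible_err)
{
	return user_visible_err == -ENOTTY;
}

/* Small self-check covering both translations. */
static void check_translation(void)
{
	assert(translate_current(-ENOIOCTLCMD) == -EINVAL);
	assert(translate_proposed(-ENOIOCTLCMD) == -ENOTTY);
	assert(!command_is_missing(translate_current(-ENOIOCTLCMD)));
	assert(command_is_missing(translate_proposed(-ENOIOCTLCMD)));
}
```

With the current translation, an application cannot separate "not implemented" from "bad parameter"; with the suggested one it can, which is also exactly why the change is visible in the ABI.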


Kernel development news

PCIe, power management, and problematic BIOSes

By Jonathan Corbet
June 29, 2011

Back in April, Phoronix announced with some fanfare that the 2.6.38 kernel - and those following - had a "major" power management regression which significantly reduced battery life on some systems. This problem has generated a fair amount of discussion, including this Launchpad thread, but little in the way of solutions. Phoronix now claims to have located the change that caused the problem and has provided a workaround which will make things better for some users. But a true fix may be a while in coming.

As a result of the high clock rates used, PCI-Express devices can take a lot of power even when they are idle. "Active state power management" (ASPM) was developed as a means for putting those peripherals into a lower power state when it seems that there may be little need for them. ASPM can save power, but the usual tradeoff applies: a device which is in a reduced power state will not be immediately available for use. So, on systems where ASPM is in use, access to devices can sometimes take noticeably longer if those devices have been powered down. In some situations (usually those involving batteries) this tradeoff may be acceptable; in others it is not. So, like most power management mechanisms, ASPM can be turned on or off.

It is a bit more complicated than that, though; on x86 systems, the BIOS also gets into the act. The BIOS is charged with the initial configuration of the system and with telling the kernel about the functionality that is present. One bit of information passed to the kernel by the BIOS is whether ASPM is supported on the system. The kernel developers, reasonably, concluded that, if the BIOS says that ASPM is not supported, they should not mess with the associated registers. It turns out, though, that this approach didn't quite work; thus, in December, Matthew Garrett committed a patch described as:

We currently refuse to touch the ASPM registers if the BIOS tells us that ASPM isn't supported. This can cause problems if the BIOS has (for any reason) enabled ASPM on some devices anyway. Change the code such that we explicitly clear ASPM if the FADT indicates that ASPM isn't supported, and make sure we tidy up appropriately on device removal in order to deal with the hotplug case.

In other words, sometimes the BIOS will tell the system that ASPM is not supported even though ASPM support is present; for added fun, the BIOS may enable ASPM on some devices (even though it says ASPM is not supported) before passing control to the kernel. There are reasons why operating system developers tend to hold BIOS developers in low esteem.

Had Andrew Morton read the above changelog, he certainly would have complained that "can cause problems" is a rather vague justification for a change to the kernel. Your editor asked Matthew about the problem and got an informative response that is worth reading in its entirety:

If this bit is set, the platform is indicating to the OS that it doesn't support ASPM. In the past we took that to mean that we simply shouldn't touch the ASPM bits. However, it turns out that there's some systems where the BIOS has enabled ASPM itself, set the "ASPM unsupported" bit and then the hardware falls over when an ASPM transition occurs. The most straightforward thing to assume was that the BIOS was stupid (which is, to be fair, my default assumption) and shouldn't have enabled ASPM. So, since that patch, we clear the ASPM state when the BIOS indicates that the platform doesn't support ASPM.

It's not hard to imagine that putting devices into a state that the kernel was told should not exist might create confusion somewhere. Some research turns up, for example, this bug report about system hangs which are fixed by Matthew's patch. If the BIOS says that ASPM is not supported, it would seem that ensuring that no devices think otherwise would make sense.

That said, this patch is the one that the bisection effort at Phoronix has fingered as the cause of the power regression. Apparently, the notion that disabling low-power states in hardware may lead to increased power consumption also makes sense. The workaround suggested in the article is to boot with the pcie_aspm=force option; that forces the system to turn on ASPM regardless of whether the BIOS claims to support it. This workaround will undoubtedly yield better battery life on some affected systems; others may well not work at all. In the latter case, the system may simply lock up - a state with even worse latency characteristics combined with surprisingly bad power use. So this workaround may be welcomed by users who have seen their battery life decline significantly, but it is not a proper solution to the problem.
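The decision being argued over can be written down as a small model. The sketch below is not kernel code; the enum, the function names, and the treatment of the "BIOS reports support" case as simply "normal handling" are all assumptions made for illustration. It only encodes what the text above says about systems where the BIOS claims that ASPM is unsupported.

```c
#include <assert.h>
#include <stdbool.h>

/* What the kernel does with a device's ASPM state (model only). */
enum aspm_action {
	ASPM_NORMAL,		/* BIOS reports support: usual handling */
	ASPM_LEAVE_ALONE,	/* do not touch the ASPM registers */
	ASPM_CLEAR,		/* explicitly disable ASPM on the device */
	ASPM_FORCE_ON,		/* pcie_aspm=force: enable regardless */
};

/* Before the December patch: "unsupported" meant "hands off". */
static enum aspm_action aspm_action_before_patch(bool bios_says_unsupported)
{
	return bios_says_unsupported ? ASPM_LEAVE_ALONE : ASPM_NORMAL;
}

/* After the patch: "unsupported" means "clear whatever the BIOS
 * enabled", unless the user booted with pcie_aspm=force. */
static enum aspm_action aspm_action_after_patch(bool bios_says_unsupported,
						bool force)
{
	if (force)
		return ASPM_FORCE_ON;
	return bios_says_unsupported ? ASPM_CLEAR : ASPM_NORMAL;
}

/* Self-check for the case the article is about. */
static void check_aspm_model(void)
{
	assert(aspm_action_before_patch(true) == ASPM_LEAVE_ALONE);
	assert(aspm_action_after_patch(true, false) == ASPM_CLEAR);
	assert(aspm_action_after_patch(true, true) == ASPM_FORCE_ON);
}
```

The regression lives in the difference between ASPM_LEAVE_ALONE and ASPM_CLEAR: on machines where the BIOS had quietly enabled ASPM and the hardware coped with it, the old behavior saved power and the new one does not.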

Finding that proper solution - preferably one which Just Works without any need for special boot parameters - could be tricky. Quoting Matthew again:

What alternatives are there? We could keep the status quo and add driver whitelisting for hardware setups that are known to work. The problem is that even where we have specifications for the hardware, we often don't have the errata lists. We don't know for sure whether it works or not. We could revert this patch and add more driver blacklisting. But then we need to track down every device that doesn't work. Or, it's possible that the original code was correct and Linux simply programs the hardware differently, triggering ASPM issues that aren't seen elsewhere.

Given the uncertainty in the situation, the kernel developers have reached the conclusion that "waste a bit of power" is a lesser evil than "lock up on some systems." In the absence of a better understanding of the problem, any other approach would be hard to justify. So some users may have to use the pcie_aspm=force workaround for a while yet.

Meanwhile, the power usage problem has, as far as your editor can tell, never been raised on any kernel development mailing list. It never appeared in the 2.6.38 regression list. So this issue was invisible to much of the development community; it's not entirely surprising that it has not received much in the way of attention from developers. For better or for worse, the development community has its way of dealing with issues. Reporting a bug to linux-kernel certainly does not guarantee that it will get fixed, but it does improve the odds considerably. Had this issue been brought directly to the developers involved, we might have learned about the root cause some time ago.


Dealing with complexity: power domains and asymmetric multiprocessing

By Jonathan Corbet
June 29, 2011

When one thinks of embedded systems, it may be natural to think of extremely simple processors which are just barely able to perform the tasks which are asked of them. That view is somewhat dated, though. Contemporary processors are often put into settings where they are expected to manage a variety of peripherals - cameras, signal processors, radios, etc. - while using a minimum of power. Indeed, a reasonably capable system-on-chip (SoC) processor likely has controllers for these peripherals built in. The result is a processor which presents a high level of complexity to the operating system. This article will look at a couple of patch sets which show how the kernel is changing to deal with these processors.

Power domains

On a desktop (or laptop) system, power management is usually a matter of putting the entire CPU into a low-power state when the load on the system allows. Embedded processors are a little different: as noted above, they tend to contain a wide variety of functional units. Each of these units can be powered down (and have its clocks turned off) when it is not needed, while the rest of the processor continues to function. The kernel can handle the powering down of individual subsystems now; what makes things harder is the power dependencies between devices.

Power management was one of the motivations behind the addition of the kernel's device model in the 2.5 development series. It does not make sense, for example, to power down a USB controller if devices attached to that controller remain in operation. The device model captures the connection topology of the system; this information can be used to power devices up and down in a reasonable order. The result was much improved power management in the 2.6 kernel.

On newer systems, though, there are likely to be dependencies between subsystems that are not visible in the bus topology. A set of otherwise unrelated devices may share the same clock or power lines, meaning that they can only be powered up or down as a group. Different SoC designs may feature combinations of the same controllers with different power connections. As a result, drivers for specific controllers often cannot know whether it is safe to power down their devices - or even how to do it. This information must be maintained at a level separate from the device hierarchy if the system is to be able to make reasonable power management decisions.

The answer to this problem would appear to be Rafael Wysocki's generic I/O power domains patch set. A power domain looks like this:

    struct generic_pm_domain {
        struct dev_pm_domain domain;
        struct list_head sd_node;           /* entry in the parent's sd_list */
        struct generic_pm_domain *parent;   /* parent power domain */
        struct list_head sd_list;           /* list of child domains */
        struct list_head dev_list;          /* devices in this domain */
        bool power_is_off;
        int (*power_off)(struct generic_pm_domain *domain);
        int (*power_on)(struct generic_pm_domain *domain);
        int (*start_device)(struct device *dev);
        int (*stop_device)(struct device *dev);
        /* Others omitted */
    };

Power domains are hierarchical, though the hierarchy may differ from the bus hierarchy. So each power domain has a parent domain (parent), a link to its sibling domains (sd_node, its entry in the parent's list of children), and a list of child domains (sd_list); there is also, naturally, a list of devices contained within the domain (dev_list). When the kernel is changing a domain's power state, it can use start_device() and stop_device() to operate on specific devices, or power_on() and power_off() to power the entire domain up and down.
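To see why the hierarchy matters, consider a much-simplified model of such a tree. Everything in this sketch is hypothetical: struct pm_domain_model and its helpers are invented for illustration, and the rule it applies (a domain may be powered off only when none of its devices is active and all of its child domains are already off) is an assumption about how the hierarchy would be used, not a description of the patch itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Hypothetical, simplified stand-in for struct generic_pm_domain. */
struct pm_domain_model {
	struct pm_domain_model *parent;
	struct pm_domain_model *children[MAX_CHILDREN];
	size_t nr_children;
	unsigned int active_devices;	/* devices that are not stopped */
	bool power_is_off;
};

/* Attach a child domain to its parent. */
static void domain_add_child(struct pm_domain_model *parent,
			     struct pm_domain_model *child)
{
	assert(parent->nr_children < MAX_CHILDREN);
	child->parent = parent;
	parent->children[parent->nr_children++] = child;
}

/* Assumed rule: no active devices and every child already off. */
static bool domain_can_power_off(const struct pm_domain_model *d)
{
	size_t i;

	if (d->active_devices > 0)
		return false;
	for (i = 0; i < d->nr_children; i++)
		if (!d->children[i]->power_is_off)
			return false;
	return true;
}

/* Power off a domain if possible, then walk upward to see whether the
 * parents can follow. Returns the number of domains powered off. */
static int domain_try_power_off(struct pm_domain_model *d)
{
	int count = 0;

	while (d && !d->power_is_off && domain_can_power_off(d)) {
		d->power_is_off = true;
		count++;
		d = d->parent;
	}
	return count;
}
```

In this model, powering off the last active child is what allows the parent to go down as well; a driver looking only at its own device could not make that decision.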

That is the core of the patch set, though there is, naturally, a lot of supporting infrastructure to manage domains, let them participate in suspend and resume, and so on. The one other piece is the construction of the domain hierarchy itself. The patch set includes one example implementation which is added to the ARM "shmobile" subarchitecture board file. In the longer term, there will need to be a way to represent power domains within device trees since board files are intended to go away.

This patch set has been through several revisions and seems likely to be merged during the 3.1 development cycle.

Asymmetric multiprocessing

When one speaks of multiprocessor systems, the context is almost always symmetric multiprocessing - SMP - where all of the processors are equal. An embedded SoC may not be organized that way, though. Consider, for example, this description from the introduction to a patch set from Ohad Ben-Cohen:

OMAP4, for example, has dual Cortex-A9, dual Cortex-M3 and a C64x+ DSP. Typically, the dual cortex-A9 is running Linux in a SMP configuration, and each of the other three cores (two M3 cores and a DSP) is running its own instance of RTOS in an AMP configuration.

Asymmetric multiprocessing (AMP) is what you get when a system consists of unequal processors running different operating systems. It could be thought of as a form of (very) local-area networking, but all of those cores sit on the same die and share access to memory, I/O controllers, and more. This type of processor isn't simply "running Linux"; instead, it has Linux running on some processors trying to shepherd a mixed collection of operating systems on a variety of CPUs.

Ohad's patch is an attempt to create a structure within which Linux can direct a processor of this type. It starts with a framework called "remoteproc" that allows the registration of "remote" processors. Through this framework, the kernel can power those processors up and down and manage the loading of firmware for them to run. Much of this code is necessarily processor-specific, but the framework abstracts away the details and allows the kernel to deal with remote processors in a more generic fashion.

Once the remote processor is running, the kernel needs to be able to communicate with it. To that end, the patch set creates the concept of "channels" which can be used to pass messages between processors. These messages go through a ring buffer stored in memory visible to both processors; virtio is used to implement the rings. A small piece of processor-specific code is needed to implement a doorbell to inform processors of when a message arrives; the rest should be mostly independent of the actual system that it is running on.
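The general shape of such a channel, a ring in shared memory plus a doorbell, can be sketched as follows. This is not the virtio ring used by the patch set; it is a minimal single-producer, single-consumer model with invented names, and it leaves out the memory barriers that real shared-memory code between two processors would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SLOTS 8	/* power of two, so free-running indices wrap cleanly */

/* Hypothetical message ring living in memory both sides can see. */
struct msg_ring {
	uint32_t head;			/* next slot the sender writes */
	uint32_t tail;			/* next slot the receiver reads */
	uint32_t slots[RING_SLOTS];	/* message payloads */
	void (*doorbell)(void);		/* processor-specific notification */
};

/* Stand-in doorbell: a real one would poke a mailbox or interrupt. */
static unsigned int doorbell_rings;

static void count_doorbell(void)
{
	doorbell_rings++;
}

/* Queue one message and ring the doorbell; fails if the ring is full. */
static bool ring_send(struct msg_ring *r, uint32_t msg)
{
	assert(r->head - r->tail <= RING_SLOTS);
	if (r->head - r->tail == RING_SLOTS)
		return false;
	r->slots[r->head % RING_SLOTS] = msg;
	r->head++;
	if (r->doorbell)
		r->doorbell();
	return true;
}

/* Take the oldest message; fails if the ring is empty. */
static bool ring_receive(struct msg_ring *r, uint32_t *msg)
{
	if (r->head == r->tail)
		return false;
	*msg = r->slots[r->tail % RING_SLOTS];
	r->tail++;
	return true;
}
```

Only the doorbell callback depends on the hardware in this model, which mirrors the split described above between a small processor-specific piece and a generic remainder.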

This patch set has been reasonably well received as a good start toward the goal of regularizing the management of AMP systems. A complete solution is likely to require quite a bit more work, including implementations for a wider variety of architectures. But, then, one could say that, after twenty years, Linux as a whole is still working toward a complete solution. The hardware continues to evolve toward more complexity; the operating system will have to keep evolving in response. These two patch sets give some hints of the direction that evolution is likely to take in the near future.


Sanitizing log file output

By Jake Edge
June 29, 2011

Handling user-controlled data properly is one of the basic principles of computer security. Various kernel log messages allow user-controlled strings to be placed into the messages via the "%s" format specifier, which could be used by an attacker to potentially confuse administrators by inserting control characters into the strings. So Vasiliy Kulikov has proposed a patch that would escape certain characters that appear in those strings. There is some question as to which characters should be escaped, but the bigger question is an age-old one in security circles: whitelisting vs. blacklisting.

The problem stems from the idea that administrators will often use tools like tail and more to view log files on a TTY. If a user can insert control characters (and, in particular, escape sequences) into the log file, they could potentially cause important information to be overlooked—or cause other kinds of confusion. In the worst case, escape sequences could potentially exploit some hole in the terminal emulator program to execute code or cause other misbehavior. In the patch, Kulikov gives the following example: "Control characters might fool root viewing the logs via tty, e.g. using ^[1A to suppress the previous log line." For characters that are filtered, the patch simply replaces them with "#xx", where xx is the hex value of the character.

It's a fairly minor issue, at some level, but it's not at all clear that there is any legitimate use of control characters in those user-supplied strings. The strings could come from various places; two that were mentioned in the discussion were filenames and USB product ID strings. The first version of the patch clearly went too far by escaping characters above 0x7e (in addition to control characters), which would exclude Unicode and other non-ASCII characters. After complaints about that, Kulikov's second version just excludes control characters (i.e. < 0x20) with the exception of newline and tab.
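The rule in the second version is simple enough to sketch in user-space C. This is not Kulikov's patch; the function names and the buffer handling are invented here, and the use of lowercase hex digits is an assumption, since the description above only says "#xx".

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Second-version rule as described above: filter control characters
 * below 0x20, except for newline and tab. */
static int needs_escape(unsigned char c)
{
	return c < 0x20 && c != '\n' && c != '\t';
}

/* Copy src into dst, replacing each filtered byte with "#xx".
 * Returns the number of bytes written (not counting the terminating
 * NUL), or -1 if dst is too small. */
static int sanitize_log_string(const char *src, char *dst, size_t dst_size)
{
	size_t out = 0;

	assert(src && dst);
	for (; *src; src++) {
		unsigned char c = (unsigned char)*src;
		size_t need = needs_escape(c) ? 3 : 1;

		/* Leave room for this character and the final NUL. */
		if (out + need + 1 > dst_size)
			return -1;
		if (needs_escape(c)) {
			snprintf(dst + out, 4, "#%02x", (unsigned int)c);
			out += 3;
		} else {
			dst[out++] = (char)c;
		}
	}
	if (out + 1 > dst_size)
		return -1;
	dst[out] = '\0';
	return (int)out;
}
```

Fed a string containing the escape sequence from Kulikov's example, this sketch leaves the printable part of the sequence visible in the log and turns only the escape byte itself into "#1b".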

That didn't sit well with Ingo Molnar, however, who thought that, rather than whitelisting the known-good characters, the patch should blacklist those known to be potentially harmful:

Also, i think it would be better to make this opt-out, i.e. exclude the handful of control characters that are harmful (such as backline and console escape), instead of trying to include the known-useful ones.

[...] It's also the better approach for the kernel: we handle known harmful things and are permissive otherwise.

But, in order to create a blacklist, one must carefully determine the effects of the various control characters on all the different terminal emulators, whereas the whitelist approach has the advantage of being simpler by casting a much wider net. As Kulikov notes, figuring out which characters are problematic is not necessarily simple:
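The practical difference between the two approaches shows up in the characters that nobody has thought about yet. In the sketch below, the blacklist contents (backspace and escape) are purely illustrative and are not taken from the discussion; the whitelist predicate follows the second version of the patch as described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Whitelist style: every control character below 0x20 is filtered
 * unless it has been explicitly allowed (newline and tab). */
static bool filtered_by_whitelist(unsigned char c)
{
	return c < 0x20 && c != '\n' && c != '\t';
}

/* Blacklist style: only characters on the list are filtered.
 * The list here is an assumption made for the example. */
static bool filtered_by_blacklist(unsigned char c)
{
	static const unsigned char harmful[] = {
		0x08,	/* backspace */
		0x1b,	/* escape */
	};
	size_t i;

	for (i = 0; i < sizeof(harmful) / sizeof(harmful[0]); i++)
		if (c == harmful[i])
			return true;
	return false;
}

/* Count the control characters on which the two policies disagree. */
static unsigned int count_disagreements(void)
{
	unsigned int c, n = 0;

	for (c = 0; c < 0x20; c++) {
		/* With this list, the blacklist is a subset of what
		 * the whitelist approach filters. */
		assert(!filtered_by_blacklist((unsigned char)c) ||
		       filtered_by_whitelist((unsigned char)c));
		if (filtered_by_whitelist((unsigned char)c) !=
		    filtered_by_blacklist((unsigned char)c))
			n++;
	}
	return n;
}
```

Carriage return (0x0d) is one of the characters that slips through this particular blacklist; whether that matters on a given terminal is exactly the kind of question a blacklist author has to answer character by character.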

Could you instantly answer without reading the previous discussion what control characters are harmful, what are sometimes harmful (on some ttys), and what are always safe and why (or even answer why it is harmful at all)? I'm not a tty guy and I have to read console_codes(4) or similar docs to answer this question, the majority of kernel devs might have to read the docs too.

The disagreement between Molnar and Kulikov is one that has gone on in the security world for many years. There is no right answer as to which is better. As with most things in security (and software development for that matter), there are tradeoffs between whitelists and blacklists. In general, for user-supplied data (in web applications for example), the consensus has been to whitelist known-good input, rather than attempting to determine all of the "bad" input to exclude. At least in this case, though, Molnar does not see whitelists as the right approach:

A black list is well-defined: it disables the display of certain characters because they are *known to be dangerous*.

A white list on the other hand does it the wrong way around: it tries to put the 'burden of proof' on the useful, good guys - and that's counter-productive really.

It won't come as a surprise that Kulikov disagreed with that analysis: "What do you do with dangerous characters that are *not yet known* to be dangerous?" While there is little question that whitelisting the known-good characters is more secure, it is less flexible if there is a legitimate use for other control characters in the user-supplied strings. In addition, Molnar is skeptical that there are hidden dangers lurking in the ASCII control characters: "This claim is silly - do you claim some 'unknown bug' in the ASCII printout space?"

In this particular case, either solution should be just fine: there aren't any good reasons to include those characters, and Molnar is probably right that there aren't hidden dangers in ASCII. There is a question as to whether this change is needed at all, however. The concern that spawned the patch is that administrators might miss important messages or get fooled by carefully crafted input (Willy Tarreau provides an interesting example of the latter). Linus Torvalds is not convinced that it is really a problem that needs addressing:

I really think that user space should do its own filtering - nobody does a plain 'cat' on dmesg. Or if they do, they really have themselves to blame.

And afaik, we don't do any escape sequence handling at the console level either, so you cannot mess up the console with control characters.

And the most dangerous character seems to be one that you don't filter: the one we really do react to is '\n', and you could possibly make confusing log messages by embedding a newline in your string and then trying to make the rest look like something bad (say, an oops).

Given Torvalds's skepticism, it doesn't seem all that likely that this patch will go anywhere, even if it is changed to a blacklisting approach as advocated by Molnar. It is, or should be, a fairly minor concern, but the question about blacklisting vs. whitelisting is one we will likely hear again. There are plenty of examples of both techniques being used in security (and other) contexts. It often comes down to a choice between more security (whitelisting typically) or more usability (blacklisting). This case is no different, really, and others are sure to crop up.


Patches and updates

Kernel trees

Linus Torvalds Linux 3.0-rc5

Greg KH Linux 2.6.39.2

Paul Gortmaker Linux 2.6.34.10 has been released

Greg KH Linux 2.6.33.15

Greg KH Linux 2.6.32.42

Architecture-specific

Jeremy Fitzhardinge x86: convert ticketlocks to C and remove duplicate code

Vincent Guittot Add ARM cpu topology definition

Daniel Drake OLPC Power Management

Becky Bruce Hugetlb for 32-bit FSL PowerPC BookE

Core kernel code

Jonas Bonn modules: add default loader hook implementations

Development tools

Hui Zhu KGTP (Linux Kernel debugger and tracer) 201100626 release

Masami Hiramatsu tracing/kprobes: Dynamic events on module support

Jim Cromie dynamic_debug: allow multiple pending queries on boot-line

Device drivers

Lars-Peter Clausen ASoC: Add ADAV80x codec driver

Shawn Guo Add basic device support for imx51 babbage

Rafael J. Wysocki PM / Domains: Support for generic I/O PM domains

Mark Brown Generic I2C and SPI register map library

Eric Andersson input: add driver for Bosch Sensortec's BMA150 accelerometer

Alexander Stein hwmon: LM95245 driver

Sascha Hauer implement a generic PWM framework - once again

djkurtz@chromium.org Synaptics image sensor support

Tomasz Stanislawski TV drivers for Samsung S5P platform (media part)

Documentation

Németh Márton USBIP protocol documentation

Filesystems and block I/O

david.wagner@free-electrons.com ubiblk: read-only block layer on top of UBI

Andrea Righi fadvise: move active pages to inactive list with POSIX_FADV_DONTNEED

Andrea Righi fadvise: support POSIX_FADV_NOREUSE

Christoph Hellwig remove i_alloc_sem V2

Josef Bacik fs: add SEEK_HOLE and SEEK_DATA flags

Vivek Goyal blk-throttle: Throttle buffered WRITEs in balance_dirty_pages()

Arne Jansen btrfs: generic readeahead interface

Memory management

Geunsik Lim munmap: Flexible mem unmap operation interface for scheduling latency

Dan Magenheimer [PATCH] drivers/staging/zcache: support multiple clients, prep for RAMster and KVM

craigb x86: mm: dynamic BadRAM (extended E820)

Wu Fengguang write bandwidth estimation and writeback fixes v2

Networking

Aloisio Almeida Jr NFC subsystem

Security-related

Roberto Sassu eCryptfs: added support for the encrypted key type

Tetsuo Handa TOMOYO 2.4 Core part changes.

Mimi Zohar EVM

Virtualization and containers

Glauber Costa Steal time for KVM

Benchmarks and bugs

Rafael J. Wysocki 3.0-rc4: Reported regressions from 2.6.39

Rafael J. Wysocki 3.0-rc4: Reported regressions 2.6.38 -> 2.6.39

Page editor: Jonathan Corbet