Kernel development
Brief items
Kernel release status
The current development kernel is 3.0-rc5, released on June 27. "Nothing terribly exciting here. The most noteworthy thing may be that only about a quarter of the changes are in drivers, filesystem changes actually account for more (40%): btrfs, cifs, ext4, jbd2, nfs are all present and accounted for." Details can be found in the full changelog.
Stable updates: the 2.6.32.42, 2.6.33.15, and 2.6.39.2 updates were released on June 23 with a long list of fixes. The 2.6.34.10 update - the first since April - came out on June 26 with a very long list of fixes.
2011 Kernel Summit Planning starts
The kickoff message for the 2011 Kernel Summit planning process has been sent out. The event will be held October 24-26 in Prague, with some tweaks to the longstanding formula. "This year, the biggest change is that the conference will be running three days, where the first day will be dedicated to some kernel subsystem workshops. The second day will be focused on development process issues and be more discussion oriented; for this reason, it will be limited to core kernel developers picked through a nomination and selection process which as in previous years. The third day will be more presentation oriented (although hopefully we will have some discussion); and all kernel summit workshop attendees will be welcome attend the 3rd day." Now is the time for interested folks to help shape the agenda.
-EWHICHERROR?
Users of the Video4Linux2 API know that it is a rather complicated one, involving some 91 different ioctl() commands. The error-reporting side of the API is much simpler, though; if something goes wrong, the application is almost certain to get EINVAL back. That error can be trying to tell user space that the device is in the wrong state, that some parameter was out of range, or, simply, that the requested command has not been implemented. Needless to say, it can be hard for developers to figure out what is really going on.
V4L2 maintainer Mauro Carvalho Chehab recently posted a patch to change the return code to ENOIOCTLCMD in cases where the underlying driver has not actually implemented the requested command. That change would at least distinguish one set of problems - except that the VFS code silently translates ENOIOCTLCMD to EINVAL before returning to user space. So, from the point of view of the application, nothing changes.
Interestingly, the rules for what is supposed to happen in this situation are relatively clear: if an ioctl() command has not been implemented, the kernel should return ENOTTY. Some parts of the kernel follow that convention, while others don't. This is not a new or Linux-specific problem; as Linus put it: "The EINVAL thing goes way back, and is a disaster. It predates Linux itself, as far as I can tell." He has suggested simply changing ENOIOCTLCMD to ENOTTY across the kernel and seeing what happens.
What happens, of course, is that the user-space ABI changes. It is entirely possible that, somewhere out there, some program depends on getting EINVAL for a missing ioctl() function and will break if the return code changes. There is only one way to find out for sure: make the change and see what happens. Mauro reports that making that change within V4L2 does not seem to break things, so chances are good that change will find its way into 3.1. A tree-wide change could have much wider implications; whether somebody will find the courage to try that remains to be seen.
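The translation step at issue can be modeled in a few lines of user-space C. This is only a sketch of the behavior described above, not the kernel's actual code: the two function names are invented for illustration, and the value 515 for ENOIOCTLCMD is taken from the kernel's internal errno definitions, which user space normally never sees.

```c
#include <errno.h>

/* Kernel-internal error code; it is not part of the user-space
 * <errno.h>, so it is defined here for the purposes of the model. */
#ifndef ENOIOCTLCMD
#define ENOIOCTLCMD 515
#endif

/* Hypothetical model of the current behavior: an unimplemented
 * command is reported to the application as EINVAL, making it
 * indistinguishable from a bad parameter. */
static int ioctl_result_current(int driver_ret)
{
    return driver_ret == -ENOIOCTLCMD ? -EINVAL : driver_ret;
}

/* Hypothetical model of the suggested behavior: an unimplemented
 * command is reported as ENOTTY; all other results pass through. */
static int ioctl_result_proposed(int driver_ret)
{
    return driver_ret == -ENOIOCTLCMD ? -ENOTTY : driver_ret;
}
```

Under the second function, an application could tell "this driver does not implement the command" apart from "the arguments were wrong"; that is also exactly the ABI change that might break a program which checks specifically for EINVAL.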
Kernel development news
PCIe, power management, and problematic BIOSes
Back in April, Phoronix announced with some fanfare that the 2.6.38 kernel - and those following - had a "major" power management regression which significantly reduced battery life on some systems. This problem has generated a fair amount of discussion, including this Launchpad thread, but little in the way of solutions. Phoronix now claims to have located the change that caused the problem and has provided a workaround which will make things better for some users. But a true fix may be a while in coming.
As a result of the high clock rates used, PCI-Express devices can take a lot of power even when they are idle. "Active state power management" (ASPM) was developed as a means for putting those peripherals into a lower power state when it seems that there may be little need for them. ASPM can save power, but the usual tradeoff applies: a device which is in a reduced power state will not be immediately available for use. So, on systems where ASPM is in use, access to devices can sometimes take noticeably longer if those devices have been powered down. In some situations (usually those involving batteries) this tradeoff may be acceptable; in others it is not. So, like most power management mechanisms, ASPM can be turned on or off.
It is a bit more complicated than that, though; on x86 systems, the BIOS also gets into the act. The BIOS is charged with the initial configuration of the system and with telling the kernel about the functionality that is present. One bit of information passed to the kernel by the BIOS is whether ASPM is supported on the system. The kernel developers, reasonably, concluded that, if the BIOS says that ASPM is not supported, they should not mess with the associated registers. It turns out, though, that this approach didn't quite work; thus, in December, Matthew Garrett committed a patch described as:
In other words, sometimes the BIOS will tell the system that ASPM is not supported even though ASPM support is present; for added fun, the BIOS may enable ASPM on some devices (even though it says ASPM is not supported) before passing control to the kernel. There are reasons why operating system developers tend to hold BIOS developers in low esteem.
Had Andrew Morton read the above changelog, he certainly would have complained that "can cause problems" is a rather vague justification for a change to the kernel. Your editor asked Matthew about the problem and got an informative response that is worth reading in its entirety:
It's not hard to imagine that putting devices into a state that the kernel was told should not exist might create confusion somewhere. Some research turns up, for example, this bug report about system hangs which are fixed by Matthew's patch. If the BIOS says that ASPM is not supported, it would seem that ensuring that no devices think otherwise would make sense.
That said, this patch is the one that the bisection effort at Phoronix has fingered as the cause of the power regression. Apparently, the notion that disabling low-power states in hardware may lead to increased power consumption also makes sense. The workaround suggested in the article is to boot with the pcie_aspm=force option; that forces the system to turn on ASPM regardless of whether the BIOS claims to support it. This workaround will undoubtedly yield better battery life on some affected systems; others may well not work at all. In the latter case, the system may simply lock up - a state with even worse latency characteristics combined with surprisingly bad power use. So this workaround may be welcomed by users who have seen their battery life decline significantly, but it is not a proper solution to the problem.
Finding that proper solution - preferably one which Just Works without any need for special boot parameters - could be tricky. Quoting Matthew again:
Given the uncertainty in the situation, the kernel developers have reached the conclusion that "waste a bit of power" is a lesser evil than "lock up on some systems." In the absence of a better understanding of the problem, any other approach would be hard to justify. So some users may have to use the pcie_aspm=force workaround for a while yet.
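Users experimenting with the workaround may want to check which ASPM policy the kernel ended up with. The article does not discuss this, but kernels built with ASPM support expose the policy in /sys/module/pcie_aspm/parameters/policy as a list of names with the active one in square brackets. The helper below is a minimal sketch that extracts the bracketed entry from such a line; the function name is invented here, and reading the file itself is left to the caller.

```c
#include <stddef.h>
#include <string.h>

/* Copy the bracketed (active) entry from a policy line such as
 * "[default] performance powersave" into out.
 * Returns 0 on success, -1 if no bracketed entry is found or if
 * out is too small to hold it. */
static int aspm_active_policy(const char *line, char *out, size_t outlen)
{
    const char *lb = strchr(line, '[');
    const char *rb = lb ? strchr(lb, ']') : NULL;
    size_t len;

    if (!lb || !rb || outlen == 0)
        return -1;
    len = (size_t)(rb - lb - 1);
    if (len >= outlen)
        return -1;
    memcpy(out, lb + 1, len);
    out[len] = '\0';
    return 0;
}
```

The result only reports the policy the kernel is applying; it says nothing about whether the BIOS has been truthful about ASPM support, which is the heart of the problem described above.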
Meanwhile, the power usage problem has, as far as your editor can tell, never been raised on any kernel development mailing list. It never appeared in the 2.6.38 regression list. So this issue was invisible to much of the development community; it's not entirely surprising that it has not received much in the way of attention from developers. For better or for worse, the development community has its way of dealing with issues. Reporting a bug to linux-kernel certainly does not guarantee that it will get fixed, but it does improve the odds considerably. Had this issue been brought directly to the developers involved, we might have learned about the root cause some time ago.
Dealing with complexity: power domains and asymmetric multiprocessing
When one thinks of embedded systems, it may be natural to think of extremely simple processors which are just barely able to perform the tasks which are asked of them. That view is somewhat dated, though. Contemporary processors are often put into settings where they are expected to manage a variety of peripherals - cameras, signal processors, radios, etc. - while using a minimum of power. Indeed, a reasonably capable system-on-chip (SoC) processor likely has controllers for these peripherals built in. The result is a processor which presents a high level of complexity to the operating system. This article will look at a couple of patch sets which show how the kernel is changing to deal with these processors.
Power domains
On a desktop (or laptop) system, power management is usually a matter of putting the entire CPU into a low-power state when the load on the system allows. Embedded processors are a little different: as noted above, they tend to contain a wide variety of functional units. Each of these units can be powered down (and have its clocks turned off) when it is not needed, while the rest of the processor continues to function. The kernel can handle the powering down of individual subsystems now; what makes things harder is the power dependencies between devices.
Power management was one of the motivations behind the addition of the kernel's device model in the 2.5 development series. It does not make sense, for example, to power down a USB controller if devices attached to that controller remain in operation. The device model captures the connection topology of the system; this information can be used to power devices up and down in a reasonable order. The result was much improved power management in the 2.6 kernel.
On newer systems, though, there are likely to be dependencies between subsystems that are not visible in the bus topology. A set of otherwise unrelated devices may share the same clock or power lines, meaning that they can only be powered up or down as a group. Different SoC designs may feature combinations of the same controllers with different power connections. As a result, drivers for specific controllers often cannot know whether it is safe to power down their devices - or even how to do it. This information must be maintained at a level separate from the device hierarchy if the system is to be able to make reasonable power management decisions.
The answer to this problem would appear to be Rafael Wysocki's generic I/O power domains patch set. A power domain looks like this:
    struct generic_pm_domain {
        struct dev_pm_domain domain;
        struct list_head sd_node;
        struct generic_pm_domain *parent;
        struct list_head sd_list;
        struct list_head dev_list;
        bool power_is_off;
        int (*power_off)(struct generic_pm_domain *domain);
        int (*power_on)(struct generic_pm_domain *domain);
        int (*start_device)(struct device *dev);
        int (*stop_device)(struct device *dev);
        /* Others omitted */
    };
Power domains are hierarchical, though the hierarchy may differ from the bus hierarchy. So each power domain has a parent domain (parent), a list of sibling domains (sd_node), and a list of child domains (sd_list); there is also, naturally, a list of devices contained within the domain (dev_list). When the kernel is changing a domain's power state, it can use start_device() and stop_device() to operate on specific devices, or power_on() and power_off() to power the entire domain up and down.
That is the core of the patch set, though there is naturally a lot of supporting infrastructure to manage domains, let them participate in suspend and resume, and so on. The one other piece is the construction of the domain hierarchy itself. The patch set includes one example implementation which is added to the ARM "shmobile" subarchitecture board file. In the longer term, there will need to be a way to represent power domains within device trees, since board files are intended to go away.
This patch set has been through several revisions and seems likely to be merged during the 3.1 development cycle.
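The dependency rule behind hierarchical power domains can be illustrated with a toy model in user-space C. This is a sketch under stated assumptions, not the code from the patch set: the toy_* names and the simple counters are invented for illustration, and the real implementation tracks devices and subdomains with linked lists and callbacks rather than integers. The idea shown is that a domain may only be powered off once it has no running devices and no powered-on subdomains, and that powering a domain on requires powering on its parent first.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a power domain (hypothetical, simplified). */
struct toy_domain {
    struct toy_domain *parent;
    int active_devices;     /* devices currently started */
    int active_children;    /* subdomains currently powered on */
    bool power_is_off;
};

static void toy_domain_init(struct toy_domain *d, struct toy_domain *parent)
{
    d->parent = parent;
    d->active_devices = 0;
    d->active_children = 0;
    d->power_is_off = true;
}

/* Power a domain on, powering on its ancestors first. */
static void toy_power_on(struct toy_domain *d)
{
    if (!d->power_is_off)
        return;
    if (d->parent) {
        toy_power_on(d->parent);
        d->parent->active_children++;
    }
    d->power_is_off = false;
}

/* Power a domain off if nothing in it is in use, then give the
 * parent a chance to power off as well. Returns true if the
 * domain is off on return. */
static bool toy_try_power_off(struct toy_domain *d)
{
    if (d->power_is_off)
        return true;
    if (d->active_devices > 0 || d->active_children > 0)
        return false;
    d->power_is_off = true;
    if (d->parent) {
        d->parent->active_children--;
        toy_try_power_off(d->parent);
    }
    return true;
}

/* Starting a device forces its domain (and ancestors) on. */
static void toy_start_device(struct toy_domain *d)
{
    toy_power_on(d);
    d->active_devices++;
}

/* Stopping the last device lets the domain power down. */
static void toy_stop_device(struct toy_domain *d)
{
    if (d->active_devices > 0)
        d->active_devices--;
    toy_try_power_off(d);
}
```

With a parent domain and one child, starting a device in the child powers both on; stopping it powers the child off, and the parent follows only if it has no running devices of its own. A driver in this scheme never needs to know which other devices share its power line.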
Asymmetric multiprocessing
When one speaks of multiprocessor systems, the context is almost always symmetric multiprocessing - SMP - where all of the processors are equal. An embedded SoC may not be organized that way, though. Consider, for example, this description from the introduction to a patch set from Ohad Ben-Cohen:
Asymmetric multiprocessing (AMP) is what you get when a system consists of unequal processors running different operating systems. It could be thought of as a form of (very) local-area networking, but all of those cores sit on the same die and share access to memory, I/O controllers, and more. This type of processor isn't simply "running Linux"; instead, it has Linux running on some processors trying to shepherd a mixed collection of operating systems on a variety of CPUs.
Ohad's patch is an attempt to create a structure within which Linux can direct a processor of this type. It starts with a framework called "remoteproc" that allows the registration of "remote" processors. Through this framework, the kernel can power those processors up and down and manage the loading of firmware for them to run. Much of this code is necessarily processor-specific, but the framework abstracts away the details and allows the kernel to deal with remote processors in a more generic fashion.
Once the remote processor is running, the kernel needs to be able to communicate with it. To that end, the patch set creates the concept of "channels" which can be used to pass messages between processors. These messages go through a ring buffer stored in memory visible to both processors; virtio is used to implement the rings. A small piece of processor-specific code is needed to implement a doorbell to inform processors of when a message arrives; the rest should be mostly independent of the actual system that it is running on.
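The general shape of such a channel, a ring in shared memory plus a doorbell, can be sketched in ordinary C. What follows is a simplified single-producer, single-consumer ring and is not the virtio ring layout used by the patch set; all of the toy_* names, the slot count, and the message size are assumptions made for illustration. A real implementation would also need memory barriers and cache management appropriate to the two processors involved.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define RING_SLOTS 8    /* power of two, so index wraparound is harmless */
#define MSG_BYTES  32

/* Hypothetical ring living in memory visible to both processors. */
struct toy_ring {
    unsigned int head;                  /* advanced by the sender */
    unsigned int tail;                  /* advanced by the receiver */
    unsigned char slots[RING_SLOTS][MSG_BYTES];
    size_t lens[RING_SLOTS];
    void (*doorbell)(void *ctx);        /* processor-specific notification */
    void *doorbell_ctx;
};

/* Example doorbell: counts how many times it was rung. */
static void toy_count_doorbell(void *ctx)
{
    (*(int *)ctx)++;
}

/* Queue one message and ring the doorbell.
 * Returns false if the message is too large or the ring is full. */
static bool toy_ring_send(struct toy_ring *r, const void *msg, size_t len)
{
    unsigned int slot;

    if (len > MSG_BYTES || r->head - r->tail == RING_SLOTS)
        return false;
    slot = r->head % RING_SLOTS;
    memcpy(r->slots[slot], msg, len);
    r->lens[slot] = len;
    r->head++;
    if (r->doorbell)
        r->doorbell(r->doorbell_ctx);
    return true;
}

/* Dequeue one message into buf. Returns its length, or -1 if the
 * ring is empty or buf is too small for the next message. */
static int toy_ring_recv(struct toy_ring *r, void *buf, size_t buflen)
{
    unsigned int slot;
    size_t len;

    if (r->head == r->tail)
        return -1;
    slot = r->tail % RING_SLOTS;
    len = r->lens[slot];
    if (len > buflen)
        return -1;
    memcpy(buf, r->slots[slot], len);
    r->tail++;
    return (int)len;
}
```

Only the doorbell callback stands in for processor-specific code here; the queueing logic is the part that, as described above, can be shared across systems.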
This patch set has been reasonably well received as a good start toward the goal of regularizing the management of AMP systems. A complete solution is likely to require quite a bit more work, including implementations for a wider variety of architectures. But, then, one could say that, after twenty years, Linux as a whole is still working toward a complete solution. The hardware continues to evolve toward more complexity; the operating system will have to keep evolving in response. These two patch sets give some hints of the direction that evolution is likely to take in the near future.
Sanitizing log file output
Handling user-controlled data properly is one of the basic principles of computer security. Various kernel log messages allow user-controlled strings to be placed into the messages via the "%s" format specifier, which could be used by an attacker to potentially confuse administrators by inserting control characters into the strings. So Vasiliy Kulikov has proposed a patch that would escape certain characters that appear in those strings. There is some question as to which characters should be escaped, but the bigger question is an age-old one in security circles: whitelisting vs. blacklisting.
The problem stems from the idea that administrators will often use tools like tail and more to view log files on a TTY. If a user can insert control characters (and, in particular, escape sequences) into the log file, they could potentially cause important information to be overlooked - or cause other kinds of confusion. In the worst case, escape sequences could potentially exploit some hole in the terminal emulator program to execute code or cause other misbehavior. In the patch, Kulikov gives the following example: "Control characters might fool root viewing the logs via tty, e.g. using ^[1A to suppress the previous log line." For characters that are filtered, the patch simply replaces them with "#xx", where xx is the hex value of the character.
It's a fairly minor issue, at some level, but it's not at all clear that there is any legitimate use of control characters in those user-supplied strings. The strings could come from various places; two that were mentioned in the discussion were filenames or USB product ID strings. The first version of the patch clearly went too far by escaping characters above 0x7e (in addition to control characters), which would exclude Unicode and other non-ASCII characters. But after complaints about that, Kulikov's second version just excludes control characters (i.e. < 0x20) with the exception of newline and tab.
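The escaping rule of the second version can be sketched as a small user-space C function. This is an illustration of the rule as described here, not the code from the patch: the function name is invented, and the use of lowercase hex digits is an assumption, since the description only says "#xx".

```c
#include <stddef.h>

/* Copy in to out, replacing each control character below 0x20
 * (other than newline and tab) with "#xx", where xx is its hex
 * value. Output is truncated rather than overflowed and is always
 * NUL-terminated when outlen > 0. Returns the number of bytes
 * written, not counting the terminating NUL. */
static size_t escape_ctrl(const char *in, char *out, size_t outlen)
{
    static const char hex[] = "0123456789abcdef";
    size_t n = 0;

    if (outlen == 0)
        return 0;
    for (; *in; in++) {
        unsigned char c = (unsigned char)*in;

        if (c < 0x20 && c != '\n' && c != '\t') {
            if (n + 3 >= outlen)    /* no room for "#xx" plus NUL */
                break;
            out[n++] = '#';
            out[n++] = hex[c >> 4];
            out[n++] = hex[c & 0xf];
        } else {
            if (n + 1 >= outlen)    /* no room for c plus NUL */
                break;
            out[n++] = (char)c;
        }
    }
    out[n] = '\0';
    return n;
}
```

Fed the escape sequence from Kulikov's example, the ESC byte (0x1b) comes out as the visible text "#1b" while the rest of the string passes through untouched. The test in the loop is a whitelist in effect: everything is escaped unless it is known to be acceptable.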
That didn't sit well with Ingo Molnar, however, who thought that, rather than whitelisting the known-good characters, the patch should blacklist those known to be potentially harmful:
[...] It's also the better approach for the kernel: we handle known harmful things and are permissive otherwise.
But, in order to create a blacklist, one must carefully determine the effects of the various control characters on all the different terminal emulators, whereas the whitelist approach has the advantage of being simpler by casting a much wider net. As Kulikov notes, figuring out which characters are problematic is not necessarily simple:
The disagreement between Molnar and Kulikov is one that has gone on in the security world for many years. There is no right answer as to which is better. As with most things in security (and software development for that matter), there are tradeoffs between whitelists and blacklists. In general, for user-supplied data (in web applications for example), the consensus has been to whitelist known-good input, rather than attempting to determine all of the "bad" input to exclude. At least in this case, though, Molnar does not see whitelists as the right approach:
A white list on the other hand does it the wrong way around: it tries to put the 'burden of proof' on the useful, good guys - and that's counter-productive really.
It won't come as a surprise that Kulikov disagreed with that analysis: "What do you do with dangerous characters that are *not yet known* to be dangerous?" While there is little question that whitelisting the known-good characters is more secure, it is less flexible if there is a legitimate use for other control characters in the user-supplied strings. In addition, Molnar is skeptical that there are hidden dangers lurking in the ASCII control characters: "This claim is silly - do you claim some 'unknown bug' in the ASCII printout space?"
In this particular case, either solution should be just fine, as there aren't any good reasons to include those characters, but Molnar is probably right that there aren't hidden dangers in ASCII. There is a question as to whether this change is needed at all, however. The concern that spawned the patch is that administrators might miss important messages or get fooled by carefully crafted input (Willy Tarreau provides an interesting example of the latter). Linus Torvalds is not convinced that it is really a problem that needs addressing:
And afaik, we don't do any escape sequence handling at the console level either, so you cannot mess up the console with control characters.
And the most dangerous character seems to be one that you don't filter: the one we really do react to is '\n', and you could possibly make confusing log messages by embedding a newline in your string and then trying to make the rest look like something bad (say, an oops).
Given Torvalds's skepticism, it doesn't seem all that likely this patch will go anywhere even if it were changed to a blacklisting approach as advocated by Molnar. It is, or should be, a fairly minor concern, but the question about blacklisting vs. whitelisting is one we will likely hear again. There are plenty of examples of both techniques being used in security (and other) contexts. It often comes down to a choice between more security (whitelisting typically) or more usability (blacklisting). This case is no different, really, and others are sure to crop up.
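The difference between the two policies can be made concrete with a pair of predicates. This is only a sketch: the whitelist version follows the rule from the second version of the patch, while the contents of the blacklist are purely hypothetical, since nobody in the discussion settled on an actual list. ESC is used here because the example in the patch relies on an escape sequence.

```c
#include <stdbool.h>

/* Whitelist policy: escape every control character below 0x20
 * unless it is explicitly allowed (newline and tab). */
static bool whitelist_needs_escape(unsigned char c)
{
    return c < 0x20 && c != '\n' && c != '\t';
}

/* Blacklist policy: escape only characters known to be harmful.
 * The single entry here (ESC, 0x1b) is a hypothetical choice. */
static bool blacklist_needs_escape(unsigned char c)
{
    return c == 0x1b;
}
```

Both predicates catch ESC, but a character such as BEL (0x07) is escaped by the first and passed through by the second. Whether that is the blacklist being usefully permissive or dangerously incomplete is the whole disagreement.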
Page editor: Jonathan Corbet