
Kernel development

Brief items

Kernel release status

The current development kernel is 3.2-rc7, released on December 23. As of this writing, the final 3.2 release is imminent; one assumes that Linus is waiting for the LWN Weekly Edition to be published first.

Update: the 3.2 release did, indeed, happen at the predicted time.

Stable updates: the 3.0.15 and 3.1.7 updates were released on January 3; they contain a single fix for a resume problem experienced by some users. The 3.0.16 and 3.1.8 updates are in the review process as of this writing. They contain a longer list of fixes, and can be expected on or after January 6.


Quotes of the week

We're beyond the point where any additional kernel complexity should be considered a regression.
-- Andrew Morton

It would be nice if kernel developers understood that GFP_KERNEL is strongly preferred and that they should put in effort to use it. But there's a strong tendency for people to get themselves into a sticky corner then take the easy way out, resulting in less robust code. Maybe calling the function alloc_percpu_i_really_suck() would convey the hint.
-- Andrew Morton

But hey, the development kernels are still way more interesting than some boring stable kernel. New and exciting features, and you can feel like you're living at the edge rather than puttering around with your grandmothers OS. So pthhththt at you, Greg.
-- Linus Torvalds

Weighing all that up, I don't think it is useful to set our goal on "getting Android to use a mainline kernel" - that isn't going to happen. Rather we should focus primarily on "making it *possible* to run android user-space on mainline".
-- Neil Brown


An IOPS-based I/O scheduler

By Jonathan Corbet
January 4, 2012
I/O schedulers are charged with ordering block I/O operations in a way that maximizes throughput to the device and, perhaps, implementing the system's policy with regard to how the available bandwidth should be divided. The schedulers currently in use in Linux were designed with rotating storage in mind, with the result that they are concerned with avoiding disk seeks and tracking the number of bytes transferred. With solid-state devices, though, I/O locality is (nearly) irrelevant and the number of I/O operations performed is considered to be a better measurement of the amount of device capacity used. The kernel's CFQ scheduler has been evolving to deal better with solid-state devices, but everybody agrees there is more to be done.

Shaohua Li has taken a new approach with the posting of a new I/O scheduler that is optimized for solid-state devices. The patch set factors out and generalizes the CFQ code that tracks device usage, but then uses that code to implement a different scheduling algorithm. Avoiding seeks is no longer a concern; neither is the number of bytes transferred. Instead, this scheduler simply tracks the number of I/O operations submitted by each user, trying to equalize the number from each.
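The central idea, equalizing the number of operations dispatched rather than bytes moved or seeks avoided, can be illustrated with a minimal sketch. This is not Shaohua Li's code; the names `io_user`, `pick_next_user()`, and `dispatch_one()` are hypothetical, and a real scheduler would also have to deal with priorities, idle users, and aging of the counters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-user accounting for an IOPS-based scheduler. */
struct io_user {
    unsigned long dispatched;   /* I/O operations sent to the device so far */
    unsigned int pending;       /* requests still waiting in the queue */
};

/*
 * Pick the user with pending work who has dispatched the fewest
 * operations; request size and sector locality are ignored entirely.
 * Returns the index of the chosen user, or -1 if nothing is pending.
 */
int pick_next_user(const struct io_user *users, size_t n)
{
    int best = -1;
    size_t i;

    for (i = 0; i < n; i++) {
        if (!users[i].pending)
            continue;
        if (best < 0 || users[i].dispatched < users[best].dispatched)
            best = (int)i;
    }
    return best;
}

/* Dispatch one request from the chosen user and charge it one operation. */
int dispatch_one(struct io_user *users, size_t n)
{
    int who = pick_next_user(users, n);

    if (who >= 0) {
        assert(users[who].pending > 0);
        users[who].pending--;
        users[who].dispatched++;
    }
    return who;
}
```

Calling `dispatch_one()` in a loop drains the queues while keeping the per-user operation counts as close together as the pending work allows.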

The result should be a simpler scheduler that is better suited to solid-state devices. At this point, though, it is hard to say for sure. One of the key rules of kernel patch submission is that performance-oriented changes should be accompanied by benchmark results showing that they achieve the intended goal. This patch had no such results, so nobody knows if it is worth their while to look at the code further or not. Presumably the next submission will provide that information, at which point the real discussion of the new scheduler's merits can begin.


Kernel development news

A privilege escalation via SCSI pass-through

By Jake Edge
January 4, 2012

One of the important attributes for virtualization is to provide complete isolation between the virtual machines, so that attackers (or bugs) in one VM cannot interfere with the other VMs. But, as a recent bug report shows, the kernel is vulnerable, in some configurations, to VMs that can read and write the disks of other VMs. That's clearly a serious security problem, but the discussion about patches to fix the bug makes it clear that it may take some time before the fix can be applied.

The problem occurs when programs issue the SCSI pass-through SG_IO ioctl() to a particular disk partition (e.g. /dev/sdb2) or LVM volume, which causes the SCSI command to be sent to the underlying block device (/dev/sdb). The actual commands that can be sent to the device via SG_IO are filtered for processes that don't have the CAP_SYS_RAWIO capability, but there are still dangerous things that can be done. In particular, if a process can write to the partition, it can write to the underlying device without being restricted to the boundaries of that partition.

For virtualization configurations that mingle partitions or volumes used by different VMs on the same block device, that means that a VM can access—and change—the data on another VM's disk. Worse still, if the host OS stores its own data on that block device, a rogue VM could potentially compromise the host. Exploiting the vulnerability does not require a virtualization (or containerization) scenario, but those are the most likely ways that it could come about. Any process that can open the partition device node will be able to issue the ioctl(), but, on "standard" Linux systems, that ability is typically restricted to root.

Based on the bug report, Paolo Bonzini found the problem back in November 2011, but security problems with SG_IO were known as far back as August 2004. Bonzini posted patches to fix the problem at the end of December (though it would appear that the issue was under discussion on the closed kernel security mailing list in the interim). The proposed fix would disallow most SCSI commands on partition-like devices. So, doing any of the "dangerous" SCSI commands would fail unless the ioctl() is being called on the underlying block device.

The patches sparked a few comments from Linus Torvalds, mostly regarding error return codes (partly because ENOTTY is badly named for its use as an indication of "no such ioctl"). But, beyond that, he started to wonder whether there might be situations where users do issue SCSI commands to partitions and expect them to be passed down to the block device. It turns out that there is at least one place where it may be a common event: "ejecting" USB sticks and other removable media. Torvalds notes:

For example, I just traced it, and "eject /dev/sdb1" does a CDROMEJECT ioctl when used as the root user. I haven't tested the patch, but just reading it, I'd expect it to break that.

And that's the *natural* way to eject a mounted device. Look at the USB memory sticks you have. They are almost all partitioned to have one partition, and that one partition doesn't cover the whole device. And it's that one partition you use to interact with it - it's what you mount, and what you eject.

According to Bonzini, the fact that the CDROMEJECT fails on a kernel with his proposed fix doesn't cause any problems in practice. But Torvalds's concern goes beyond that one particular example. The fix has been suggested for merging late in the 3.2 development cycle and his concern was the level of testing that it has been subjected to: "I absolutely do not get the feeling that this has been tested so much and is so obvious that there is no risk of breakage." Based on the discussion, the testing seems to have been focused on ensuring that the security hole was closed, without considering the other impacts that a—fairly sweeping—change might have.

Torvalds would certainly like to see the vulnerability fixed, but not at the expense of a regression in what users have come to depend on. As he pointed out: "Suddenly totally changing things and saying 'you can't do that on a partition' when clearly people *have* been doing that on partitions isn't something we can do without serious testing." His plan is to wait for the 3.3 merge window to bring in the fix, which should allow some testing time for distributions and others to ensure that the code doesn't have any unintended consequences.

While it is important to fix security holes, it is equally important to keep everything else working, which is the bulk of Torvalds's concern. While the 3.3 development cycle may still not be long enough to shake out all of the places where the SCSI pass-through is used on partial disks (partitions or logical volumes), it certainly will provide more of a chance to do so than would a merge in the final stages of 3.2 development. In the meantime, now that the bug and fix are out in the open, concerned administrators can apply the patch or take other steps to remedy the problem.


Safe device assignment with VFIO

By Jonathan Corbet
January 3, 2012
As a general rule, most developers feel that device drivers belong in the kernel. Kernel-space drivers are (hopefully) widely reviewed, implement standard device interfaces, perform better, and are more secure than the user-space variety. There are exceptions, though. Some high-performance applications want to talk to devices directly. Virtualized guests can also be thought of as a sort of user-space process; it is often desirable to allow guests to work with hardware directly rather than funneling their I/O through the host. So the kernel really should support this mode of access for the times when it is needed.

The kernel's UIO interface has been available for the implementation of user-space drivers for some years. UIO has some shortcomings, though, including a lack of support for direct memory access (DMA) operations. DMA under user-space control is challenging to support for a number of reasons, not the least of which is security. A DMA-capable device is normally capable of writing any page in memory; as a result, empowering a user-space process to set up DMA operations is equivalent to giving it full root access. Sometimes a user-space driver can be trusted with that access, but that is often not the case, especially when virtualization is involved.

More recent CPUs have added support for safe (or safer) access to devices from virtualized guests. Devices can be restricted, via an I/O memory management unit (IOMMU), so that only specific regions of memory are accessible to them. Technologies like KVM support a "device assignment" mechanism that uses the hardware capabilities to hand a device to a guest, but device assignment is not without its shortcomings. Among other things, device assignment alone cannot guarantee the isolation of a specific device, and it involves a fair amount of complexity in the kernel.

Alex Williamson's VFIO patch set is an attempt to come up with a better solution that allows the development of safe, high-performance user-space drivers. It provides interfaces allowing those drivers to work with DMA and interrupts while keeping overall control over how devices access the system's resources.

One problem with KVM's device assignment is that it assumes that all devices are fully independent of each other. In practice, though, groups of devices may be connected through the same IOMMU; that means that any device can access any memory regions made available to any other devices in the same group. That, in turn, implies that the group of devices must be assigned as a unit; if any of those devices are assigned separately, the isolation of the group as a whole can be broken.

So the first thing a VFIO driver writer will encounter is the group mechanism. The VFIO code creates the groups to match the hardware topology. It then ensures that every device in a group is controlled by a VFIO driver; if any device is unavailable, then the group as a whole cannot be used. Most devices on a typical system are unlikely to be bound to VFIO drivers at boot, so the system administrator must explicitly unbind them and tell VFIO to claim them. This is probably a good thing; exposing groups of devices to user space is best not done by default.

For each group, a virtual device is created under /dev/vfio; prior to working with any individual device, a driver must open the group, claiming ownership of it. The access permissions on the group file control access to the underlying devices. Once the group has been opened, the driver should do an ioctl(VFIO_GROUP_GET_INFO) call to determine whether the group is "viable" (meaning all of the relevant devices are assigned to it) and available for use. If the group is not viable, the driver will not be able to proceed.
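The viability rule can be expressed in a few lines. What follows is a minimal model of the check rather than the VFIO implementation; `struct group_device` and `group_is_viable()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one device in an IOMMU group. */
struct group_device {
    const char *name;
    bool bound_to_vfio;   /* true once a VFIO driver has claimed it */
};

/* A group is viable only if every device in it is bound to a VFIO driver. */
bool group_is_viable(const struct group_device *devs, size_t n)
{
    size_t i;

    assert(devs != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (!devs[i].bound_to_vfio)
            return false;
    return true;
}
```

A single device left bound to its ordinary driver is enough to make the whole group unusable, which is the property that protects the isolation of the group.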

To work with specific devices, the driver will "open" them with the VFIO_GROUP_GET_DEVICE_FD ioctl() call, which returns a file descriptor for access to the device. The VFIO_DEVICE_GET_REGION_INFO command can be used to learn about the device's memory-mapped I/O regions, which can then be accessed via an mmap() call. VFIO_DEVICE_GET_IRQ_INFO returns information about the device's interrupt assignment(s); the driver can use the eventfd() mechanism to receive notification of interrupts via a file descriptor. For most hardware, access to MMIO and interrupts is enough to communicate with the device.

That still leaves the DMA problem, though. To that end, the VFIO_GROUP_GET_IOMMU_FD command returns a file descriptor representing the IOMMU. DMA mappings can be set up by filling in a vfio_dma_map structure:

    struct vfio_dma_map {
	__u32	argsz;
	__u32	flags;
	__u64	vaddr;		/* Process virtual address */
	__u64	iova;		/* IO virtual address */
	__u64	size;		/* Size of mapping (bytes) */
    };

This structure is used to request a mapping of the user-space memory found at vaddr (of size bytes) into the device's I/O memory range starting at iova; the VFIO_IOMMU_MAP_DMA command actually gets the work done. For most user-space drivers, that should be about all that is needed, modulo a few details.
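As an illustration, a user-space driver might fill in the structure along these lines. The sketch uses a local copy of the structure with `<stdint.h>` types so that it builds without the patch set's headers; `fill_dma_map()` is a hypothetical helper, the use of `argsz` to carry the structure size is an assumption, and the flag values are not described here, so `flags` is left at zero.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the structure shown above, using <stdint.h> types so
 * that the sketch compiles without the VFIO patch set's headers. */
struct vfio_dma_map {
    uint32_t argsz;
    uint32_t flags;
    uint64_t vaddr;   /* Process virtual address */
    uint64_t iova;    /* IO virtual address */
    uint64_t size;    /* Size of mapping (bytes) */
};

/*
 * Describe a mapping of len bytes at buf to the I/O virtual address iova.
 * The flag values are not given in the text, so flags stays 0.
 */
void fill_dma_map(struct vfio_dma_map *map, void *buf, uint64_t len,
                  uint64_t iova)
{
    assert(map != NULL);
    memset(map, 0, sizeof(*map));
    map->argsz = sizeof(*map);   /* assumed: argsz holds the struct size */
    map->vaddr = (uint64_t)(uintptr_t)buf;
    map->iova = iova;
    map->size = len;
    /* Next step (not shown): ioctl(iommu_fd, VFIO_IOMMU_MAP_DMA, map); */
}
```

The filled-in structure would then be handed to the IOMMU file descriptor with the VFIO_IOMMU_MAP_DMA command; that call is left as a comment because it depends on headers from the unmerged patch set.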

Not all VFIO drivers will be in user space, though. Inside the kernel, VFIO looks like a special bus type to which devices can be bound. A VFIO driver needs to provide a set of operations to the core:

    struct vfio_device_ops {
	bool	(*match)(struct device *dev, const char *buf);
	int	(*claim)(struct device *dev);
	int	(*open)(void *device_data);
	void	(*release)(void *device_data);
	ssize_t	(*read)(void *device_data, char __user *buf,
			size_t count, loff_t *ppos);
	ssize_t	(*write)(void *device_data, const char __user *buf,
			 size_t count, loff_t *size);
	long	(*ioctl)(void *device_data, unsigned int cmd,
			 unsigned long arg);
	int	(*mmap)(void *device_data, struct vm_area_struct *vma);
    };

Most of these operations are analogous to those found in struct file_operations or the bus-specific device structures. A device registered in this way can be opened and used like any other device with one difference: the interlock with group ownership is always enforced. If a device has been opened individually, the group is not "viable" and cannot be used by a user-space driver. If, instead, the group has been opened, the individual devices are busy and cannot be opened.

VFIO is not the only patch set aimed at this problem; David Gibson's device isolation infrastructure is also intended to enable safe assignment of devices. The scope of this patch set is smaller, though, focusing mostly on the grouping aspect; there is no mechanism for controlling the IOMMU or working with individual devices. There is a certain amount of disagreement between the two on how grouping should be managed, which suggests, in turn, that a certain amount of discussion will have to take place before either can be merged.


The logger meets linux-kernel

By Jonathan Corbet
January 4, 2012
Toward the end of December, LWN looked at the new push to move various subsystems specific to Android kernels into the mainline. There seems to be broad agreement that merging this code makes sense, but that agreement becomes rather less clear once the discussion moves to the merging of specific subsystems. Tim Bird's request for comments on the Android "logger" mechanism shows that, even with a relatively simple piece of code, there is still a lot of room for disagreement and problems can turn out to be larger than expected.

The logger is a simple device designed to collect log data from applications and make that data available to developers and the central logging system. It was designed around a number of goals, most of which are driven by the untrusted nature of Android applications: the system wants to allow those applications to send messages to the log in a reliable and trustworthy manner. So applications should not be able to fake who they are, consume too much memory with log data, or spam the log at the expense of data from the kernel or other applications. Writing to the log is also meant to be a fast operation; lots of log data is generated, but little of it is read, so the write operation is the one that should be the fastest.

The driver implements a handful of logging "devices" for different streams; they are known as "main," "events," "radio," and "system." These devices are hardwired into the code; there is no way to change the set of devices at run time. Each device is implemented as a simple ring buffer in kernel space; writing to the device adds a message to the buffer (annotated with timestamp and process ID information) while any number of readers can pull data out of the buffer. Each log has 256KB of memory dedicated to it; that size, too, is wired into the code. There is a small set of ioctl() operations provided to get the size of the log or the amount of unread logging data stored there, or to flush all data from the log.
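The ring-buffer behavior described here can be sketched in a few dozen lines. This is a simplified model and not the Android driver: it uses a handful of fixed-size slots where the real code manages a byte-oriented buffer, and every name in it (`log_ring`, `log_write()`, `log_read()`) is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LOG_SLOTS   4     /* tiny on purpose; the real buffers are far larger */
#define LOG_MSG_MAX 32

struct log_entry {
    int32_t pid;              /* supplied by the logger, not by the writer */
    int64_t timestamp;
    char msg[LOG_MSG_MAX];
};

struct log_ring {
    struct log_entry slots[LOG_SLOTS];
    uint64_t head;            /* total number of entries ever written */
};

/* Append a message, overwriting the oldest entry once the ring is full. */
void log_write(struct log_ring *r, int32_t pid, int64_t now, const char *msg)
{
    struct log_entry *e;

    assert(r != NULL && msg != NULL);
    e = &r->slots[r->head % LOG_SLOTS];
    e->pid = pid;
    e->timestamp = now;
    strncpy(e->msg, msg, LOG_MSG_MAX - 1);
    e->msg[LOG_MSG_MAX - 1] = '\0';
    r->head++;
}

/*
 * Each reader keeps its own cursor, so any number of readers can consume
 * the log independently. Returns NULL when the reader has caught up; a
 * reader that fell too far behind skips ahead to the oldest entry left.
 */
const struct log_entry *log_read(const struct log_ring *r, uint64_t *cursor)
{
    if (*cursor >= r->head)
        return NULL;
    if (r->head - *cursor > LOG_SLOTS)
        *cursor = r->head - LOG_SLOTS;
    return &r->slots[(*cursor)++ % LOG_SLOTS];
}
```

Writes never block and never fail; the cost of that guarantee is that a slow reader silently loses the oldest entries, which is the trade-off one would expect from a log that is written far more often than it is read.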

The first question to be raised was: why is this facility needed at all? The kernel already has a mechanism for letting user space add entries to the log stream. In mainline kernels, that is done by writing to /dev/kmsg; that device takes the data given to it and passes it straight to printk(). There are a few reasons why Android would not want to use this interface: it does not allow for the separation of logging streams, it is limited to the root account, it does not add process ID information, and printk() is quite a bit more expensive than simply copying the log data into kernel memory. While adding a duplicate logging infrastructure is obviously problematic, a case could be made that something like logger should be merged and used as a replacement for /dev/kmsg.
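For comparison, writing to the existing interface from user space looks roughly like the following. The helper name `kmsg_write()` is hypothetical; it takes an already-open file descriptor, which in real use would come from opening /dev/kmsg for writing (normally as root), and it relies on printk()'s convention of a "<level>" prefix at the start of the message.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Format one message with a "<level>" prefix and write it to fd in a
 * single write() call. With fd referring to /dev/kmsg, the text ends up
 * in the kernel log by way of printk().
 * Returns the number of bytes written, or -1 on error.
 */
ssize_t kmsg_write(int fd, int level, const char *text)
{
    char buf[256];
    int len;

    assert(level >= 0 && level <= 7);
    len = snprintf(buf, sizeof(buf), "<%d>%s\n", level, text);
    if (len < 0 || (size_t)len >= sizeof(buf))
        return -1;      /* message too long for this sketch */
    return write(fd, buf, (size_t)len);
}
```

Note what is missing: there is no way to name a stream, and nothing attaches the writer's process ID, so any identification has to be placed in the text by the (untrusted) writer itself.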

That said, "something like logger" need not be the logger itself. If nothing else, it is hard to imagine the current code being merged without corrections for the most obvious problems: the hardwired log streams, for example. Logger is entirely unaware of namespaces, making it unsuited to environments where different process groups may want their own logging streams. It also does not really succeed in keeping processes from spamming the logs and overwriting data from other processes; each of the four streams is isolated from the others, but applications still end up logging to the same streams.

There is also the question of whether the user-space API is the right one. Adding more ioctl() calls to the kernel is never a popular thing to do. Some participants have suggested that an entirely user-space-based logging daemon would work just as well. But a user-space solution has some disadvantages: it would not be able to provide trusted process ID information, and it would likely be slower. Communicating log data to a user-space daemon and adding information like time stamps would involve more system calls than simply logging the data through the kernel. It would also be hard to merge kernel and user-space log data (and, in particular, to ensure that the ordering between events is correct) with a user-space scheme.

Neil Brown suggested that a filesystem interface could be used instead of logger's char device ABI. A logging stream could be established simply by creating a file within that filesystem; regular file permissions could then be used to control access to the streams. The ioctl() calls could be replaced with regular filesystem calls; flushing a log would be done with truncate(), for example. A filesystem-based logger would, clearly, involve moving to a different user-space interface, but it seems like it should be able to satisfy the use case that led to the creation of logger with a more Unix-like ABI.
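Since no such filesystem exists yet, the following is only a sketch of what user-space code might look like under that proposal, using ordinary POSIX calls on a regular file as a stand-in for a log stream. The three helper names are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Establish a stream by creating a file; mode bits control who may use it. */
int create_log_stream(const char *path, mode_t mode)
{
    assert(path != NULL);
    return open(path, O_WRONLY | O_CREAT | O_APPEND, mode);
}

/* Stand-in for the "get the size of the log" ioctl(): just stat() the file. */
off_t log_stream_size(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    return st.st_size;
}

/* Stand-in for the "flush" ioctl(): truncate the stream to zero length. */
int flush_log_stream(const char *path)
{
    assert(path != NULL);
    return truncate(path, 0);
}
```

Every operation here is subject to the usual permission checks, so access control falls out of the file's owner and mode rather than requiring logger-specific code.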

Android developer Brian Swetland has indicated that there might be interest in trying out an alternative logging implementation, but it would have to meet their needs and it's not something that they would be interested in creating themselves:

If all discussions of bringing Android kernel support to mainline end up as another round of "you should just rewrite it all in userspace" or "you should use this other subsystem that doesn't actually solve your problem but we think it's close enough", then there's not a lot of point to having the discussions in the first place.

If somebody wants to go write a complete compatible replacement that just drops in, we certainly could take a look at it and see how it works, but given that the benefits are not clear to us, we're not interested in going off and doing that work ourselves.

There is another point to consider as well: the longstanding interest in adding a more structured logging interface to the kernel seems to be getting stronger. Adding a new logging mechanism that doesn't address the concerns of translators, documentation writers, and people maintaining automated network management systems is going to be a hard sell. Kay Sievers, one of the developers behind "The Journal," has a plan of his own for logging; it involves adding more structured information to the existing message stream. He thinks there is no need to add a new, incompatible logging mechanism; he is also strongly opposed to the idea of separating message streams as logger does. That separation, he says, just creates problems when messages from multiple sources need to be merged back together in the correct order.

Brian actually seems receptive to some of these ideas, but has expressed concerns around rate-limiting and controlling access to log data. Lennart Poettering believes these needs can be met by the Journal with a little help from control groups. It seems possible that something could be done to make both groups happy - though the code to actually do so has not been posted.

As of this writing, that is where the discussion stands. On the surface, it looks like another piece of Android technology has gotten tangled up in unrelated requirements from various kernel developers, with the result that it will sink like its predecessors. But there is the potential for a number of possible solutions here. One, of course, would be to simply merge a moderately cleaned-up logger for Android's use and be done with it. Another would be to, as Neil suggested, focus on Android's user-space ABI for logging. If that ABI were formalized in a way that would allow multiple underlying implementations, the mainline could merge something other than logger and still be that much closer to being able to run Android's user space.

Perhaps the most intriguing outcome, though, would come if developers from Android, the Journal, and others with an interest in structured logging could get together and agree on what the next generation of logging should look like. Their solution would be likely to start with a critical mass of users from the outset, improving its chances of being merged and adopted by others. Given the number of structured logging efforts that have floundered in the past, getting a widely-accepted solution in place would be a major accomplishment.



Page editor: Jonathan Corbet

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds