Brief items
The current development kernel is 3.2-rc7,
released on December 23. As of this
writing, the final 3.2 release is imminent; one assumes that Linus is
waiting for the LWN Weekly Edition to be published first.
Update: the 3.2 release did, indeed,
happen at the predicted time.
Stable updates: the 2.6.32.52, 3.0.15 and 3.1.7 updates were released on January 3;
they contain a single fix for a resume problem experienced by some users.
The 2.6.32.53, 3.0.16, and 3.1.8 updates are in the review process as of
this writing. They contain a longer list of fixes, and can be expected on
or after January 6.
We're beyond the point where any additional kernel complexity
should be considered a regression.
--
Andrew Morton
It would be nice if kernel developers understood that GFP_KERNEL
is strongly preferred and that they should put in effort to use
it. But there's a strong tendency for people to get themselves
into a sticky corner then take the easy way out, resulting in less
robust code. Maybe calling the function
alloc_percpu_i_really_suck() would convey the hint.
--
Andrew Morton
But hey, the development kernels are still way more interesting
than some boring stable kernel. New and exciting features, and you
can feel like you're living at the edge rather than puttering
around with your grandmothers OS. So pthhththt at you, Greg.
--
Linus Torvalds
Weighing all that up, I don't think it is useful to set our goal on
"getting Android to use a mainline kernel" - that isn't going to
happen. Rather we should focus primarily on "making it *possible*
to run android user-space on mainline".
--
Neil Brown
By Jonathan Corbet
January 4, 2012
I/O schedulers are charged with ordering block I/O operations in a way that
maximizes throughput to the device and, perhaps, implementing the system's
policy with regard to how the available bandwidth should be divided. The
schedulers currently in use in Linux were designed with rotating storage in
mind, with the result that they are concerned with avoiding disk seeks and
tracking the number of bytes transferred. With solid-state devices,
though, I/O locality is (nearly) irrelevant and the number of I/O operations
performed is considered to be a better measurement of the amount of device
capacity used. The kernel's CFQ scheduler has been evolving to deal better
with solid-state devices, but everybody agrees there is more to be done.
Shaohua Li has taken a new approach with the posting of a new I/O scheduler that is optimized for
solid-state devices. The patch set factors out and generalizes the CFQ
code that tracks device usage, but then uses that code to implement a
different scheduling algorithm. Avoiding
seeks is no longer a concern; neither is the number of bytes transferred.
Instead, this scheduler simply tracks the number of I/O operations
submitted by each user, trying to equalize the number from each.
The result should be a simpler scheduler that is better suited to
solid-state devices. At this point, though, it is hard to say for sure.
One of the key rules of kernel patch submission is that
performance-oriented changes should be accompanied by benchmark results
showing that they achieve the intended goal. This patch had no such
results, so nobody knows if it is worth their while to look at the code
further or not. Presumably the next submission will provide that
information, at which point the real discussion of the new scheduler's
merits can begin.
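The "equalize the number of operations" idea can be illustrated with a minimal sketch. This is not Shaohua Li's code; the structure and function names are hypothetical, and the policy shown (always dispatch from the user with the fewest operations so far) is just one simple way to even out the counts.

```c
#include <stddef.h>

/* Hypothetical per-user accounting; not taken from the posted patch set. */
struct io_user {
    unsigned long dispatched; /* I/O operations dispatched so far */
    unsigned int  pending;    /* requests still waiting in this user's queue */
};

/*
 * Pick the user that has pending requests and the smallest dispatched
 * count. Ties go to the lowest index. Returns -1 if nothing is pending.
 */
int pick_next_user(const struct io_user *users, size_t n)
{
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        if (users[i].pending == 0)
            continue;
        if (best < 0 || users[i].dispatched < users[best].dispatched)
            best = (int)i;
    }
    return best;
}

/* Dispatch one request from the chosen user and update the accounting. */
int dispatch_one(struct io_user *users, size_t n)
{
    int u = pick_next_user(users, n);

    if (u >= 0) {
        users[u].pending--;
        users[u].dispatched++;
    }
    return u;
}
```

Note that neither seek distance nor request size appears anywhere in the decision; only the operation count matters.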
Kernel development news
By Jake Edge
January 4, 2012
Complete isolation between virtual machines is one of the important
attributes of virtualization; attackers (or bugs) in one
VM should not be able to interfere with the other VMs. But, as a recent bug report
shows, the kernel is vulnerable, in some configurations, to VMs that can
read and write the disks of other VMs. That's clearly a serious security
problem, but the discussion about patches to fix the bug makes it
clear that it may take some time before the fix can be applied.
The problem occurs when programs issue the SCSI pass-through SG_IO
ioctl()
to a particular disk partition (e.g. /dev/sdb2) or LVM volume,
which causes the
SCSI command to be sent to the underlying block device (/dev/sdb).
The actual commands that can be sent to the device via SG_IO are filtered for
processes that don't have the CAP_SYS_RAWIO capability, but there
are still dangerous things that can be done. In particular, if a process
can write to the partition, it can write to the underlying device without
being restricted to the boundaries of that partition.
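The difference between the two paths can be modeled in a few lines. What follows is a simplified illustration rather than kernel code, and all of the names in it are hypothetical: ordinary block I/O on a partition is bounds-checked and offset by the partition's start, while a block address embedded in a passed-through SCSI command reaches the disk unchanged.

```c
#include <stdint.h>

/* Hypothetical model of a partition on a whole-disk device. */
struct partition {
    uint64_t start;    /* first sector of the partition on the disk */
    uint64_t nr_sects; /* length of the partition in sectors */
};

/*
 * Ordinary block I/O: the sector is relative to the partition. It is
 * checked against the partition's size and then remapped to an absolute
 * disk sector. Returns 0 on success, -1 if the request is out of bounds.
 */
int remap_block_io(const struct partition *p, uint64_t sector,
                   uint64_t count, uint64_t *disk_sector)
{
    if (sector > p->nr_sects || count > p->nr_sects - sector)
        return -1;
    *disk_sector = p->start + sector;
    return 0;
}

/*
 * Pass-through: the logical block address inside the SCSI command is
 * interpreted by the disk itself, so the partition plays no part.
 */
uint64_t passthrough_lba(const struct partition *p, uint64_t lba)
{
    (void)p; /* the partition boundaries are never consulted */
    return lba;
}
```

In this model, a request for block 0 through the pass-through path lands on sector 0 of the whole disk, even when the partition starts at sector 2048.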
For
virtualization configurations that mingle partitions or volumes used by
different VMs
on the same block device, that means that a VM can access—and
change—the data on another VM's disk. Worse still, if the host OS
stores its own data on that block device, a rogue VM could potentially
compromise the host.
Exploiting the vulnerability does not require a virtualization (or
containerization) scenario, but those are the most likely ways that it
could come about. Any
process that can open the partition
device node will be able to issue the ioctl(), but, on "standard" Linux
systems, that ability is typically restricted to root.
Based on the bug report, Paolo Bonzini found the problem back in November 2011,
but security problems with SG_IO were known as far back as August 2004.
Bonzini
posted patches to fix the problem at
the end of December (though it would appear that the issue was under
discussion on the closed kernel security mailing list in the interim). The
proposed fix would disallow most SCSI commands on partition-like devices.
So, doing any of the "dangerous" SCSI commands would fail
unless the ioctl() is being called on the underlying block device.
The
patches sparked a few comments from Linus Torvalds, mostly regarding error
return codes (partly because ENOTTY is badly named for its use as
an indication of "no such ioctl"). But, beyond that, he started to wonder
whether there might be situations where users do issue SCSI commands
to partitions and expect them to be passed down to the block device. It
turns out that there is at least one place
where it may be a common
event: "ejecting" USB sticks and other removable media. Torvalds notes:
For example, I just traced it, and "eject /dev/sdb1" does a CDROMEJECT
ioctl when used as the root user. I haven't tested the patch, but just
reading it, I'd expect it to break that.
And that's the *natural* way to eject a mounted device. Look at the
USB memory sticks you have. They are almost all partitioned to have
one partition, and that one partition doesn't cover the whole device.
And it's that one partition you use to interact with it - it's what
you mount, and what you eject.
According to Bonzini, the fact that the
CDROMEJECT fails on a kernel with his proposed fix doesn't cause
any problems in practice. But Torvalds's concern goes beyond that one
particular example. The fix was proposed for merging late in the 3.2
development cycle, and his concern was the level of testing that it had
received: "I absolutely do not get the feeling that this has been tested so much
and is so obvious that there is no risk of breakage." Based on the
discussion, the testing seems to have been focused on ensuring that the
security hole was closed, without considering the other effects that such a fairly sweeping change might have.
Torvalds would certainly like to see the vulnerability fixed, but not at
the expense of a regression in what users have come to depend on. As he pointed out: "Suddenly
totally changing things and saying 'you can't do that on a partition'
when clearly people *have* been doing that on partitions isn't
something we can do without serious testing." His plan is to wait
for the 3.3 merge window to bring in the fix, which should allow some
testing time for distributions and others to ensure that the code doesn't
have any unintended consequences.
While it is important
to fix security holes, it is equally important to keep everything else
working, which is the bulk of Torvalds's concern. While the 3.3
development cycle may still not be long enough to shake out all of
the places where the SCSI pass-through is used on partial
disks (partitions or logical volumes), it certainly will provide more of a
chance to do so than would a merge in the final stages of 3.2 development.
In the meantime, now that the bug and fix are out in the open, concerned
administrators can apply the patch or take other steps to remedy the problem.
By Jonathan Corbet
January 3, 2012
As a general rule, most developers feel that device drivers belong in the
kernel. Kernel-space drivers are (hopefully) widely reviewed, implement
standard device interfaces, perform better, and are more secure than the
user-space variety. There are exceptions, though. Some high-performance
applications want to talk to devices directly. Virtualized guests can also
be thought of as a sort of user-space process; it is often desirable to
allow guests to work with hardware directly rather than funneling their
I/O through the host. So the kernel really should support this mode of
access for the times when it is needed.
The kernel's UIO interface has been
available for the
implementation of user-space drivers for some years. UIO has some
shortcomings, though, including
a lack of support for direct memory access (DMA) operations. DMA under
user-space control is challenging to support for a number of reasons, not
the least of which is security. A DMA-capable device is normally capable
of writing any page in memory; as a result, empowering a user-space
process to set up DMA operations is equivalent to giving it full root
access. Sometimes a user-space driver can be trusted with that access, but
that is often not the case, especially when virtualization is involved.
More recent CPUs have added support for safe (or safer) access to devices
from virtualized guests. Devices can be restricted, via an I/O memory
management unit (IOMMU) so that only specific regions of memory are
accessible to them. Technologies like KVM support a "device assignment"
mechanism that uses the hardware capabilities to hand a device to a guest,
but device assignment is not without its shortcomings. Among other things,
device assignment alone cannot guarantee the isolation of a specific
device, and it involves a fair amount of complexity in the kernel.
Alex Williamson's VFIO patch set is an
attempt to come up with a better solution that allows the development of
safe, high-performance user-space drivers. It provides interfaces allowing
those drivers to work with DMA and interrupts while keeping overall control
over how devices access the system's resources.
One problem with KVM's device assignment is that it assumes that
all devices are fully independent of each other. In practice, though, groups
of devices may be connected through the same IOMMU; that means that any
device can access any memory regions made available to any other devices in
the same group. That, in turn, implies that the group of devices must be
assigned as a unit; if any of those devices are assigned separately, the
isolation of the group as a whole can be broken.
So the first thing a VFIO driver writer will encounter is the group
mechanism. The VFIO code creates the groups to match the hardware
topology. It then ensures that every device in a group is controlled by a
VFIO driver; if any device is unavailable, then the group as a
whole cannot be used. Most devices on a typical system are unlikely to be
bound to VFIO drivers at boot, so the system administrator must explicitly
unbind them and tell VFIO to claim them. This is probably a good thing;
exposing groups of devices to user space is best not done by default.
For each group, a virtual device is created under /dev/vfio; prior
to working with any individual device, a driver must open the group,
claiming ownership of it. The access permissions on the group file control
access to the underlying devices. Once the group has been opened, the
driver should do an ioctl(VFIO_GROUP_GET_INFO) call to determine
whether the group is "viable" (meaning all of the relevant devices are
assigned to it) and available for use. If the group is not viable, the
driver will not be able to proceed.
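The viability rule itself is simple enough to express as a small model. The sketch below is not the VFIO implementation; the type and function names are made up for illustration, and the only rule encoded is the one described above: a group is usable only when every device in it is bound to a VFIO driver.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a device that belongs to an IOMMU group. */
struct vfio_dev_model {
    bool bound_to_vfio; /* true once the administrator has rebound it */
};

/* A group is viable only if no member is still held by another driver. */
bool group_viable(const struct vfio_dev_model *devs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!devs[i].bound_to_vfio)
            return false;
    }
    return true;
}
```

A single device left attached to its ordinary kernel driver is enough to make the whole group unusable from user space.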
To work with specific devices, the driver will "open" them with the
VFIO_GROUP_GET_DEVICE_FD ioctl() call, which returns a
file descriptor for access to the device. The
VFIO_DEVICE_GET_REGION_INFO command can be used to learn about the
device's memory-mapped I/O regions, which can then be accessed via an
mmap() call. VFIO_DEVICE_GET_IRQ_INFO returns
information about the device's interrupt assignment(s); the driver can use
the eventfd() mechanism to receive notification of interrupts via
a file descriptor. For most hardware, access to MMIO and interrupts is
enough to communicate with the device.
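For readers unfamiliar with eventfd(), the following sketch shows how notifications accumulate on such a file descriptor. It does not touch VFIO at all: the "interrupts" are simulated by writing to the descriptor from the same process, and the function name is hypothetical. Only the standard Linux eventfd(), write(), read(), and close() calls are used.

```c
#include <stdint.h>
#include <sys/types.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Signal an eventfd n times (standing in for n device interrupts), then
 * read it once. A single read returns the accumulated count and resets
 * it to zero. Returns the count, or -1 on error (including the case
 * where nothing was signaled, since the descriptor is non-blocking).
 */
long count_simulated_irqs(int n)
{
    uint64_t one = 1, total = 0;
    int fd = eventfd(0, EFD_NONBLOCK);

    if (fd < 0)
        return -1;

    for (int i = 0; i < n; i++) {
        if (write(fd, &one, sizeof(one)) != (ssize_t)sizeof(one)) {
            close(fd);
            return -1;
        }
    }

    if (read(fd, &total, sizeof(total)) != (ssize_t)sizeof(total)) {
        close(fd);
        return -1;
    }

    close(fd);
    return (long)total;
}
```

A real driver would hand the descriptor to the kernel and wait on it with poll() or a blocking read() rather than signaling it itself.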
That still leaves the DMA problem, though. To that end, the
VFIO_GROUP_GET_IOMMU_FD command
returns a file descriptor representing the IOMMU. DMA mappings can be set
up by filling in a vfio_dma_map structure:
    struct vfio_dma_map {
        __u32 argsz;
        __u32 flags;
        __u64 vaddr;    /* Process virtual address */
        __u64 iova;     /* IO virtual address */
        __u64 size;     /* Size of mapping (bytes) */
    };
This structure is used to request a mapping of the user-space memory found
at vaddr (of size bytes) into the device's I/O memory
range starting at iova; the VFIO_IOMMU_MAP_DMA command
actually gets the work done.
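Filling in the structure might look something like the sketch below. It uses a local copy of the structure with standard fixed-width types so that it compiles without the patch set's headers. Setting argsz to the size of the structure and leaving flags at zero are assumptions, since the article does not spell out either convention, and the helper name is hypothetical. The ioctl() call itself is omitted.

```c
#include <stdint.h>
#include <string.h>

/* Local copy of the structure; uint32_t/uint64_t stand in for __u32/__u64. */
struct vfio_dma_map_sketch {
    uint32_t argsz;
    uint32_t flags;
    uint64_t vaddr; /* Process virtual address */
    uint64_t iova;  /* IO virtual address */
    uint64_t size;  /* Size of mapping (bytes) */
};

/* Describe a mapping of the buffer at buf into the device's space at iova. */
void fill_dma_map(struct vfio_dma_map_sketch *map, void *buf,
                  uint64_t iova, uint64_t size)
{
    memset(map, 0, sizeof(*map));
    map->argsz = sizeof(*map);            /* assumed: size of the argument */
    map->flags = 0;                       /* assumed: no flags requested */
    map->vaddr = (uint64_t)(uintptr_t)buf;
    map->iova = iova;
    map->size = size;
}
```

The filled-in structure would then be passed to the IOMMU file descriptor with the VFIO_IOMMU_MAP_DMA command.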
For most user-space drivers, that should be about all that is needed,
modulo a few details.
Not all VFIO drivers will be in user space, though. Inside the kernel,
VFIO looks like a special bus type to which devices can be bound. A VFIO
driver needs to provide a set of operations to the core:
    struct vfio_device_ops {
        bool    (*match)(struct device *dev, const char *buf);
        int     (*claim)(struct device *dev);
        int     (*open)(void *device_data);
        void    (*release)(void *device_data);
        ssize_t (*read)(void *device_data, char __user *buf,
                        size_t count, loff_t *ppos);
        ssize_t (*write)(void *device_data, const char __user *buf,
                         size_t count, loff_t *size);
        long    (*ioctl)(void *device_data, unsigned int cmd,
                         unsigned long arg);
        int     (*mmap)(void *device_data, struct vm_area_struct *vma);
    };
Most of these operations are analogous to those found in struct
file_operations or the bus-specific device structures. A device
registered in this way can be opened and used like any other device with one
difference: the interlock with group ownership is always enforced. If a
device has been opened individually, the group is not "viable" and cannot
be used by a user-space driver. If, instead, the group has been opened,
the individual devices are busy and cannot be opened.
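That interlock can be modeled as a pair of mutually exclusive states. The sketch below is an illustration only; the names are hypothetical, and the use of EBUSY as the failure code is an assumption rather than something taken from the patch set.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the ownership state of one group. */
struct group_state {
    bool group_open;            /* a user-space driver owns the group */
    unsigned int devs_open;     /* devices opened individually */
};

/* Opening the group fails if any member device is already in use. */
int open_group(struct group_state *g)
{
    if (g->devs_open > 0)
        return -EBUSY;          /* the group is not viable */
    g->group_open = true;
    return 0;
}

/* Opening a device on its own fails once the group has been claimed. */
int open_device_individually(struct group_state *g)
{
    if (g->group_open)
        return -EBUSY;          /* the device is busy */
    g->devs_open++;
    return 0;
}
```

Whichever side gets there first wins; the other is refused until the first lets go.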
VFIO is not the only patch set aimed at this problem; David Gibson's device isolation infrastructure is also
intended to enable safe assignment of devices. The scope of this patch set
is smaller, though, focusing mostly on the grouping aspect; there is no
mechanism for controlling the IOMMU or working with individual devices.
There is a certain amount of disagreement between the two on how grouping
should be managed which suggests, in turn, that a certain amount of
discussion will have to take place before either can be merged.
By Jonathan Corbet
January 4, 2012
Toward the end of December, LWN
looked at the
new push to move various subsystems specific to Android kernels into
the mainline. There seems to be broad agreement that merging this code
makes sense, but that agreement becomes rather less clear once the
discussion moves to the merging of specific subsystems. Tim Bird's
request for comments on the Android "logger"
mechanism shows that, even with a relatively simple piece of code, there is
still a lot of room for disagreement and problems can turn out to be larger
than expected.
The logger is a simple device designed to collect log data from
applications and make that data available to developers and the central logging
system. It was designed around a number of goals, most of which are driven
by the untrusted nature of Android applications: the system wants to allow
those applications to send messages to the log in a reliable and
trustworthy manner. So applications should not be able to fake who they
are, consume too much memory with log data, or spam the log at the expense
of data from the kernel or other applications. Writing to the log is also
meant to be a fast operation; lots of log data is generated, but little of it
is read, so the write operation is the one that should be the fastest.
The driver implements a handful of logging "devices" for different streams;
they are known as "main," "events," "radio," and "system." These devices
are hardwired into the code; there is no way to change the set of devices
at run time. Each device is implemented as a simple ring buffer in kernel
space; writing to the device adds a message to the buffer (annotated with
timestamp and process ID information) while any number of readers can pull
data out of the buffer. Each log has 256KB of memory dedicated to it;
that size, too, is wired into the code. There is a small set of
ioctl() operations provided to get the size of the log or the amount
of unread logging data stored there, or to flush all data from the log.
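A ring buffer of this kind can be sketched in a few dozen lines. The code below is not the Android logger; the names and the entry layout are hypothetical, timestamps are left out, and the buffer is tiny so that the overwrite behavior is easy to see. When there is not enough room for a new message, the oldest entries are discarded.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LOG_SIZE 64 /* tiny for illustration; the real logs are 256KB */

struct entry_hdr {
    uint16_t len; /* payload length in bytes */
    int32_t  pid; /* process ID of the writer */
};

struct ring_log {
    unsigned char buf[LOG_SIZE];
    size_t head; /* offset of the oldest entry */
    size_t used; /* bytes currently stored */
};

/* Append n bytes at the write position, wrapping around the buffer end. */
static void ring_put(struct ring_log *log, const void *src, size_t n)
{
    const unsigned char *p = src;
    size_t w = (log->head + log->used) % LOG_SIZE;

    for (size_t i = 0; i < n; i++)
        log->buf[(w + i) % LOG_SIZE] = p[i];
    log->used += n;
}

/* Copy n bytes starting off bytes past the oldest entry, without consuming. */
static void ring_peek(const struct ring_log *log, size_t off, void *dst, size_t n)
{
    unsigned char *p = dst;

    for (size_t i = 0; i < n; i++)
        p[i] = log->buf[(log->head + off + i) % LOG_SIZE];
}

/* Add a message, discarding the oldest entries until it fits. */
int log_write(struct ring_log *log, int32_t pid, const char *msg)
{
    size_t len = strlen(msg);
    size_t total = sizeof(struct entry_hdr) + len;
    struct entry_hdr h, old;

    if (total > LOG_SIZE)
        return -1;
    while (LOG_SIZE - log->used < total) {
        ring_peek(log, 0, &old, sizeof(old));
        log->head = (log->head + sizeof(old) + old.len) % LOG_SIZE;
        log->used -= sizeof(old) + old.len;
    }
    memset(&h, 0, sizeof(h));
    h.len = (uint16_t)len;
    h.pid = pid;
    ring_put(log, &h, sizeof(h));
    ring_put(log, msg, len);
    return 0;
}

/* Read the oldest entry without removing it; returns its length or -1. */
int log_read_oldest(const struct ring_log *log, int32_t *pid, char *out, size_t outsz)
{
    struct entry_hdr h;

    if (log->used == 0)
        return -1;
    ring_peek(log, 0, &h, sizeof(h));
    if ((size_t)h.len + 1 > outsz)
        return -1;
    ring_peek(log, sizeof(h), out, h.len);
    out[h.len] = '\0';
    *pid = h.pid;
    return h.len;
}
```

The write path involves nothing more than a couple of copies, which is why it is cheap; it also shows how one chatty writer can push everybody else's messages out of a shared stream.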
The first question to be raised was: why is this facility needed at all?
The kernel already has a mechanism for letting user space add entries to
the log stream. In mainline kernels, that is done by writing to
/dev/kmsg; that device takes the data given to it and passes it
straight to printk(). There are a few reasons why Android would
not want to use this interface: it does not allow for the separation of
logging streams, it is limited to the root account, it does not
add process ID information, and
printk() is quite a bit more expensive than simply copying the
log data into kernel memory. While adding a duplicate logging
infrastructure is obviously problematic, a case could be made that
something like logger should be merged and used as a replacement for
/dev/kmsg.
That said, "something like logger" need not be the logger itself. If
nothing else, it is hard to imagine the current code being merged without
corrections for the most obvious problems: the hardwired log streams, for
example. Logger is entirely unaware of namespaces, making it unsuited
to environments where different process groups may want their own logging
streams. It also does not really succeed in keeping processes from
spamming the logs and overwriting data from other processes; each of the
four streams is
isolated from the others, but applications still end up logging to the same
streams.
There is also the question of whether the user-space API is the right one.
Adding more ioctl() calls to the kernel is never a popular thing
to do. Some participants have suggested that an entirely user-space-based
logging daemon would work just as well. But a user-space solution has some
disadvantages: it would not be able to provide trusted process ID
information, and it would likely be slower. Communicating log data to a
user-space daemon and adding information like time stamps would involve
more system calls than simply logging the data through the kernel. It
would also be hard to merge kernel and user-space log data (and, in
particular, to ensure that the ordering between events is correct) with
a user-space scheme.
Neil Brown suggested that a
filesystem interface could be used instead of logger's char device ABI. A
logging stream could be
established simply by creating a file within that filesystem; regular file
permissions could then be used to control access to the streams. The
ioctl() calls could be replaced with regular filesystem calls;
flushing a log would be done with truncate(), for example. A
filesystem-based logger would, clearly, involve moving to a different
user-space interface, but it seems like it should be able to satisfy the
use case that led to the creation of logger with a more Unix-like ABI.
Android developer Brian Swetland has indicated that there might be interest in
trying out an alternative logging implementation, but it would have to meet
their
needs and it's not something that they would be interested in creating
themselves:
If all discussions of bringing Android kernel support to mainline
end up as another round of "you should just rewrite it all in
userspace" or "you should use this other subsystem that doesn't
actually solve your problem but we think it's close enough", then
there's not a lot of point to having the discussions in the first
place.
If somebody wants to go write a complete compatible replacement
that just drops in, we certainly could take a look at it and see
how it works, but given that the benefits are not clear to us,
we're not interested in going off and doing that work ourselves.
There is another point to consider as well: the longstanding interest in
adding a more structured logging interface to the kernel seems to be
getting stronger. Adding a new logging mechanism that doesn't address the
concerns of translators, documentation writers, and people maintaining
automated network management systems is going to be a hard sell. Kay
Sievers, one of the developers behind "The Journal," has a plan of his own for logging; it involves
adding more structured information to the existing message stream. He
thinks there is no need to add a new, incompatible logging mechanism; he is
also strongly opposed to the idea of separating message streams as logger
does. That separation, he says, just creates problems when messages from
multiple sources need to be merged back together in the correct order.
Brian actually seems receptive to some of
these ideas, but has expressed concerns around rate-limiting and
controlling access to log data. Lennart Poettering believes these needs can be met by the Journal
with a little help from control groups. It seems possible that something
could be done to make both groups happy - though the code to actually do so
has not been posted.
As of this writing, that is where the discussion stands. On the surface,
it looks like another piece of Android technology has gotten tangled up in
unrelated requirements from various kernel developers, with the result that
it will sink like its predecessors. But there is the potential for a
number of possible solutions here. One, of course, would be to simply
merge a moderately cleaned-up logger for Android's use and be done with
it. Another would be to, as Neil suggested, focus on Android's user-space ABI
for logging. If that ABI were formalized in a way that would allow
multiple underlying implementations, the mainline could merge something
other than logger and still be that much closer to being able to run
Android's user space.
Perhaps the most intriguing outcome, though, would come if developers from
Android, the Journal, and others with an interest in structured logging
could get together and agree on what the next generation of logging should
look like. Their solution would be likely to start with a critical mass of
users
from the outset, improving its chances of being merged and adopted by
others. Given the number of structured logging efforts that have
foundered in the past, getting a widely-accepted solution in place would
be a major accomplishment.
Page editor: Jonathan Corbet