Brief items
The 3.11 kernel is out,
released on
September 2. Some significant features in this release include the
Lustre distributed filesystem, transparent huge page support for the ARM
architecture, Xen and KVM virtualization for ARM64, the
O_TMPFILE
open flag, dynamic power management in the Radeon graphics driver, the
low-latency Ethernet polling patch set, and
more. See
the KernelNewbies
3.11 page for lots of details.
Stable updates:
3.10.10, 3.4.60, and 3.0.94 were released on August 29.
Please people! When you post ssh addresses, always remember to also
post your user name and password or private key with the pull
request.
—
Linus Torvalds
I fail to see the benefit of just using the hardware random number
generator. We are already mixing in the hardware random number
generator into the /dev/random pool, and so the only thing that
using only the HW source is to make the kernel more vulnerable to
an attack where the NSA leans on a few Intel employee and
forces/bribes them to make a change such that the last step in the
RDRAND's AES whitening step is changed to use a counter plus a AES
key known by the NSA.
—
Ted Ts'o
David Herrmann
describes
some recent graphics driver work, in which control of display modes is
being separated from access to the rendering engine. "
So whenever an
application wants hardware-accelerated rendering, GPGPU access or
offscreen-rendering, it no longer needs to ask a graphics-server (via DRI
or wl_drm) but can instead open any available render node and start using
it. Access-control to render-nodes is done via standard file-system
modes. It’s no longer shared with mode-setting resources and thus can be
provided for less-privileged applications."
Greg Kroah-Hartman has put together
a
step-by-step tutorial on how to build and boot a self-signed kernel on
a UEFI secure boot system. "
The first two options here enable EFI
mode, and tell the kernel to build itself as a EFI binary that can be run
directly from the UEFI bios. This means that no bootloader is involved at
all in the system, the UEFI bios just boots the kernel, no “intermediate”
step needed at all. As much as I love gummiboot, if you trust the kernel
image you are running is 'correct', this is the simplest way to boot a
signed kernel."
Kernel development news
By Jonathan Corbet
September 5, 2013
The 3.12 merge window started right on time on September 3; by the time
this article was written, over 3,500 patches had been pulled into the
mainline. There is, once again, a great deal of internal cleanup work
going on that does not look impressive in a feature list, but the benefits
of that work will be felt well into the future. In particular, some of the
performance work that has been done this time around should speed up Linux
considerably in a number of settings.
User-visible features merged for 3.12 so far include:
- The Lustre filesystem, added in 3.11, is now enabled in the build
system. Quite a bit of cleanup work for Lustre has been merged for
3.12.
- The long-deprecated /proc/acpi/event interface has been
removed. If anybody actually needs this file, the 3.12 development
cycle is the time to raise a fuss.
- The pstore mechanism (which stores
crash information in a persistent storage location) is now able to
store compressed data.
- The full-system idle detection patch
set has been pulled. This work enables the kernel to detect when the
entire system is idle and turn off the clock tick, thus improving
performance when the full dynamic tick feature is used.
- The "paravirtualized ticket spinlocks" mechanism allows for more
efficient locking in virtualized guests. In short, if a spinlock is
unavailable for anything more than a brief period, the lock code will
stop spinning and call into the hypervisor to simply wait until the
lock becomes available again.
- New hardware support includes:
- Audio:
Wolfson Microelectronics WM8997 codecs,
Atmel AT91ASM9x5 boards with WM8904 codecs,
TI PCM1792A and PCM1681 codecs,
Asahi Kasei Microdevices AK4554 audio chips,
Renesas R-Car SoC audio controllers, and
Freescale S/PDIF and SSI AC'97 controllers.
- Block:
ATTO Technology ExpressSAS RAID adapters.
The ATA layer has also gained the ability to take advantage of
newer solid-state drives
that support the queued version of the TRIM command, removing
much of the cost of TRIM operations.
- Hardware monitoring and related:
Dialog Semiconductor DA9063 regulators,
Marvell 88PM800 power regulators,
Freescale PFUZE100 PMIC-based regulators, and
Measurement Specialties HTU21D humidity/temperature sensors.
- Miscellaneous:
Humusoft MF624 DAQ PCI cards,
Xillybus generic FPGA interfaces,
Digi EPCA, Neo and Classic serial ports,
ST ASC serial ports,
Nuvoton NAU7802 analog-to-digital converters,
TI TWL6030 analog-to-digital converters,
TI Palmas series pin controllers,
Avago APDS9300 ambient light sensors, and
Bosch BMA180 triaxial acceleration sensors.
- Networking:
Realtek RTL8188EU wireless interfaces.
- Serial peripheral interface:
Freescale DSPI controllers,
Energy Micro EFM32 SoC-based SPI controllers,
Blackfin v3 SPI controllers, and
TI DRA7xxx QSPI controllers.
- USB:
Faraday FOTG210 OTG controllers and
GCT GDM724x LTE chip-based USB modem devices.
Changes visible to kernel developers include:
- There is a new reference count called a "lockref", defined in
<linux/lockref.h>. It combines a spinlock and a
reference count in a way that allows changes to the reference count to
be made without having to take the lock. See this article for details on how
lockrefs work.
- The S390 architecture has been converted to the generic
interrupt-handling mechanism. Since S390 was the last holdout, this
mechanism
will become mandatory and the associated CONFIG_GENERIC_HARDIRQS
configuration option will go away.
- There is a new mechanism for debugging kobject lifecycle issues; it
works by delaying the calling of the release() function when
the reference count drops to zero. Most of the time,
release() is called while the driver is shutting down the
associated device, but there is no guarantee of that. Turning on
CONFIG_DEBUG_KOBJECT_RELEASE will help find cases where the driver is
not prepared for a delayed release() call.
- The PTR_RET() function has been renamed
PTR_ERR_OR_ZERO();
all internal users have been changed.
Your editor predicts that the merge window will close on September 15,
just before the start of LinuxCon and the Linux Plumbers Conference.
By Jonathan Corbet
September 4, 2013
Reference counts are often used to track the lifecycle of data structures
within the kernel. This counting is efficient, but it can lead to a lot of
cache-line bouncing for frequently-accessed objects. The cost of this
bouncing is made even worse if the reference count must be protected by a
spinlock. The 3.12 kernel will include a new locking primitive called a
"lockref" that, by combining the spinlock and the reference count into a
single eight-byte quantity, is able to reduce that cost considerably.
In many cases, reference counts are implemented with atomic_t
variables that can be manipulated without taking any locks. But the
lockless nature of an atomic_t is only useful if the reference count
can be changed independently of any other part of the reference-counted
data structure.
Otherwise, the structure as a whole must be locked first. Consider, for
example, the heavily-used dentry structure, where reference count
changes cannot be made if some other part of the kernel is working with the
structure. For this reason, struct dentry prior to 3.12
contains these fields:
    unsigned int d_count;   /* protected by d_lock */
    spinlock_t d_lock;      /* per dentry lock */
Changing d_count requires acquiring d_lock first. On a
system with a filesystem-intensive workload, contention
on d_lock is a serious performance bottleneck; acquiring the lock
for reference count changes is a significant part of the problem. It would
thus be nice to find a way to avoid that locking overhead, but it is not
possible to use
atomic operations for d_count, since any thread holding
d_lock must not see the value of d_count change.
The "lockref" mechanism added at the beginning of the 3.12 merge window
allows mostly-lockless manipulation of a reference count while still
respecting an associated
lock; it was originally implemented by
Waiman Long, then modified somewhat
by Linus prior to merging. A lockref works by packing the reference count
and the spinlock into a single eight-byte structure that looks like:
    struct lockref {
        union {
            aligned_u64 lock_count;
            struct {
                spinlock_t lock;
                unsigned int count;
            };
        };
    };
Conceptually, the code works by checking to be sure that the lock is not
held, then incrementing (or decrementing) the reference count while
verifying that no other thread takes the lock while the change is
happening. The key to this operation is the magic cmpxchg()
macro:
u64 cmpxchg(u64 *location, u64 old, u64 new);
This macro maps directly to a machine instruction that will store the
new value into *location, but only if the current value
in *location matches old. In the lockref case, the
location is the lock_count field in the structure, which
holds both the spinlock and the reference count. An increment operation
will check the state of the lock, compute the new reference count, then use
cmpxchg() to atomically store the new value, ensuring that neither
the count nor the lock has changed in the meantime. If things do
change, the code will either try again or fall back to old-fashioned
locking, depending on whether the lock is free or not.
This trickery allows reference count changes to be made (most of the time)
without actually acquiring the spinlock and, thus, without contributing to
lock contention. The associated performance improvement can be impressive
— a factor of six, for example, with one of
Waiman's benchmarks testing filesystem performance on a large system.
Given that the new lockref code is
only being used in one place (the dentry cache), that is an impressive
return from a relatively small amount of changed code.
At the moment, only 64-bit x86 systems have a full lockref implementation.
It seems likely, though, that other architectures will gain support by the
end of the 3.12 development cycle, and that lockrefs will find uses in
other parts of the kernel in later cycles. Meanwhile, the focus on lock
overhead has led to improvements elsewhere
in the filesystem layer that should make their way in during this merge
window; it has also drawn attention to some other places where the locking
can clearly be improved with a bit more work. So, in summary, we will see
some significant
performance improvements in 3.12, with more to come in the near future.
September 4, 2013
This article was contributed by John Stultz
As part of the Android + Graphics micro-conference at the
2013 Linux Plumbers
Conference, we'll be discussing
the
ION memory allocator and how its
functionality might be upstreamed to the mainline
kernel. Since time will be limited, I
wanted to create some background documentation to try to provide context
to the issues we will discuss and try to resolve at the micro-conference.
ION overview
The main goal of Android's ION subsystem is to allow for the allocation
and sharing of buffers between hardware devices and user space in order
to enable zero-copy memory sharing between devices.
This sounds simple enough, but in practice it's a difficult problem. On
system-on-chip (SoC) hardware, there are usually many different devices that have direct
memory access (DMA). These devices, however, may have different
capabilities and can view and access memory with different constraints.
For example, some devices may handle scatter-gather lists, while others
may be able to only access physically contiguous pages in memory. Some
devices may have access to all of memory, while others may only access a
smaller portion of memory. Finally, some devices might sit behind an
I/O memory management unit (IOMMU), which may require configuration to give
the device access to
specific pages in memory.
If you have a buffer that you want to share with a device, and the
buffer isn't allocated in memory that the device can access, you have to
use bounce
buffers to copy the contents of that memory over to a location where the
other devices can access
it. This can be expensive and greatly hurt performance. So the ability to
allocate a buffer in a location accessible by all the devices
using the buffer is important.
Thus ION provides an interface that allows for centralized allocation of
different "types" of memory (or "heaps").
In current kernels without ION, if you're trying to share memory
between a DRM graphics device and a video4linux (V4L) camera, you
need to be sure to allocate the memory using the subsystem that manages
the most-constrained device. Thus, if the camera is the most constrained
device, you need to do your allocations via the V4L kernel interfaces,
while if the graphics is the most constrained device, you have to do the
allocations via the Graphics Execution Manager (GEM) interfaces.
ION instead provides a single centralized interface that allows
applications to allocate memory that satisfies the required constraints.
One thing that ION doesn't provide, though, is a method for determining what
type of memory satisfies the constraints of the relevant hardware.
This is instead a problem left to the device-specific user-space
implementations doing the allocation ("Gralloc," in the case of Android).
This hard-coded constraint solving isn't ideal, but there are no
better mainline solutions for allocating buffers with GEM and
V4L. User space just has to know what is the most-constrained device. On
mostly static hardware devices, like phones and tablets, this information
is known ahead of time, but this limitation makes ION less suitable for
upstream adoption in its current form.
To share these buffers, ION exports a file descriptor that is linked
to a specific buffer. These file descriptors can then be passed between
applications and to ION-enabled drivers. Initially these were ION-specific
file descriptors, but ION has since been reworked to utilize
dma-buf structures for sharing. One caveat is that, while ION can export
dma-bufs, it won't import dma-bufs exported by other drivers.
ION cache management
Another major role that ION plays as a central buffer allocator and
manager is handling cache maintenance for DMA. Since many devices
maintain their own memory
caches, it's important that, when serializing device and CPU access to shared
memory, those devices and CPUs flush their private caches before letting other devices
access the buffers. Providing a full background on caching is beyond the
scope of this article, so I'll instead point folks to this LWN article
if they are interested in learning more.
ION allows buffer users to set a flag describing the needed cache
behavior on allocations.
This allows those users to specify if mappings to the buffer should be cached
with ION doing the cache maintenance, if the buffers will be uncached but
use write-combining (see this article for details),
or if the buffers will be uncached and managed explicitly via ION's
synchronization ioctl().
In the case where the buffers are cached and ION performs cache
maintenance, ION further tries to allow for optimizations by delaying
the creation of any mappings at mmap() time. Instead, it provides a fault handler
so pages are mapped in only when they are accessed. This method allows ION to
keep track of the changed pages and only flush pages that were
actually touched.
Also, when ION allocates memory for uncached buffers, it is managing
physical pages which aren't mapped into kernel space yet. Since these
buffers may be used by DMA before they are mapped into kernel space, it is
not correct to flush them at mapping time; that could result in data
corruption.
These buffers thus have to be pre-flushed for DMA when they are allocated;
accordingly, another performance optimization ION provides is pre-flushed pools
of pages for DMA. On some systems, flushing memory for DMA on frequent
small buffer allocations is a major performance penalty. Thus ION uses a
page pool, which allows a large pool of uncached pages to be
pre-allocated and flushed all at once, then when smaller allocations are
made they just pick pages from the pool.
Unfortunately both of these optimizations are somewhat problematic from
an upstream perspective.
Delayed mapping creation is problematic because the DMA API uses either
scatter-gather
lists or larger contiguous DMA areas; there isn't a generic
interface to flush a single page. Because of this, when ION tries to
flush only the pages that have been touched, it ends up using the
ARM-specific __dma_page_cpu_to_dev() function, as it was too costly to
iterate across the scatter-gather lists to find the faulted page. The
use of this interface makes ION only buildable on 32-bit ARM systems.
The pre-flushed page pools are also problematic: since these pools
of memory are allocated ahead of time, it's not necessarily clear which
device is going to be using them. Normally, when flushing pages for DMA,
one must specify the device which will access the memory next, so in the
case of a device behind an IOMMU, that IOMMU can be set up so the device can
access those pages. ION gets away with this again by using the 32-bit
ARM-specific __dma_page_cpu_to_dev() interface, which does not
take a device
argument. Thus this further limits ION's ability to function in more
generic environments where IOMMUs are more common.
For Android's uses, this limitation isn't problematic: 32-bit ARM is its
main target and, on Intel systems, memory is coherent and there are fewer
device-specific constraints, so ION isn't needed there. Further, for
Android's use cases, IOMMUs can be statically configured to specific heaps
(boot-time reserved carve-out memory, for example) so it's not necessary to
dynamically reconfigure the IOMMUs. But these limitations are problematic for
getting ION upstream. The problem is that without these optimizations
the performance penalty will be too high, so Android is unlikely to
make use of more upstream-friendly approaches that leave out these
optimizations.
Other ION details
Since ION is a centralized allocator, it has to be somewhat flexible in
order to handle all the various
types of hardware. So ION allows
implementations to define their own heaps beyond the common heaps
provided by default. Also, since many devices can have quirky allocation
rules, such as allocating on specific DIMM banks, ION allows some of the
allocation flags to be defined by the heap implementation.
It also provides an ION_IOC_CUSTOM ioctl() multiplexer which
allows ION implementations to implement their own buffer operations,
such as finer-grained cache management or special allocators. However,
the downside to this is that it makes the ION interface actually quite
hardware-specific — in some cases, specific devices require fairly large
changes to the ION core. As a result, user-space applications that use the
ION interface must be customized to use the specific ION implementation for
the hardware they are running on. Again, this isn't really a problem for
embedded devices where kernels and user space are delivered together, so
strict ABI consistency isn't required, but is an issue for merging
upstream.
This hardware- and implementation-specific nature of ION also brings into
question the viability of the centralized allocator approach ION uses.
In order to enable the various features of all the different
hardware, it basically has hardware-specific interfaces, forcing the
writing of
hardware-specific user-space applications. This removes some of the conceptual
benefit of having a centralized allocator rather than using device-specific
allocators. However, the Android developers have reasoned that, by
having ION be a centralized memory manager, they can reduce the amount
of complex code each device driver has to implement and allow for
optimizations to be made once in the core, rather than over and over in
various drivers of differing quality.
To summarize the issues around ION:
- It does not provide a method to discover device constraints.
- The interface exposes hardware-specific heap IDs to user space.
- The centralized interface isn't sufficiently generic for all devices, so
it exposes an ioctl() multiplexer for device-specific options.
- ION only imports dma-bufs from itself.
- It doesn't properly use the DMA API, failing to specify a device when
flushing caches for DMA.
- ION only builds on 32-bit ARM systems.
ION compared to current upstream solutions
In some ways GEM is a similar memory allocation and sharing system. It
provides an API for allocating graphics buffers that can be used by an
application to communicate with graphics drivers. Additionally, GEM
provides a way for an application to pass an allocated buffer to another
process. To do this, one uses the DRM_IOCTL_GEM_FLINK operation,
which provides a GEM-specific reference that is conceptually similar to a
file descriptor
that can be passed to another process over a socket. One drawback with
this is that these GEM-specific "flink" references are just a global 32-bit
value, and thus can be guessed by applications which otherwise should not
have access to them. Another problem with GEM-allocated buffers is that
they are specific to the device they were allocated for. Thus, while GEM
buffers could be shared between applications, there is no way to share
GEM buffers between different devices.
With the advent of hybrid graphics implementations (usually discrete
NVIDIA GPUs combined with integrated Intel GPUs), the need for sharing
buffers between devices arose and dma-bufs and PRIME
(a GEM-specific mechanism for sharing buffers between devices) were created.
For the most part, dma-bufs can be considered to be marshaling structures for
buffers. The dma-buf system doesn't provide any method for allocation,
but provides a generic structure that can be used to share buffers
between a number of different devices and applications. The dma-buf
structures are shared to user space using a file descriptor, which avoids
the potential security issues with GEM flink IDs.
The DRM PRIME infrastructure allows drivers to share GEM buffers via
dma-bufs, which allows for things like having the Nouveau driver be able
to render directly into a buffer that the Intel driver will display to
the screen. In this way, GEM and PRIME together provide functionality
similar to that of ION, enabling on more conventional desktop machines
the type of dma-buf-based buffer sharing that ION enables on SoCs.
However, PRIME does not handle any information about what kind of memory
each device can access; it just allows GEM drivers to utilize dma-buf
sharing, on the assumption that all the devices sharing the buffer can
access it.
The V4L subsystem, which is used for cameras and video recorders, also has
integrated dma-buf functionality, allowing camera buffers to be
shared with graphics cards and other devices. It provides its own
allocation interfaces but, like GEM, these interfaces only make sure
that the buffer being allocated works with the device that the driver
manages; they are unaware of the constraints of the other drivers with
which the buffer might be shared.
So with the current upstream approach, in order to share buffers between
devices, user space must know which devices will share the buffer
and which device has the most restrictive constraints; it must then allocate
the buffer using the API of that most-constrained device's driver.
Again, much as in the ION case, user space has no method available to
determine which device is the most constrained.
The upstream issues can thus be summarized this way:
- There is no existing solution for constraint-solving for sharing buffers
between devices.
- There are different allocation APIs for different devices, so, once users
determine the most constrained device, they have to then do allocation
with the matching API for that device.
- The IOMMU and DMA API interfaces do not currently allow for the DMA
optimizations used in ION.
Possible solutions
Previously, when ION has been discussed in the community, there have
been a few potential approaches proposed. Here are the
ones I'm aware of.
One idea would be to try to just merge a centralized ION-like allocator
upstream, keeping a similar interface. To address the problematic
constraint discoverability issue, devices would export an opaque heap
cookie via sysfs and/or via an ioctl(), depending on the device's needs
(devices could have different requirements depending on device-specific
configuration). The meaning of the bits would not be defined to
user space, but could be ANDed together by a user-space application and passed to the
allocator, much as the heap mask is currently with ION. This provides a
way for user space to do the constraint solving but avoids the problem of
fixing heap types into the ABI; it also allows the kernel to define which
bits mean which heap for a given machine, making the interface
more flexible and extensible. This, however, is a more complicated
interface for user space to use, and many do not like the idea of exposing
the constraint information to user space, even in the form of an opaque cookie.
Another possible solution
is to allow dma-buf exporters to not allocate the backing buffers
immediately. This would allow multiple drivers to attach to a dma-buf
before the allocation occurs. Then, when the buffer is first used,
the allocation is done; at that time, the allocator could scan the list of
attached drivers and be able to determine the constraints of the
attached devices and allocate memory accordingly. This would allow
user space to not have to deal with any constraint solving.
While this
approach was planned for when dma-bufs were originally designed, much of
the needed infrastructure is still missing and no drivers yet use this
solution. The Android developers have raised the concern that this sort
of delayed allocation could cause non-deterministic latency in
application hot-paths, though, without an implementation, this has not
yet been quantified. Another downside is that this delayed
allocation isn't required of all dma-buf exporters, so it would only
work with drivers that actually implement this feature.
Since not all of the drivers one would want to share a buffer with may
support delayed allocation, applications would have to somehow detect
the functionality and make sure to allocate memory to be shared using a
dma-buf exporter that does support it. This approach also requires each
exporter driver's allocator to handle this constraint solving
individually (though common helper functions could be provided).
Another possible approach could be to prototype the dma-buf late-allocation
constraint solving using a generic dma-buf exporter. This in
some ways would be ION-like in that it would be a centralized exporter,
but would not expose heap IDs to user space. Then the buffer would be
attached to the various hardware drivers, and, on the first use, the
exporter would determine the attached constraints and allocate the
buffer. This would provide a testing ground for the delayed-allocation
approach above while having some conceptual parallels to ION. The
downside to this approach would be that the centralized interface would
likely not be able to address the more intricate hardware-specific
allocation flags that possibly could be needed.
Finally, none of these proposals address the non-generic caching
optimizations ION uses, so those issues will have to be discussed further.
Conference discussion
I suspect that, at the Linux Plumbers Android + Graphics mini-conference, we
won't find a magic or easy solution for getting Android's ION
functionality upstream. But I hope that, by having key developers from
both the Android team and the upstream kernel community discuss
their needs and constraints, and listen to each other, we
might be able to get a sense of which subproblems have to be addressed
and what direction forward we might take. To this end, I've
created a few questions for folks to think about, in the hope that we can
come up with answers during the discussion:
- Current upstream dma-buf sharing uses, such as PRIME, seem focused on
x86 use cases (such as two devices sharing buffers). Will these interfaces
really scale to ARM-style use cases (many devices sharing buffers) in a
generic fashion? Non-centralized allocation requires exporters to
manage more logic and understand device constraints, and there is a risk
that this approach will eventually become unmaintainable.
- ION's centralized allocation style is problematic in many cases, but
also provides significant performance gains. Is this too major of an
impasse or is there a way forward?
- What other potential solutions haven't yet been considered?
- If a centralized dma-buf allocation API is the way forward, what would
be the best approach (i.e. heap cookies vs. post-attach allocation)?
- Is there any way to implement some of the caching optimizations ION
uses in a way that is also more generically applicable, possibly by
extending the IOMMU and DMA APIs?
- Given Android's needs, what next steps could be done to converge on a
solution? How can we test to see if attach-time solving will be usable
for Android developers? What would it miss that ION still provides?
- How do Android developers plan to deal with IOMMUs and non-32-bit ARM
architecture issues?
Credits
Thanks to Laurent Pinchart, Jesse Barker, Benjamin Gaignard and Dave
Hansen for reviewing and providing feedback on early drafts of this
document, and many thanks to Jon Corbet for his
careful editing.
Patches and updates
Kernel trees
- Sebastian Andrzej Siewior: 3.10.10-rt7 (August 31, 2013)
Page editor: Jonathan Corbet