Brief items
The current development kernel remains 3.3-rc2; there have been no
3.3 prepatches released in the last week.
The stable update picture is somewhat more complicated.
2.6.32.56,
3.0.19,
and 3.2.3 were released on February 3
with a long list of patches.
3.2.4 followed shortly thereafter to fix a
build failure introduced in 3.2.3.
On February 6, 3.0.20 and 3.2.5 were released. These were single-patch updates containing the fix to the ASPM-related problem that would significantly
increase power consumption on some systems. This patch has been treated
with some care: it seems to work, but nobody really knows if it might cause
behavioral problems on some obscure hardware. That said, at this point, it
seems safe enough to have found its way into a stable update.
Comments (none posted)
Note, I also removed the line in the unusedcode.easy file at the same
time, if I shouldn't have done that, let me know and I'll redo these
patches.
If I messed anything up, or the patches need more information within the
body of the changelog, please let me know, and I'll be glad to respin
them.
--
Greg Kroah-Hartman becomes a LibreOffice developer
If we really want to improve the world we should jump into a time
machine and set tabstops to 4.
--
Andrew Morton
10? We have a few cases of variable names over 100 characters (not sure how
you are supposed to use them with an 80-character max line length). Current longest is:
eOpenLogicalChannelAck_reverseLogicalChannelParameters_multiplexParameters_h2250LogicalChannelParameters
at 104 characters.
--
Tony Luck (see include/linux/netfilter/nf_conntrack_h323_types.h)
With kernel 3.1, Christoph removed i_alloc_sem and replaced it with
calls (namely inode_dio_wait() and inode_dio_done()) which are
EXPORT_SYMBOL_GPL() thus they cannot be used by non-GPL file
systems and further inode_dio_wait() was pushed from
notify_change() into the file system ->setattr() method but no
non-GPL file system can make this call.
That means non-GPL file systems cannot exist any more unless they
do not use any VFS functionality related to reading/writing as far
as I can tell or at least as long as they want to implement direct
i/o.
What are commercial file systems meant to do now?
--
Anton Altaparmakov
Comments (14 posted)
Here is a posting on the Intel software network describing the "transactional
synchronization extensions" feature to be found in the future "Haswell"
processor.
With transactional synchronization, the hardware can
determine dynamically whether threads need to serialize through
lock-protected critical sections, and perform serialization only when
required. This lets the processor expose and exploit concurrency that would
otherwise be hidden due to dynamically unnecessary synchronization. At the
lowest level with Intel TSX, programmer-specified code regions (also
referred to as transactional regions) are executed transactionally. If the
transactional execution completes successfully, then all memory operations
performed within the transactional region will appear to have occurred
instantaneously when viewed from other logical processors. A processor
makes architectural updates performed within the region visible to other
logical processors only on a successful commit, a process referred to as an
atomic commit.
Needless to say, there should be interesting ways to use such a feature in
the kernel if it works well, but other projects (PyPy, for example) have
also expressed interest in transactional memory.
Comments (24 posted)
By Jonathan Corbet
February 8, 2012
LWN
wrote briefly about the POHMELFS
filesystem in early 2008; since then, POHMELFS has languished in the
staging tree without much interest or activity. The POHMELFS
developer, Evgeniy Polyakov,
expressed his
unhappiness with the development process and disappeared from the
kernel community for some time.
Now, though, Evgeniy is back with a new
POHMELFS release. He said:
It went a long way from parallel NFS design which lived in
drivers/staging/pohmelfs for years effectively without usage case -
that design was dead.
New pohmelfs uses elliptics network as its storage backend,
which was proved as effective distributed system. Elliptics is used
in production in Yandex search company for several years now and
clusters range from small (like 6 nodes in 3 datacenters to host 15
billions of small files or hundred of nodes to scale to 1 Pb used
for streaming).
This time around, he is asking that the filesystem be merged straight into
the mainline without making a stop in the staging tree. But merging a
filesystem is hard without reviews from the virtual filesystem maintainers,
and no such reviews have yet been done. So Evgeniy may have to wait a
little while longer yet.
Comments (13 posted)
Kernel development news
By Jonathan Corbet
February 7, 2012
The
announcement of the Android merging project
and the return of a number of Android-specific drivers to the kernel's
staging tree were notable events in December, 2011. The most controversial Android change - "wakelocks" or
"suspend blockers" - is not a part of this effort, though. That code is
sufficiently intrusive and sufficiently controversial that nobody seemed to
want to revisit it at this time. Except that, as it turns out, one person
is still trying to make progress on this difficult issue. Rafael Wysocki's
autosleep and wake locks patch set is yet
another attempt to support Android's opportunistic suspend mechanism in a
mainline kernel.
"Opportunistic suspend" is a heavy-handed approach to power management. In
short, whenever nothing of interest is going on, the entire system simply
suspends itself. It is certainly effective on Android devices; in
particular, it prevents poorly-written applications from keeping the system
awake and draining the battery. The hard part is the determination that
nothing interesting is happening; that is the role of the Android
wakelock/suspend blocker mechanism. With suspend blockers, both the kernel
and suitably-privileged user-space code are able to block the normal
suspension of the system, keeping it running for whatever important work is
being done.
Given that suspend blockers do not seem to be headed into the mainline
kernel anytime soon, some sort of alternative mechanism is required if the
mainline is to support opportunistic suspend. As it happens, some pieces
of that solution have been in the mainline for a while; the wakeup events infrastructure was merged for
2.6.36. Wakeup events track events (a button press, for example) that can
wake the system or keep it awake. "Wakeup sources," which track sources of
wakeup events, were merged for 2.6.37. Thus far, the wakeup events
subsystem remains lightly used in the kernel; few drivers actually signal
such events. Wakeup sources are almost entirely unused.
Rafael's patch set makes some significant changes that employ this
infrastructure to support "autosleep," which is another word for
"opportunistic suspend." (Rafael says: "This series tests the theory
that the easiest way to sell a once rejected feature is to advertise it
under a different name"). The first of those adds a new file to sysfs
called /sys/power/autosleep; writing "mem" to this file
will cause the system to suspend whenever there are no active wakeup
sources. One can also write "disk", with the result that the
system will opportunistically hibernate; this feature may see rather less
real-world use, but it was an easy addition to make.
The Android system tracks the time that suspend blockers prevent the system
from suspending; that information is then used in the "why is my battery
dead?" screen. Rafael's patch adds a similar tracking feature and exports
this time (as prevent_sleep_time) in
/sys/kernel/debug/wakeup_sources.
One little problem remains, though: wakeup sources are good for tracking
kernel-originated events, but they do not provide any way for user space to
indicate that the system should not sleep. What's needed, clearly, is a
mechanism with which user space can create its own wakeup sources. The
final patch in Rafael's series adds just such a feature. An application
can write a name (and an optional timeout) to /sys/power/wake_lock
to establish a new, active wakeup source. That source will prevent system
suspend until either its timeout expires or the same name is written to
/sys/power/wake_unlock.
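On a kernel with this patch set applied, the whole mechanism would be driven from user space by plain writes to the sysfs files described above; a sketch of a session (writes to these files require root, and the files only exist with Rafael's patches):

```shell
# Enable opportunistic suspend: the system will suspend to RAM
# whenever no wakeup source is active.
echo mem > /sys/power/autosleep

# Create an active wakeup source named "music-playback"; an optional
# timeout may follow the name.  The system stays awake while the
# source exists.
echo music-playback > /sys/power/wake_lock

# ... do the work that must not be interrupted by a suspend ...

# Release the wake lock; the system is free to autosleep again.
echo music-playback > /sys/power/wake_unlock
```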
It is easy to see that this mechanism can be used to implement Android's
race-free opportunistic suspend. A driver receiving a wakeup event will
mark the associated wakeup source as active, keeping the system running.
That source
will stay active until user space has consumed the event. But, before
doing so, the user-space application takes a "wake lock" of its own,
ensuring that it will be able to complete its processing before the system
goes back to sleep.
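The accounting behind that handoff reduces to a counter of active sources. The following user-space mock (the function names are invented for illustration; this is not the kernel API) shows why the sequence is race-free: the driver's source stays active until after user space has taken its own lock, so the count never reaches zero mid-handoff.

```c
/* User-space mock of wakeup-source accounting; not the kernel API. */
static int active_sources;   /* sources currently keeping the system awake */

void source_activate(void)   { active_sources++; }
void source_deactivate(void) { active_sources--; }

/* What an autosleep worker would check before suspending. */
int can_autosleep(void) { return active_sources == 0; }

/* The race-free handoff: the driver's wakeup source stays active
 * until user space has taken a "wake lock" of its own. */
void driver_wakeup_event(void)    { source_activate(); }
void user_take_wake_lock(void)    { source_activate(); }
void driver_event_consumed(void)  { source_deactivate(); }
void user_release_wake_lock(void) { source_deactivate(); }
```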
Those who have been paying attention to this controversy will have noted
that the API for this feature looks suspiciously like the native Android
API. Needless to say, that is not a coincidence; the idea is to make it as
easy as possible to switch over to the new mechanism without breaking
Android systems. If that goal can be achieved, then, even if Android
itself never moves to this implementation, it should be that much easier to
run an Android user space on a mainline kernel.
And that, of course, will be the ultimate proof of this patch set. If
somebody is able to demonstrate an Android system running with native
opportunistic suspend, with similar power consumption characteristics, then
it's a lot more likely that this patch will succeed where so many others
have failed. Arranging such a demonstration will not be entirely easy,
but, on the right hardware, it is certainly possible. Linaro's Android build
for the Pandaboard might be a good starting point. Until that happens,
getting an Android-compatible opportunistic suspend implementation into the
mainline could be challenging.
Comments (none posted)
By Jonathan Corbet
February 8, 2012
Last June, LWN
looked at a set of memory power
management patches intended to allow the system to force unused banks
of memory into partial-array self-refresh (PASR) mode.
PASR makes the memory unavailable to the CPU, but it preserves the
memory's contents while reducing power consumption.
Last year's patch added a
new layer of zones to the page allocator with the idea that specific zones
- which corresponded to the regions of memory with independent PASR control
- could be vacated and powered down when memory use was light. That patch
set has not since returned, possibly because developers were worried about
the (significant) overhead of adding another layer of zones to the system.
Since then, things have been quiet on the memory power management front.
Recently, though, a new and seemingly unrelated PASR
patch set was posted to linux-kernel by Maxime Coquelin. This version
adds no new zones;
instead, it works at a lower level beneath the buddy allocator.
The first step is to boot the kernel with the new ddr_die=
parameter, describing the physical layout of the system's memory. Another
parameter (interleaved) must be used if physically-interleaved
memory is present on the system. It would, of course, be nice to obtain
this information directly from the hardware, but, in the embedded world
where Maxime works, such mechanisms, if they are present at all, must be
implemented on a per-subarchitecture or per-board basis. The final patch in the series does
add built-in support for the Ux500 system in a "board support" file, but
that is the only system supported without boot-time parameters at this
early stage.
For each region defined at boot time, the PASR code sets up a
pasr_section structure:
struct pasr_section {
	phys_addr_t start;
	struct pasr_section *pair;
	unsigned long free_size;
	spinlock_t *lock;
	struct pasr_die *die;
};
The key value here is free_size, which tracks how many free pages
exist within this section. When the kernel allocates a page for use, it
must tell the PASR code about it with a call to:
void pasr_kget(struct page *page, int order);
Pages that are freed should be marked with:
void pasr_kput(struct page *page, int order);
To a first approximation, these functions just increment and decrement
free_size. If free_size reaches the size of the section,
there are no used pages within that section and it can be powered down. As
soon as the first page is allocated from such a section, it must be powered
back up.
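A user-space mock of that accounting (with simplified, invented types standing in for the pasr_section structure shown above; in the real patch set the power switching would be done by the DDR controller driver):

```c
#include <stddef.h>

#define PAGE_SIZE 4096UL

/* Simplified stand-in for struct pasr_section. */
struct mock_section {
	unsigned long free_size;   /* bytes currently free in the section */
	unsigned long total_size;  /* free_size == total_size means no
	                            * page in the section is in use */
	int powered;               /* 1 = full refresh, 0 = PASR */
};

/* Mock of pasr_kget(): 2^order pages were just allocated. */
void mock_pasr_kget(struct mock_section *s, int order)
{
	if (s->free_size == s->total_size)
		s->powered = 1;   /* first page in use: power the section up */
	s->free_size -= PAGE_SIZE << order;
}

/* Mock of pasr_kput(): 2^order pages were just freed. */
void mock_pasr_kput(struct mock_section *s, int order)
{
	s->free_size += PAGE_SIZE << order;
	if (s->free_size == s->total_size)
		s->powered = 0;   /* wholly free: eligible for self-refresh */
}
```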
Adding this accounting to the memory management code is just a matter of
adding a few pasr_kget() and pasr_kput() calls to the
buddy allocator. Most other allocations in the kernel have their ultimate
source in the buddy allocator, so this approach will catch most of the
memory allocation traffic in the system - though it could be somewhat
fooled by unused pages held by the slab allocator. There is no integration
with "carveout-style" allocators like ION or CMA, but that could certainly
be added at some point.
One piece that is missing, though, is the mechanism by which a memory
section becomes entirely free and eligible for PASR. The kernel tends to
spread pages of data throughout memory, and it does not drop them without a
specific reason to do so; a typical system shows almost no "free" pages at
all even if it is not currently doing anything. The intent is to use the feature in conjunction
with a "page cache flush governor," but that code does not exist at this
time. There was also talk of setting up a large "movable" zone and using
the compaction code to create large, free chunks within that zone.
The other thing that is missing at this point is any kind of measurement of
how much power is actually saved using PASR. That will certainly need to
be provided before this code can be considered for inclusion. Meanwhile,
it has the appearance of a less-intrusive PASR capability that might just
get past the roadblocks that stopped its predecessor.
Comments (none posted)
February 8, 2012
This article was contributed by Thomas M. Zeng
Back in December 2011, LWN reviewed the list of Android
kernel patches in the linux-next staging directory. The merging of these drivers,
one of which is a memory allocator called PMEM, holds the promise that the
mainline kernel release can one day boot an Android user space.
Since then, it has become clear
that PMEM is considered obsolete and
will be replaced by the ION memory manager.
ION is a generalized memory manager that Google introduced in the Android 4.0
ICS (Ice Cream Sandwich) release to address the issue of
fragmented memory management interfaces across different Android devices. There are at least three, probably more, PMEM-like interfaces.
On Android devices using NVIDIA Tegra, there is "NVMAP";
on Android devices using TI OMAP, there is "CMEM";
and on Android devices using Qualcomm MSM, there is "PMEM".
All three SoC vendors are in the process of switching to ION.
This article takes a look at ION, summarizing its interfaces to user space
and to kernel-space drivers. Besides being a memory pool manager, ION also enables its clients to share buffers,
hence it treads the same ground as
the
DMA buffer sharing framework from Linaro (DMABUF). This article will end with a comparison of the two buffer sharing schemes.
ION heaps
Like its PMEM-like predecessors, ION manages one or more memory pools, some of which are set
aside at boot time to combat fragmentation or to serve special hardware needs.
GPUs, display controllers, and cameras are some of the hardware blocks that
may have special memory requirements.
ION presents its memory pools as ION heaps. Each type of Android device can be
provisioned with a different set of ION heaps according to the memory
requirements of the device.
The provider of an ION heap must implement the following set of callbacks:
struct ion_heap_ops {
	int (*allocate)(struct ion_heap *heap, struct ion_buffer *buffer,
			unsigned long len, unsigned long align,
			unsigned long flags);
	void (*free)(struct ion_buffer *buffer);
	int (*phys)(struct ion_heap *heap, struct ion_buffer *buffer,
		    ion_phys_addr_t *addr, size_t *len);
	struct scatterlist *(*map_dma)(struct ion_heap *heap,
				       struct ion_buffer *buffer);
	void (*unmap_dma)(struct ion_heap *heap, struct ion_buffer *buffer);
	void *(*map_kernel)(struct ion_heap *heap, struct ion_buffer *buffer);
	void (*unmap_kernel)(struct ion_heap *heap, struct ion_buffer *buffer);
	int (*map_user)(struct ion_heap *heap, struct ion_buffer *buffer,
			struct vm_area_struct *vma);
};
Briefly,
allocate() and
free() obtain or release an
ion_buffer object from the heap.
A call to
phys() will return the physical address and length of the buffer, but only for physically-contiguous buffers.
If the heap does not provide physically contiguous buffers, it does not have to provide this callback. Here
ion_phys_addr_t
is a typedef of
unsigned long, and will, someday, be replaced by
phys_addr_t in
include/linux/types.h.
The
map_dma() and
unmap_dma() callbacks cause the buffer
to be prepared (or unprepared) for DMA. The
map_kernel() and
unmap_kernel() callbacks map (or unmap) the physical memory into the
kernel virtual address space. A call to
map_user() will map the
memory to user space. There is no
unmap_user() because the
mapping is represented as a file descriptor in user space. The
closing of that file descriptor will cause the memory to be unmapped from
the calling process.
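To make the division of labor concrete, here is a toy user-space mock of how a carveout-style heap might back the allocate(), free(), and phys() callbacks. All names and types here are invented stand-ins, not the real ION structures, and the first-fit chunk scan is only the shape of the job (the real carveout heap draws from a pre-reserved pool):

```c
#include <stddef.h>

/* Simplified stand-ins for the real ION structures. */
struct mock_buffer { unsigned long offset; size_t len; };

#define NCHUNK 64
#define CHUNK  4096UL

struct mock_carveout_heap {
	unsigned long base;       /* physical base of the reserved region */
	char used[NCHUNK];        /* one byte per fixed-size chunk */
};

/* allocate(): first-fit over fixed-size chunks. */
int mock_allocate(struct mock_carveout_heap *h, struct mock_buffer *buf,
		  size_t len)
{
	size_t need = (len + CHUNK - 1) / CHUNK;

	for (size_t i = 0; i + need <= NCHUNK; i++) {
		size_t j = 0;

		while (j < need && !h->used[i + j])
			j++;
		if (j < need)
			continue;        /* free run too short; keep looking */
		for (j = 0; j < need; j++)
			h->used[i + j] = 1;
		buf->offset = i * CHUNK;
		buf->len = need * CHUNK;
		return 0;
	}
	return -1;                       /* would be -ENOMEM in the kernel */
}

/* free(): return the chunks to the pool. */
void mock_free(struct mock_carveout_heap *h, struct mock_buffer *buf)
{
	for (size_t i = 0; i < buf->len / CHUNK; i++)
		h->used[buf->offset / CHUNK + i] = 0;
}

/* phys(): trivial for a carveout heap, since its buffers are
 * physically contiguous by construction. */
int mock_phys(struct mock_carveout_heap *h, struct mock_buffer *buf,
	      unsigned long *addr, size_t *len)
{
	*addr = h->base + buf->offset;
	*len = buf->len;
	return 0;
}
```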
The default ION driver (which can be cloned from here) offers three heaps:
- ION_HEAP_TYPE_SYSTEM: memory allocated via vmalloc_user().
- ION_HEAP_TYPE_SYSTEM_CONTIG: memory allocated via kzalloc().
- ION_HEAP_TYPE_CARVEOUT: physically-contiguous memory set aside at boot time.
Developers may choose to add more ION heaps. For example,
this NVIDIA patch was
submitted to add ION_HEAP_TYPE_IOMMU for hardware blocks equipped with an IOMMU.
Using ION from user space
Typically, user space device access libraries will use ION to allocate large contiguous media buffers.
For example, the still camera library
may allocate a capture buffer to be used by the camera device. Once the
buffer is fully populated with video data,
the library can pass the buffer to the kernel
to be processed by a JPEG encoder hardware block.
A user space C/C++ program must have been granted access to the /dev/ion device before
it can allocate memory from ION.
A call to
open("/dev/ion", O_RDONLY) returns a file descriptor as a handle representing
an ION client. Yes, one can allocate writable memory with an O_RDONLY open.
There can be no more than one client per user process. To allocate a buffer,
the client needs to fill in all the fields except the handle field in
this data structure:
struct ion_allocation_data {
	size_t len;
	size_t align;
	unsigned int flags;
	struct ion_handle *handle;
};
The
handle field is the output parameter, while the first three fields specify
the length, alignment, and flags as input parameters. The
flags field is a bit mask
indicating one or more ION heaps to allocate from, with the fallback ordered according to which ION heap was
first added via calls to
ion_device_add_heap() during boot.
In the default implementation, ION_HEAP_TYPE_CARVEOUT is added before
ION_HEAP_TYPE_SYSTEM_CONTIG, so a flags value of
ION_HEAP_TYPE_SYSTEM_CONTIG | ION_HEAP_TYPE_CARVEOUT indicates the
intention to allocate from ION_HEAP_TYPE_CARVEOUT, with fallback to
ION_HEAP_TYPE_SYSTEM_CONTIG.
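The fallback semantics can be modeled with a few lines of C. In this mock the bit values and names are invented for illustration (they are not the real ION_HEAP_TYPE_* numbers); heaps are tried in the order they were added at boot, and the first heap whose bit is set in flags wins:

```c
/* Invented heap bits; not the real ION_HEAP_TYPE_* values. */
enum mock_heap {
	HEAP_CARVEOUT = 1 << 0,   /* added first at boot */
	HEAP_CONTIG   = 1 << 1,
	HEAP_SYSTEM   = 1 << 2,   /* added last */
};

/* The order in which ion_device_add_heap() was called at boot. */
static const int heap_add_order[] = { HEAP_CARVEOUT, HEAP_CONTIG, HEAP_SYSTEM };

/* Return the first heap, in boot-registration order, whose bit is
 * set in the allocation flags; -1 if no provisioned heap matches. */
int mock_pick_heap(unsigned int flags)
{
	for (unsigned int i = 0;
	     i < sizeof(heap_add_order) / sizeof(heap_add_order[0]); i++)
		if (heap_add_order[i] & flags)
			return heap_add_order[i];
	return -1;
}
```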
User-space clients interact with ION using the ioctl() system call interface.
To allocate a buffer, the client makes this call:
int ioctl(int client_fd, ION_IOC_ALLOC, struct ion_allocation_data *allocation_data)
This call returns a buffer represented by
ion_handle which is not a CPU-accessible
buffer pointer. The handle can only be used to obtain a file descriptor for buffer
sharing as follows:
int ioctl(int client_fd, ION_IOC_SHARE, struct ion_fd_data *fd_data);
Here
client_fd is the file descriptor corresponding to
/dev/ion, and
fd_data is a data structure with an input
handle field and an output
fd field, as defined below:
struct ion_fd_data {
	struct ion_handle *handle;
	int fd;
};
The
fd field is the file descriptor that can be
passed around for sharing. On Android devices the
BINDER IPC mechanism
may be used to send
fd to another process for sharing.
To obtain the shared buffer, the second user process must obtain
a client handle first via the
open("/dev/ion", O_RDONLY) system call.
ION tracks its user space clients by the PID of the process (specifically, the PID
of the thread that is the "group leader" in the process). Repeating the
open("/dev/ion", O_RDONLY) call in the same process will get back
another file descriptor corresponding to the same client structure in
the kernel.
To free the buffer, the second client needs to undo the effect of mmap() with a
call to munmap(), and the first client needs to close the file descriptor it obtained
via ION_IOC_SHARE, and call ION_IOC_FREE as follows:
int ioctl(int client_fd, ION_IOC_FREE, struct ion_handle_data *handle_data);
Here
ion_handle_data holds the handle as shown below:
struct ion_handle_data {
	struct ion_handle *handle;
};
The
ION_IOC_FREE command causes the handle's reference counter
to be decremented by one. When this reference counter reaches zero, the
ion_handle object
gets destroyed and the affected ION bookkeeping data structure is updated.
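The user-space sequence described in this section (open a client, allocate, obtain a shareable descriptor, free) can be sketched as below. The struct layouts are the ones shown in this section, but note that the ION_IOC_* request numbers here are placeholders; real code must use the definitions from the ION header in the Android kernel tree.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct ion_handle;                  /* opaque token, never dereferenced */

struct ion_allocation_data {
	size_t len;
	size_t align;
	unsigned int flags;
	struct ion_handle *handle;  /* out */
};

struct ion_fd_data {
	struct ion_handle *handle;  /* in */
	int fd;                     /* out: descriptor to pass around */
};

struct ion_handle_data {
	struct ion_handle *handle;
};

/* PLACEHOLDER request numbers -- not the real ioctl values. */
#define ION_IOC_ALLOC 0
#define ION_IOC_SHARE 1
#define ION_IOC_FREE  2

/* Returns the client fd (>= 0) on success, with the shareable fd in
 * *shared_fd and the handle (for a later ION_IOC_FREE) in *handle;
 * returns -1 on any failure, including systems without /dev/ion. */
int ion_alloc_and_share(size_t len, unsigned int heap_flags,
			int *shared_fd, struct ion_handle **handle)
{
	struct ion_allocation_data alloc = {
		.len = len, .align = 4096, .flags = heap_flags,
	};
	struct ion_fd_data share;
	struct ion_handle_data to_free;
	int client = open("/dev/ion", O_RDONLY);

	if (client < 0)
		return -1;                 /* no ION on this system */
	if (ioctl(client, ION_IOC_ALLOC, &alloc) < 0)
		goto out_close;
	share.handle = alloc.handle;
	if (ioctl(client, ION_IOC_SHARE, &share) < 0)
		goto out_free;

	*shared_fd = share.fd;             /* send to other processes */
	*handle = alloc.handle;
	return client;                     /* keep open while buffer lives */

out_free:
	to_free.handle = alloc.handle;
	ioctl(client, ION_IOC_FREE, &to_free);
out_close:
	close(client);
	return -1;
}
```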
User processes can also share ION buffers with a kernel driver, as explained
in the next section.
Sharing ION buffers in the kernel
In the kernel, ION supports multiple clients, one for each driver that uses the ION functionality.
A kernel driver calls the following function to obtain an ION client handle:
struct ion_client *ion_client_create(struct ion_device *dev,
unsigned int heap_mask, const char *debug_name)
The first argument, dev, is the global ION device associated with
/dev/ion; why a global device is needed, and why it must be passed
as a parameter, is not entirely clear. The second argument,
heap_mask, selects one or more ION heaps
in the same way as the flags field of
ion_allocation_data, which was covered in the previous section.
For smart phone use cases involving multimedia middleware,
the user process typically allocates the buffer from ION, obtains a file descriptor using
the ION_IOC_SHARE command, then passes the file descriptor to a
kernel driver.
The kernel driver calls
ion_import_fd() which converts the file descriptor to an ion_handle object,
as shown below:
struct ion_handle *ion_import_fd(struct ion_client *client, int fd_from_user);
The
ion_handle object is the driver's client-local reference to
the shared buffer. The
ion_import_fd() call looks up the physical address of the buffer to see whether the client
has obtained a handle to the same buffer before, and if it has, this call simply increments
the reference counter of the existing handle.
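The lookup-or-get behavior described here can be mocked in a few lines (all names below are invented stand-ins, not the ION kernel API): if the client already holds a handle for the buffer, take another reference instead of creating a second handle.

```c
#include <stddef.h>

struct mock_handle { const void *buffer; int refs; };

#define MAX_HANDLES 8
struct mock_client { struct mock_handle handles[MAX_HANDLES]; int n; };

/* Mock of the lookup ion_import_fd() is described as doing. */
struct mock_handle *mock_import(struct mock_client *c, const void *buffer)
{
	for (int i = 0; i < c->n; i++) {
		if (c->handles[i].buffer == buffer) {
			c->handles[i].refs++;   /* existing handle: get */
			return &c->handles[i];
		}
	}
	if (c->n == MAX_HANDLES)
		return NULL;                    /* client handle table full */
	c->handles[c->n].buffer = buffer;
	c->handles[c->n].refs = 1;              /* first import: new handle */
	return &c->handles[c->n++];
}
```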
Some hardware blocks can only operate on physically-contiguous buffers with
physical addresses, so
affected drivers need to convert ion_handle to a physical buffer via this call:
int ion_phys(struct ion_client *client, struct ion_handle *handle,
ion_phys_addr_t *addr, size_t *len)
Needless to say, if the buffer is not physically contiguous, this call will
fail.
When handling calls from a client, ION always validates
the input file descriptor, client and handle arguments. For example, when importing a file descriptor, ION
ensures the file descriptor was indeed created by an ION_IOC_SHARE
command.
When ion_phys() is called, ION validates whether the buffer handle belongs to the list of handles
the client is allowed to access, and returns error if the handle is not on the list.
This validation mechanism reduces the likelihood of unwanted accesses and inadvertent resource
leaks.
ION provides debug visibility through debugfs. It organizes debug information under /sys/kernel/debug/ion,
with bookkeeping information stored in files associated with heaps and
clients, identified by symbolic names or PIDs.
Comparing ION and DMABUF
ION and DMABUF share some common concepts. The dma_buf concept
is similar to ion_buffer, while dma_buf_attachment
serves a similar
purpose as ion_handle. Both ION and DMABUF use anonymous file descriptors
as the objects that can be passed around to provide reference-counted access to shared buffers.
On the other hand, ION focuses on allocating and freeing memory from provisioned
memory pools in a manner that can be shared and tracked, while DMABUF focuses
more on buffer importing, exporting and synchronization in a manner that is consistent with buffer sharing
solutions on non-ARM architectures.
The following table presents a feature comparison between ION and DMABUF:
Memory Manager Role
  ION: ION replaces PMEM as the manager of provisioned memory pools. The
  list of ION heaps can be extended per device.
  DMABUF: DMABUF is a buffer sharing framework, designed to integrate
  with the memory allocators in DMA mapping frameworks, like the
  work-in-progress DMA-contiguous allocator, also known as the Contiguous
  Memory Allocator (CMA). DMABUF exporters have the option to implement
  custom allocators.

User Space Access Control
  ION: ION offers the /dev/ion interface for user-space programs to
  allocate and share buffers. Any user program with ION access can
  cripple the system by depleting the ION heaps. Android checks user and
  group IDs to block unauthorized access to ION heaps.
  DMABUF: DMABUF offers only kernel APIs. Access control is a function of
  the permissions on the devices using the DMABUF feature.

Global Client and Buffer Database
  ION: ION contains a device driver associated with /dev/ion. The device
  structure contains a database that tracks the allocated ION buffers,
  handles, and file descriptors, all grouped by user clients and kernel
  clients. ION validates all client calls according to the rules of the
  database. For example, there is a rule that a client cannot have two
  handles to the same buffer.
  DMABUF: The DMA debug facility implements a global hashtable,
  dma_entry_hash, to track DMA buffers, but only when the kernel is built
  with the CONFIG_DMA_API_DEBUG option.

Cross-architecture Usage
  ION: ION usage today is limited to architectures that run the Android
  kernel.
  DMABUF: DMABUF usage is cross-architecture. The DMA mapping redesign
  preparation patch set modified the DMA mapping code in nine
  architectures besides ARM.

Buffer Synchronization
  ION: ION considers buffer synchronization to be an orthogonal problem.
  DMABUF: DMABUF provides a pair of APIs for synchronization. The buffer
  user calls dma_buf_map_attachment() whenever it wants to use the buffer
  for DMA; once the DMA for the current buffer user is over, it signals
  "end-of-DMA" to the exporter via a call to dma_buf_unmap_attachment().

Delayed Buffer Allocation
  ION: ION allocates the physical memory before the buffer is shared.
  DMABUF: DMABUF can defer the allocation until the first call to
  dma_buf_map_attachment(). The exporter of a DMA buffer has the
  opportunity to scan all client attachments, collate their buffer
  constraints, then choose the appropriate backing storage.
ION and DMABUF can be separately integrated into multimedia applications written using
the Video4Linux2 API.
In the case of ION, such multimedia
programs tend to use PMEM on Android devices today, so switching from PMEM
to ION should have a relatively small impact.
Integrating DMABUF into Video4Linux2 is another story.
It has taken ten revisions of the patch series to integrate the videobuf2
mechanism with DMABUF; in fairness, many of those revisions were the
result of changes to DMABUF as that interface stabilized.
The effort should pay dividends in the long run because the DMABUF-based
sharing mechanism is designed with DMA mapping hooks for CMA and IOMMU.
CMA and IOMMU hold the promise of reducing the amount of carveout memory needed to build an Android smartphone.
In this email,
Andrew Morton urged the completion of the patch review process so that CMA could make it in during the 3.4 merge window.
Even though ION and DMABUF serve similar purposes, the two are not mutually exclusive.
The Linaro Unified Memory Management team has started to integrate CMA into ION.
To reach the state where a release of the mainline kernel can boot the Android user space, the /dev/ion interface to user space must obviously be preserved.
In the kernel though, ION drivers may be able to use
some of the DMABUF APIs to hook into CMA and IOMMU to take advantage
of the capabilities offered by those subsystems. Conversely, DMABUF might be able to leverage ION to present a unified interface to user space,
especially to the Android user space.
DMABUF may also benefit from adopting some of the ION heap debugging features in order to become more developer friendly.
Thus far, many signs indicate that Linaro, Google, and the kernel community are working together to bring the combined strength of ION and DMABUF to the mainline kernel.
Comments (7 posted)
Patches and updates
- Lucas De Marchi: kmod 5.
(February 6, 2012)
Page editor: Jonathan Corbet