Kernel development
Brief items
Kernel release status
The current development kernel remains 3.3-rc2; there have been no 3.3 prepatches released in the last week.

The stable updates picture is somewhat more complicated. 2.6.32.56, 3.0.19, and 3.2.3 were released on February 3 with a long list of patches. 3.2.4 followed shortly thereafter to fix a build failure introduced in 3.2.3.
On February 6, 3.0.20 and 3.2.5 were released. These were single-patch updates containing the fix to the ASPM-related problem that would significantly increase power consumption on some systems. This patch has been treated with some care: it seems to work, but nobody really knows if it might cause behavioral problems on some obscure hardware. That said, at this point, it seems safe enough to have found its way into a stable update.
Quotes of the week
If I messed anything up, or the patches need more information within the body of the changelog, please let me know, and I'll be glad to respin them.
eOpenLogicalChannelAck_reverseLogicalChannelParameters_multiplexParameters_h2250LogicalChannelParameters
at 104 characters.
That means non-GPL file systems cannot exist any more unless they do not use any VFS functionality related to reading/writing as far as I can tell or at least as long as they want to implement direct i/o.
What are commercial file systems meant to do now?
Intel's upcoming transactional memory feature
Here is a posting on the Intel software network describing the "transactional synchronization extensions" feature to be found in the future "Haswell" processor.
Needless to say, there should be interesting ways to use such a feature in the kernel if it works well, but other projects (PyPy, for example) have also expressed interest in transactional memory.
POHMELFS returns
LWN wrote briefly about the POHMELFS filesystem in early 2008; thereafter, POHMELFS has languished in the staging tree without much interest or activity. The POHMELFS developer, Evgeniy Polyakov, expressed his unhappiness with the development process and disappeared from the kernel community for some time.

Now, though, Evgeniy is back with a new POHMELFS release. He said:
New pohmelfs uses elliptics network as its storage backend, which was proved as effective distributed system. Elliptics is used in production in Yandex search company for several years now and clusters range from small (like 6 nodes in 3 datacenters to host 15 billions of small files or hundred of nodes to scale to 1 Pb used for streaming).
This time around, he is asking that the filesystem be merged straight into the mainline without making a stop in the staging tree. But merging a filesystem is hard without reviews from the virtual filesystem maintainers, and no such reviews have yet been done. So Evgeniy may have to wait a little while longer yet.
Kernel development news
Autosleep and wake locks
The announcement of the Android merging project and the return of a number of Android-specific drivers to the kernel's staging tree were notable events in December, 2011. The most controversial Android change - "wakelocks" or "suspend blockers" - is not a part of this effort, though. That code is sufficiently intrusive and sufficiently controversial that nobody seemed to want to revisit it at this time. Except that, as it turns out, one person is still trying to make progress on this difficult issue. Rafael Wysocki's autosleep and wake locks patch set is yet another attempt to support Android's opportunistic suspend mechanism in a mainline kernel.

"Opportunistic suspend" is a heavy-handed approach to power management. In short, whenever nothing of interest is going on, the entire system simply suspends itself. It is certainly effective on Android devices; in particular, it prevents poorly-written applications from keeping the system awake and draining the battery. The hard part is the determination that nothing interesting is happening; that is the role of the Android wakelock/suspend blocker mechanism. With suspend blockers, both the kernel and suitably-privileged user-space code are able to block the normal suspension of the system, keeping it running for whatever important work is being done.
Given that suspend blockers do not seem to be headed into the mainline kernel anytime soon, some sort of alternative mechanism is required if the mainline is to support opportunistic suspend. As it happens, some pieces of that solution have been in the mainline for a while; the wakeup events infrastructure was merged for 2.6.36. Wakeup events track events (a button press, for example) that can wake the system or keep it awake. "Wakeup sources," which track sources of wakeup events, were merged for 2.6.37. Thus far, the wakeup events subsystem remains lightly used in the kernel; few drivers actually signal such events. Wakeup sources are almost entirely unused.
Rafael's patch set makes some significant changes that employ this infrastructure to support "autosleep," which is another word for "opportunistic suspend." (Rafael says: "This series tests the theory that the easiest way to sell a once rejected feature is to advertise it under a different name"). The first of those adds a new file to sysfs called /sys/power/autosleep; writing "mem" to this file will cause the system to suspend whenever there are no active wakeup sources. One can also write "disk", with the result that the system will opportunistically hibernate; this feature may see rather less real-world use, but it was an easy addition to make.
The Android system tracks the time that suspend blockers prevent the system from suspending; that information is then used in the "why is my battery dead?" screen. Rafael's patch adds a similar tracking feature and exports this time (as prevent_sleep_time) in /sys/kernel/debug/wakeup_sources.
One little problem remains, though: wakeup sources are good for tracking kernel-originated events, but they do not provide any way for user space to indicate that the system should not sleep. What's needed, clearly, is a mechanism with which user space can create its own wakeup sources. The final patch in Rafael's series adds just such a feature. An application can write a name (and an optional timeout) to /sys/power/wake_lock to establish a new, active wakeup source. That source will prevent system suspend until either its timeout expires or the same name is written to /sys/power/wake_unlock.
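As a rough illustration of the user-space side, the sketch below formats a wake-lock request and writes it to a caller-supplied path. It is only a sketch under stated assumptions: the helper names (`wake_lock_request()`, `write_string()`, `take_wake_lock()`, `drop_wake_lock()`) are invented for this example, and the "name, space, timeout" layout of the request string is an assumption about the proposed interface rather than something confirmed by the patch set.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build the text written to the wake_lock file: "name", or
 * "name timeout" when a positive timeout is given.  The exact
 * layout (and the timeout's unit) is an assumption made for this
 * sketch.  Returns 0 on success, -1 if buf is too small.
 */
int wake_lock_request(char *buf, size_t len, const char *name,
                      long long timeout)
{
    int n;

    if (timeout > 0)
        n = snprintf(buf, len, "%s %lld", name, timeout);
    else
        n = snprintf(buf, len, "%s", name);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write a string to a file; returns 0 on success, -1 on any failure. */
int write_string(const char *path, const char *text)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (!f)
        return -1;
    ok = fputs(text, f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Activate a named wakeup source by writing to lock_path. */
int take_wake_lock(const char *lock_path, const char *name,
                   long long timeout)
{
    char buf[128];

    if (wake_lock_request(buf, sizeof(buf), name, timeout) != 0)
        return -1;
    return write_string(lock_path, buf);
}

/* Deactivate it again by writing the same name to unlock_path. */
int drop_wake_lock(const char *unlock_path, const char *name)
{
    return write_string(unlock_path, name);
}
```

On a kernel carrying the patch set, an application would presumably call `take_wake_lock("/sys/power/wake_lock", "camera", 0)` before consuming an event and `drop_wake_lock("/sys/power/wake_unlock", "camera")` once its processing is done.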
It is easy to see that this mechanism can be used to implement Android's race-free opportunistic suspend. A driver receiving a wakeup event will mark the associated wakeup source as active, keeping the system running. That source will stay active until user space has consumed the event. But, before doing so, the user-space application takes a "wake lock" of its own, ensuring that it will be able to complete its processing before the system goes back to sleep.
Those who have been paying attention to this controversy will have noted that the API for this feature looks suspiciously like the native Android API. Needless to say, that is not a coincidence; the idea is to make it as easy as possible to switch over to the new mechanism without breaking Android systems. If that goal can be achieved, then, even if Android itself never moves to this implementation, it should be that much easier to run an Android user space on a mainline kernel.
And that, of course, will be the ultimate proof of this patch set. If somebody is able to demonstrate an Android system running with native opportunistic suspend, with similar power consumption characteristics, then it's a lot more likely that this patch will succeed where so many others have failed. Arranging such a demonstration will not be entirely easy, but, on the right hardware, it is certainly possible. Linaro's Android build for the Pandaboard might be a good starting point. Until that happens, getting an Android-compatible opportunistic suspend implementation into the mainline could be challenging.
Memory power management, take 2
Last June, LWN looked at a set of memory power management patches intended to allow the system to force unused banks of memory into partial-array self-refresh (PASR) mode. PASR makes the memory unavailable to the CPU, but it does

For a little while since, things have been quiet on the memory power management front. Recently, though, a new and seemingly unrelated PASR patch set was posted to linux-kernel by Maxime Coquelin. This version adds no new zones; instead, it works at a lower level beneath the buddy allocator.
The first step is to boot the kernel with the new ddr_die= parameter, describing the physical layout of the system's memory. Another parameter (interleaved) must be used if physically-interleaved memory is present on the system. It would, of course, be nice to obtain this information directly from the hardware, but, in the embedded world where Maxime works, such mechanisms, if they are present at all, must be implemented on a per-subarchitecture or per-board basis. The final patch in the series does add built-in support for the Ux500 system in a "board support" file, but that is the only system supported without boot-time parameters at this early stage.
For each region defined at boot time, the PASR code sets up a pasr_section structure:
struct pasr_section {
    phys_addr_t start;
    struct pasr_section *pair;
    unsigned long free_size;
    spinlock_t *lock;
    struct pasr_die *die;
};
The key value here is free_size, which tracks how many free pages exist within this section. When the kernel allocates a page for use, it must tell the PASR code about it with a call to:
void pasr_kget(struct page *page, int order);
Pages that are freed should be marked with:
void pasr_kput(struct page *page, int order);
To a first approximation, these functions just decrement and increment free_size, respectively. If free_size reaches the size of the section, there are no used pages within that section and it can be powered down. As soon as the first page is allocated from such a section, it must be powered back up.
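The accounting described above can be modeled in a few lines of ordinary C. This is a toy user-space model, not code from the patch set: the `model_section` structure, the `model_kget()` and `model_kput()` helpers, and the choice to count in pages are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for one PASR section; names are hypothetical. */
struct model_section {
    size_t section_pages;  /* total pages covered by the section */
    size_t free_pages;     /* pages currently free (cf. free_size) */
    bool powered;          /* false while the section may self-refresh off */
};

/* A section with no used pages can be powered down. */
static void model_update_power(struct model_section *s)
{
    s->powered = s->free_pages < s->section_pages;
}

void model_section_init(struct model_section *s, size_t pages)
{
    s->section_pages = pages;
    s->free_pages = pages;
    model_update_power(s);
}

/* Allocation of 2^order pages: fewer free pages, power the section up. */
int model_kget(struct model_section *s, unsigned int order)
{
    size_t pages = (size_t)1 << order;

    if (pages > s->free_pages)
        return -1;
    s->free_pages -= pages;
    model_update_power(s);
    return 0;
}

/* Release of 2^order pages: power down once everything is free again. */
int model_kput(struct model_section *s, unsigned int order)
{
    size_t pages = (size_t)1 << order;

    if (pages > s->section_pages - s->free_pages)
        return -1;
    s->free_pages += pages;
    model_update_power(s);
    return 0;
}
```

The model makes the central limitation visible: a section only powers down when every one of its pages has been returned, so a single long-lived page keeps the whole section awake.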
Adding this accounting to the memory management code is just a matter of adding a few pasr_kget() and pasr_kput() calls to the buddy allocator. Most other allocations in the kernel have their ultimate source in the buddy allocator, so this approach will catch most of the memory allocation traffic in the system - though it could be somewhat fooled by unused pages held by the slab allocator. There is no integration with "carveout-style" allocators like ION or CMA, but that could certainly be added at some point.
One piece that is missing, though, is the mechanism by which a memory section becomes entirely free and eligible for PASR. The kernel tends to spread pages of data throughout memory, and it does not drop them without a specific reason to do so; a typical system shows almost no "free" pages at all even if it is not currently doing anything. The intent is to use the feature in conjunction with a "page cache flush governor," but that code does not exist at this time. There was also talk of setting up a large "movable" zone and using the compaction code to create large, free chunks within that zone.
The other thing that is missing at this point is any kind of measurement of how much power is actually saved using PASR. That will certainly need to be provided before this code can be considered for inclusion. Meanwhile, it has the appearance of a less-intrusive PASR capability that might just get past the roadblocks that stopped its predecessor.
The Android ION memory allocator
Back in December 2011, LWN reviewed the list of Android kernel patches in the linux-next staging directory. The merging of these drivers, one of which is a memory allocator called PMEM, holds the promise that the mainline kernel release can one day boot an Android user space. Since then, it has become clear that PMEM is considered obsolete and will be replaced by the ION memory manager. ION is a generalized memory manager that Google introduced in the Android 4.0 ICS (Ice Cream Sandwich) release to address the issue of fragmented memory management interfaces across different Android devices. There are at least three, probably more, PMEM-like interfaces. On Android devices using NVIDIA Tegra, there is "NVMAP"; on Android devices using TI OMAP, there is "CMEM"; and on Android devices using Qualcomm MSM, there is "PMEM". All three SoC vendors are in the process of switching to ION.
This article takes a look at ION, summarizing its interfaces to user space and to kernel-space drivers. Besides being a memory pool manager, ION also enables its clients to share buffers, hence it treads the same ground as the DMA buffer sharing framework from Linaro (DMABUF). This article will end with a comparison of the two buffer sharing schemes.
ION heaps
Like its PMEM-like predecessors, ION manages one or more memory pools, some of which are set aside at boot time to combat fragmentation or to serve special hardware needs. GPUs, display controllers, and cameras are some of the hardware blocks that may have special memory requirements. ION presents its memory pools as ION heaps. Each type of Android device can be provisioned with a different set of ION heaps according to the memory requirements of the device. The provider of an ION heap must implement the following set of callbacks:
struct ion_heap_ops {
    int (*allocate) (struct ion_heap *heap,
                     struct ion_buffer *buffer, unsigned long len,
                     unsigned long align, unsigned long flags);
    void (*free) (struct ion_buffer *buffer);
    int (*phys) (struct ion_heap *heap, struct ion_buffer *buffer,
                 ion_phys_addr_t *addr, size_t *len);
    struct scatterlist *(*map_dma) (struct ion_heap *heap,
                                    struct ion_buffer *buffer);
    void (*unmap_dma) (struct ion_heap *heap,
                       struct ion_buffer *buffer);
    void * (*map_kernel) (struct ion_heap *heap,
                          struct ion_buffer *buffer);
    void (*unmap_kernel) (struct ion_heap *heap,
                          struct ion_buffer *buffer);
    int (*map_user) (struct ion_heap *heap, struct ion_buffer *buffer,
                     struct vm_area_struct *vma);
};
Briefly, allocate() and free() obtain or release an ion_buffer object from the heap.
A call to phys() will return the physical address and length of the buffer, but only for physically-contiguous buffers.
If the heap does not provide physically contiguous buffers, it does not have to provide this callback. Here ion_phys_addr_t
is a typedef of unsigned long, and will, someday, be replaced by
phys_addr_t in include/linux/types.h.
The map_dma() and unmap_dma() callbacks cause the buffer
to be prepared (or unprepared) for DMA. The map_kernel() and
unmap_kernel() callbacks map (or unmap) the physical memory into the
kernel virtual address space. A call to map_user() will map the
memory to user space. There is no unmap_user() because the
mapping is represented as a file descriptor in user space. The
closing of that file descriptor will cause the memory to be unmapped from
the calling process.
The default ION driver (which can be cloned from here) offers three heaps as listed below:
- ION_HEAP_TYPE_SYSTEM: memory allocated via vmalloc_user().
- ION_HEAP_TYPE_SYSTEM_CONTIG: memory allocated via kzalloc().
- ION_HEAP_TYPE_CARVEOUT: carveout memory is physically contiguous and set aside at boot.

Developers may choose to add more ION heaps. For example, this NVIDIA patch was submitted to add ION_HEAP_TYPE_IOMMU for hardware blocks equipped with an IOMMU.
Using ION from user space
Typically, user space device access libraries will use ION to allocate large contiguous media buffers. For example, the still camera library may allocate a capture buffer to be used by the camera device. Once the buffer is fully populated with video data, the library can pass the buffer to the kernel to be processed by a JPEG encoder hardware block.

A user space C/C++ program must have been granted access to the /dev/ion device before it can allocate memory from ION. A call to open("/dev/ion", O_RDONLY) returns a file descriptor as a handle representing an ION client. Yes, one can allocate writable memory with an O_RDONLY open. There can be no more than one client per user process. To allocate a buffer, the client needs to fill in all the fields except the handle field in this data structure:
struct ion_allocation_data {
    size_t len;
    size_t align;
    unsigned int flags;
    struct ion_handle *handle;
};
The handle field is the output parameter, while the first three fields specify the length, alignment and flags as input parameters. The flags field is a bit mask indicating one or more ION heaps to allocate from, with the fallback ordered according to which ION heap was first added via calls to ion_device_add_heap() during boot. In the default implementation, ION_HEAP_TYPE_CARVEOUT is added before ION_HEAP_TYPE_SYSTEM_CONTIG. The flags of ION_HEAP_TYPE_SYSTEM_CONTIG | ION_HEAP_TYPE_CARVEOUT indicate the intention to allocate from ION_HEAP_TYPE_CARVEOUT with fallback to ION_HEAP_TYPE_SYSTEM_CONTIG.
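The fallback rule (heaps are tried in the order they were registered, not in the order the caller lists them) can be sketched as a small user-space model. Everything here is hypothetical: the `MODEL_HEAP_*` bits, the `model_heap` structure, and `model_pick_heap()` are invented for illustration and are not ION's own definitions.

```c
#include <stddef.h>

/* Hypothetical one-bit-per-heap flags, standing in for ION's heap types. */
enum {
    MODEL_HEAP_SYSTEM        = 1u << 0,
    MODEL_HEAP_SYSTEM_CONTIG = 1u << 1,
    MODEL_HEAP_CARVEOUT      = 1u << 2,
};

/* Toy heap: its flag bit and how many bytes it can still hand out. */
struct model_heap {
    unsigned int type_bit;
    size_t available;
};

/*
 * Walk the heaps in registration order and return the index of the
 * first one that is both requested in flags and able to satisfy len.
 * Returns -1 if no requested heap can serve the allocation.
 */
int model_pick_heap(const struct model_heap *heaps, size_t nheaps,
                    unsigned int flags, size_t len)
{
    size_t i;

    for (i = 0; i < nheaps; i++) {
        if (!(heaps[i].type_bit & flags))
            continue;            /* caller did not ask for this heap */
        if (heaps[i].available >= len)
            return (int)i;       /* first registered match wins */
    }
    return -1;
}
```

With the carveout heap registered first, a request for `MODEL_HEAP_SYSTEM_CONTIG | MODEL_HEAP_CARVEOUT` is served from the carveout heap while it has room, and only falls back to the contiguous system heap once it does not.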
User-space clients interact with ION using the ioctl() system call interface. To allocate a buffer, the client makes this call:
int ioctl(int client_fd, ION_IOC_ALLOC, struct ion_allocation_data *allocation_data);
This call returns a buffer represented by ion_handle which is not a CPU-accessible buffer pointer. The handle can only be used to obtain a file descriptor for buffer sharing as follows:
int ioctl(int client_fd, ION_IOC_SHARE, struct ion_fd_data *fd_data);
Here client_fd is the file descriptor corresponding to /dev/ion, and fd_data is a data structure with an input handle field and an output fd field, as defined below:
struct ion_fd_data {
    struct ion_handle *handle;
    int fd;
};
The fd field is the file descriptor that can be
passed around for sharing. On Android devices the Binder IPC mechanism
may be used to send fd to another process for sharing.
To obtain the shared buffer, the second user process must obtain
a client handle first via the open("/dev/ion", O_RDONLY) system call.
ION tracks its user space clients by the PID of the process (specifically, the PID
of the thread that is the "group leader" in the process). Repeating the
open("/dev/ion", O_RDONLY) call in the same process will get back
another file descriptor corresponding to the same client structure in
the kernel.
To free the buffer, the second client needs to undo the effect of mmap() with a call to munmap(), and the first client needs to close the file descriptor it obtained via ION_IOC_SHARE, and call ION_IOC_FREE as follows:
int ioctl(int client_fd, ION_IOC_FREE, struct ion_handle_data *handle_data);
Here ion_handle_data holds the handle as shown below:
struct ion_handle_data {
    struct ion_handle *handle;
};
The ION_IOC_FREE command causes the handle's reference counter
to be decremented by one. When this reference counter reaches zero, the ion_handle object
gets destroyed and the affected ION bookkeeping data structure is updated.
User processes can also share ION buffers with a kernel driver, as explained in the next section.
Sharing ION buffers in the kernel
In the kernel, ION supports multiple clients, one for each driver that uses the ION functionality. A kernel driver calls the following function to obtain an ION client handle:
struct ion_client *ion_client_create(struct ion_device *dev,
unsigned int heap_mask, const char *debug_name)
The first argument, dev, is the global ION device associated with /dev/ion; why a global device is needed, and why it must be passed as a parameter, is not entirely clear. The second argument, heap_mask, selects one or more ION heaps in the same way as the flags field of ion_allocation_data, which was covered in the previous section.

For smart phone use cases involving multimedia middleware, the user process typically allocates the buffer from ION, obtains a file descriptor using the ION_IOC_SHARE command, then passes the file descriptor to a kernel driver. The kernel driver calls ion_import_fd() which converts the file descriptor to an ion_handle object, as shown below:
struct ion_handle *ion_import_fd(struct ion_client *client, int fd_from_user);
The ion_handle object is the driver's client-local reference to
the shared buffer. The
ion_import_fd() call looks up the physical address of the buffer to see whether the client
has obtained a handle to the same buffer before, and if it has, this call simply increments
the reference counter of the existing handle.
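The per-client bookkeeping described here (one handle per buffer, with a reference count that import raises and free lowers) can be modeled in plain C. This is a toy model and not ION's implementation: `model_client`, `model_handle`, `model_import()`, and `model_free()` are invented names, and a fixed-size table stands in for whatever data structure ION really uses.

```c
#include <stddef.h>

#define MODEL_MAX_HANDLES 8

/* Toy handle: identifies a buffer by physical address, plus a refcount. */
struct model_handle {
    unsigned long phys;
    int refcount;          /* 0 means the slot is unused */
};

/* Toy client: owns a small table of handles. */
struct model_client {
    struct model_handle handles[MODEL_MAX_HANDLES];
};

/*
 * Import a buffer: if the client already holds a handle to the same
 * buffer, bump its reference count; otherwise claim a free slot.
 * Returns NULL when the table is full.
 */
struct model_handle *model_import(struct model_client *c, unsigned long phys)
{
    struct model_handle *free_slot = NULL;
    size_t i;

    for (i = 0; i < MODEL_MAX_HANDLES; i++) {
        struct model_handle *h = &c->handles[i];

        if (h->refcount > 0 && h->phys == phys) {
            h->refcount++;
            return h;
        }
        if (h->refcount == 0 && !free_slot)
            free_slot = h;
    }
    if (!free_slot)
        return NULL;
    free_slot->phys = phys;
    free_slot->refcount = 1;
    return free_slot;
}

/* Drop one reference; the slot becomes reusable when the count hits zero. */
void model_free(struct model_handle *h)
{
    if (h->refcount > 0 && --h->refcount == 0)
        h->phys = 0;
}
```

Importing the same buffer twice yields the same handle with a count of two, which mirrors the rule that a client never ends up holding two distinct handles to one buffer.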
Some hardware blocks can only operate on physically-contiguous buffers with physical addresses, so affected drivers need to convert the ion_handle to a physical address and length via this call:
int ion_phys(struct ion_client *client, struct ion_handle *handle, ion_phys_addr_t *addr, size_t *len)
Needless to say, if the buffer is not physically contiguous, this call will fail.
When handling calls from a client, ION always validates the input file descriptor, client and handle arguments. For example, when importing a file descriptor, ION ensures the file descriptor was indeed created by an ION_IOC_SHARE command. When ion_phys() is called, ION validates whether the buffer handle belongs to the list of handles the client is allowed to access, and returns an error if the handle is not on the list. This validation mechanism reduces the likelihood of unwanted accesses and inadvertent resource leaks.
ION provides debug visibility through debugfs. It organizes debug information under /sys/kernel/debug/ion, with bookkeeping information stored in files associated with heaps and clients identified by symbolic names or PIDs.
Comparing ION and DMABUF
ION and DMABUF share some common concepts. The dma_buf concept is similar to ion_buffer, while dma_buf_attachment serves a similar purpose as ion_handle. Both ION and DMABUF use anonymous file descriptors as the objects that can be passed around to provide reference-counted access to shared buffers. On the other hand, ION focuses on allocating and freeing memory from provisioned memory pools in a manner that can be shared and tracked, while DMABUF focuses more on buffer importing, exporting and synchronization in a manner that is consistent with buffer sharing solutions on non-ARM architectures.
The following table presents a feature comparison between ION and DMABUF:
| Feature | ION | DMABUF |
|---|---|---|
| Memory Manager Role | ION replaces PMEM as the manager of provisioned memory pools. The list of ION heaps can be extended per device. | DMABUF is a buffer sharing framework, designed to integrate with the memory allocators in DMA mapping frameworks, like the work-in-progress DMA-contiguous allocator, also known as the Contiguous Memory Allocator (CMA). DMABUF exporters have the option to implement custom allocators. |
| User Space Access Control | ION offers the /dev/ion interface for user-space programs to allocate and share buffers. Any user program with ION access can cripple the system by depleting the ION heaps. Android checks user and group IDs to block unauthorized access to ION heaps. | DMABUF offers only kernel APIs. Access control is a function of the permissions on the devices using the DMABUF feature. |
| Global Client and Buffer Database | ION contains a device driver associated with /dev/ion. The device structure contains a database that tracks the allocated ION buffers, handles and file descriptors, all grouped by user clients and kernel clients. ION validates all client calls according to the rules of the database. For example, there is a rule that a client cannot have two handles to the same buffer. | The DMA debug facility implements a global hashtable, dma_entry_hash, to track DMA buffers, but only when the kernel was built with the CONFIG_DMA_API_DEBUG option. |
| Cross-architecture Usage | ION usage today is limited to architectures that run the Android kernel. | DMABUF usage is cross-architecture. The DMA mapping redesign preparation patchset modified the DMA mapping code in 9 architectures besides the ARM architecture. |
| Buffer Synchronization | ION considers buffer synchronization to be an orthogonal problem. | DMABUF provides a pair of APIs for synchronization. The buffer-user calls dma_buf_map_attachment() whenever it wants to use the buffer for DMA. Once the DMA for the current buffer-user is over, it signals 'end-of-DMA' to the exporter via a call to dma_buf_unmap_attachment(). |
| Delayed Buffer Allocation | ION allocates the physical memory before the buffer is shared. | DMABUF can defer the allocation until the first call to dma_buf_map_attachment(). The exporter of the DMA buffer has the opportunity to scan all client attachments, collate their buffer constraints, then choose the appropriate backing storage. |
ION and DMABUF can be separately integrated into multimedia applications written using the Video4Linux2 API. In the case of ION, these multimedia programs tend to use PMEM now on Android devices, so switching to ION from PMEM should have a relatively small impact.
Integrating DMABUF into Video4Linux2 is another story. It has taken ten patches to integrate the videobuf2 mechanism with DMABUF; in fairness, many of these revisions were the result of changes to DMABUF as that interface stabilized. The effort should pay dividends in the long run because the DMABUF-based sharing mechanism is designed with DMA mapping hooks for CMA and IOMMU. CMA and IOMMU hold the promise to reduce the amount of carveout memory that it takes to build an Android smart phone. In this email, Andrew Morton was urging the completion of the patch review process so that CMA can get through the 3.4 merge window.
Even though ION and DMABUF serve similar purposes, the two are not mutually exclusive. The Linaro Unified Memory Management team has started to integrate CMA into ION. To reach the state where a release of the mainline kernel can boot the Android user space, the /dev/ion interface to user space must obviously be preserved. In the kernel though, ION drivers may be able to use some of the DMABUF APIs to hook into CMA and IOMMU to take advantage of the capabilities offered by those subsystems. Conversely, DMABUF might be able to leverage ION to present a unified interface to user space, especially to the Android user space. DMABUF may also benefit from adopting some of the ION heap debugging features in order to become more developer friendly. Thus far, many signs indicate that Linaro, Google, and the kernel community are working together to bring the combined strength of ION and DMABUF to the mainline kernel.
Page editor: Jonathan Corbet
