Kernel development

Brief items

Kernel release status

The 4.3 merge window is still open; see the separate article below for a summary of what has been merged in the last week.

Stable updates: none have been released in the last week.

Comments (none posted)

Quotes of the week

Seriously. When was the last time you heard somebody complain about "I wrote the number 9223372036854775809, and the computer thought I meant '1' - I'm really really upset"

— Linus Torvalds

Documentation can [be], and often is, wrong. The code that has been there for more than two decades determines what the interface and semantics actually are.

— David Miller

Comments (9 posted)

The Linux Test Project has been released for September 2015

The Linux Test Project (LTP) has made a stable release for September 2015. The previous release was in April. This release has a number of new test cases including ones for user namespaces, virtual network interfaces, umount2(), getrandom(), and more. In addition, the network namespace test cases were rewritten and regression tests have been added for inotify, cpuset, futex_wake(), and recvmsg(). We looked at writing LTP test cases back in January.

Full Story (comments: none)

Kernel development news

4.3 Merge window, part 2

By Jonathan Corbet
September 10, 2015

As of this writing, some 10,200 non-merge changesets have been pulled into the mainline repository — 6,200 since last week's summary. The 4.3 development cycle thus looks to be a busy one, even if it doesn't quite match the volume seen in 4.2. Quite a few interesting features have been pulled into the mainline over the last week.

First, though, a couple of items from last week deserve a followup mention:

As predicted, the removal of the ext3 filesystem eventually went through. Linus was worried about the effect of the removal on ext3 users, but was eventually convinced that the ext4 maintainers will continue to support those users without forcing their filesystems forward to the ext4 format.
The disabling of the VM86 feature described last week appears to have been a bit premature; some complaints have made it clear that it's a feature that would be missed. So VM86 will likely come back before the 4.3 kernel is released. Linus had an interesting idea, though: setting the mmap_min_addr parameter to a non-zero value effectively makes VM86 unusable for DOS emulation, so it would be reasonable to disable VM86 in that case. The kernel's default setting is 4,096, and most distributions use a value at least that high, so the end result would be to disable VM86 on the vast majority of systems where it cannot be used anyway.

Other interesting, user-visible activity in the last week includes:

The user-space page-fault handling patch set has been merged at last. The main use case for this feature is live migration of virtualized guests, but others probably exist as well. See Documentation/vm/userfaultfd.txt for more information.
The ambient capabilities work has been merged, changing the way capability inheritance is managed. See this commit message for lots of details.
Support for IPv6 is now built into the kernel by default. Tom Herbert justified this change in the changelog by saying: "IPv6 now has significant traction and any remaining vestiges of IPv6 not being provided parity with IPv4 should be swept away. IPv6 is now core to the Internet and kernel."
The networking layer now has "lightweight tunnel" support. In the networking pull request, Dave Miller said: "I've heard rumblings that the lightweight tunnels infrastructure has been voted networking change of the year. But what do I know?" Indeed it may be a while before any of us know, since this feature appears to be quite thoroughly undocumented. A bit of information does appear in this merge commit, though.
Equally undocumented is the virtual routing domains feature, which allows the splitting of the kernel's routing tables into disjoint planes. It appears to be a virtualization feature. See the merge commit for some information.
The identifier locator addressing feature is aimed at communication within data centers where tasks can migrate from one machine to another.
The discard_max_bytes parameter associated with block devices is now writable. Administrators who are concerned about massive latencies caused by large discard operations can tweak this parameter downward, causing those operations to be split into smaller operations.
The Open vSwitch subsystem has gained a new module providing access to the kernel's network connection-tracking mechanism.
The new "overflow scheduler" in the IP virtual server subsystem "directs network connections to the server with the highest weight that is currently available and overflows to the next when active connections exceed the node's weight"
The MIPS architecture has gained support for the user-space probes (uprobes) mechanism.
There is a new ptrace() operation (PTRACE_O_SUSPEND_SECCOMP) that can be used to suspend secure computing (seccomp) filtering. This operation can only be invoked by a process with CAP_SYS_ADMIN in the initial namespace; it is intended to make it possible to checkpoint processes running in the seccomp mode.
The Smack security module has gained the ability to associate labels with IPv6 addresses.
The SELinux security module has a new ability to check ioctl() calls on a per-command basis.
Audit rules can now target the actions of a process based on which executable it is running.
New hardware support includes:
- Audio: Cirrus Logic CS4349 codecs, Option GTM601 UMTS modem audio codecs, InvenSense ICS-43432 I2S MEMS microphones, Realtek ALC298 codecs, and STI SAS codecs.
- DMA: NXP LPC18xx/43xx DMA engines, Allwinner A10 DMA controllers, ZTE ZX296702 DMA engines, and Analog Devices AXI-DMAC DMA controllers.
- Media: Toshiba TC358743 HDMI to MIPI CSI-2 bridges, Renesas JPEG processing units, Sony Horus3A and Ascot2E tuners, Sony CXD2841ER DVB-S/S2/T/T2/C demodulators, STM LNBH25 SEC controllers, NetUP Universal DVB cards, and STMicroelectronics C8SECTPFE DVB cards.
- Miscellaneous: NXP LPC SPI flash interfaces, IBM CXL-attached flash accelerator SCSI controllers, ZTE ZX GPIO controllers, LG LG4573 TFT liquid crystal displays, Freescale DCU graphics adapters, NXP LPC178x/18xx/408x/43xx realtime clocks, NXP LPC178x/18xx/408x/43xx I2C controllers, Zynq Ultrascale+ MPSoC realtime clocks, Renesas EMEV2 IIC controllers, Atmel SDMMC controllers, and Intel OPA Gen1 InfiniBand adapters.
- Multi-function devices: Wolfson Microelectronics WM8998 controllers and Dialog Semiconductor DA9062 power-management ICs.
- Networking: Teranetics TN2020 PHYs, Sypnopsys DWC Ethernet QOS v4.10a controllers, Mellanox Technologies switches, Microchip LAN78XX-based USB Ethernet adapters, Samsung S3FWRN5 NCI NFC controllers, and Fujitsu Extended Socket network devices.
- Pin control: Freescale i.MX6UL pin controllers, UniPhier PH1-LD4, PH1-Pro4, PH1-sLD8, PH1-Pro5, ProXstream2, and PH1-LD6b SoC pin controllers, Qualcomm SSBI PMIC pin controllers, and Qualcomm QDF2xxx pin controllers.

Changes visible to kernel developers include:

The handling of block I/O errors has been simplified. There is a new bi_error field in struct bio; when something goes wrong an error code will be stored there. The two older error-handling methods (clearing BIO_UPTODATE and passing errors to bi_end_io()) have been removed.
The patch sets adding atomic logic operations and relaxed atomic operations have been merged.
The static-key interface has changed in ways that, one hopes, will reduce the number of recurrent bugs caused by confusing naming in the previous API. See Documentation/static-keys.txt for details.
The ARM architecture has a new, software-implemented "privileged access never" mode that prevents kernel code from accessing user-space addresses. With this mode enabled (the default), only accesses via the kernel's accessor functions will succeed. ARM64 also supports this mode, but it's a direct hardware mode in this case.
There are two new functions for the allocation and freeing of multiple objects from a slab cache:
```
    bool kmem_cache_alloc_bulk(struct kmem_cache *cache, gfp_t gfp,
    	 		       size_t count, void **objects);
    void kmem_cache_free_bulk(struct kmem_cache *cache, size_t count,
    			      void **objects);
```
These functions are useful in performance-critical situations (networking, for example) where the fixed costs of allocation and freeing need to be amortized across a large number of objects.
Module signing now uses the PKCS#7 message format. One change that results is that openssl-devel library (or equivalent) must be installed to build the kernel with signing enabled.
The memremap() mechanism for the remapping of device-hosted memory has been merged. Also merged is the "struct page provider" patch set (described in this article) that creates page structures for nonvolatile memory as needed.

The merge window is set to remain open through September 13, but the pace has clearly slowed. It is probably fair to say that we have seen the bulk of the changes that will go into the 4.3 kernel. That said, tune in next week for a summary of any remaining changes that slip in before the merge window closes.

Comments (3 posted)

Identifier locator addressing

By Jonathan Corbet
September 10, 2015

Companies that run huge data centers have an obvious incentive to get the most performance possible out of each of their vast number of machines. Virtualization and live migration help by allowing tasks to be moved between machines so that each can be run at full utilization, but there is a problem: how do cooperating jobs find each other as they are moved across the data center? Numerous solutions to this problem exist; the 4.3 kernel will have another one, in the form of a technology called identifier locator addressing, or ILA.

ILA, which will work only with IPv6, is built on a simple idea: each task in the data center is assigned a unique identifier that is not tied to any specific location in the net. That identifier is built into that task's IPv6 network address; the networking subsystem then does the necessary magic to route packets properly between them, changing the routing as needed as the task moves between machines.

The details of how ILA works can be found in this draft RFC, written by Tom Herbert, who also happens to be the author of the ILA patches merged into the mainline for 4.3. In short, ILA splits the 128-bit IPv6 network address space into two 64-bit fields; one contains the identifier, the other the locator. The identifier is, as described above, a unique number identifying the task in the center. With 64 bits to play with, ILA can identify enough tasks to work in even the biggest data center — for the foreseeable future, at least. The identifier is not tied in any way to any specific physical machine in the data center. The locator, instead (stored in the upper 64 bits of the IPv6 address), uniquely identifies a physical interface on the network; a packet with an ILA address can be routed across the network using just the locator field.

A task wishing to communicate with another does not know that locator, though; all it knows is the identifier of the task it needs to talk to. This task will put a special "standard identifier representation" (SIR) prefix into the locator field, while the destination task's identifier goes into the lower 64 bits. The resulting "SIR address," which does not correspond to any actual system on the net, indicates to the networking subsystem that the address uses ILA and that the true locator must be filled in at transmission time. In practice, this SIR address will likely be obtained via a DNS lookup and need not be constructed by the task at all, of course.

The task will then open a network connection to the SIR address for the service it needs to contact. The networking stack cannot route the SIR address as-is, though, since that address doesn't correspond to any specific target on the net. Instead, it must find the real machine hosting the task with the given identifier and replace the SIR prefix with a proper locator corresponding to that system. It is thus almost like performing an ARP lookup on the identifier portion of the address. Once the real destination has been placed into the locator field, the packet can be sent on its way. The receiving system will, prior to handing the packet to the application, convert the ILA address back to a SIR address by putting the SIR prefix back into the locator field.

The SIR address will be used for the duration of the connection; it will continue to work even if the addressed task is migrated in the middle. That naturally means that the identifier lookup and SIR-prefix replacement must be done on each outgoing packet. It's worth noting that SIR addresses can be used for both endpoints of a connection, but it's not mandatory. The end result of all this should be a low-overhead mechanism for virtualization of network addresses within a data center. There is no encapsulation or other trickery required; it essentially comes down to a single address-translation step.

There is one little catch, of course: the kernel must somehow keep up with the proper locator value for each identifier of interest. As documented in the networking merge commit, the table of translations can be maintained by way of some extensions to the ip command. In practice, of course, nobody who needs a technology like ILA is going to mess around with ip commands; there will, instead, be some sort of central job-management system that maintains that mapping. How mappings (and changes) will be propagated through a data center is not addressed by the code in the kernel; that's a task for higher-level software. The good news is that mappings are not expected to change all that often (task migration is expensive, so it shouldn't be done more often than is strictly necessary), so the identifier-to-locator mapping can be effectively cached most of the time.

The ILA implementation in 4.3 appears to be a bit of a work in progress. It works, but it suffers a 10% performance penalty with respect to routing without ILA. The source of the slowdown seems to be known, and Tom has promised that it will be dealt with in a forthcoming patch set. There are also difficulties in the interaction with IPvlan [PDF] that should also be fixed in the future. Meanwhile, the core of the new feature is in the mainline and available for those who would like to play with it.

Among other things, ILA is a sign that IPv6 is finally coming into its own. It was not that long ago that IPv6 would not have been considered for performance-sensitive settings like data centers; it is easy enough to use an isolated IPv4 network and avoid the performance issues and application compatibility issues that came with IPv6. But most of those issues have been resolved, and the pressure to move toward IPv6 continues to increase. As technologies like ILA come along and make use of the greatly expanded IPv6 address space, IPv6 may increasingly come to look like the more fully featured alternative.

Comments (3 posted)

The LPC Android microconference, part 1

September 8, 2015

This article was contributed by Nathalie Chan King Choy and John Stultz

Linux Plumbers Conference

The Linux Plumbers Android microconference was held in Seattle on August 20th and looked at a number of topics needing coordination between various players in the Android ecosystem. It was split up into two separate sessions; this summary covers the first three-hour session. Topics covered the state of the staging tree, USB gadgets and ConfigFS, running mainline on consumer devices, partitions and customization, a single binary image for multiple devices, Project Ara, and kdbus.

The microconference started after lunch, following the Graphics microconference, where there was also quite a bit of Android-related discussion. The Android microconference was held in two parts: in the afternoon and evening. In total, it went on for about six and a half hours until 8pm.

State of staging

After a brief outline of the schedule by microconference lead Karim Yaghmour, the sessions started with a review of the Android-related code in the staging tree and a quick discussion of what to do with it all, led by staging maintainer Greg Kroah-Hartman. Ashmem, timed_gpio and timed_output, the low-memory-killer, the sync framework, and ION were all listed, and one by one they were addressed.

Ashmem has close parallels with memfd, which is really only missing the memory-unpinning feature of ashmem. Memory unpinning, which marks memory as eligible to be discarded by the kernel, was unsuccessfully submitted upstream via the "volatile range" patches. While that effort has stalled out, it seems like ashmem going upstream as-is would be somewhat duplicative, and adding something like memory unpinning to memfd (generically or via an ioctl()/shrinker interface similar to what ashmem uses) would be preferred. The outcome of the discussion was to leave ashmem in staging for now and to revisit it again next year.

For the timed_gpio and timed_output drivers, it was claimed that no one was using them, and that the LED-trigger infrastructure should be sufficient to replace them. The Android developers mentioned that the Android vibrator hardware abstraction layer (HAL) was still using them, but shrugged about removing them. Everyone seemed to be OK with dropping it from the kernel, if only to see if anyone would complain, but many worried that vendors would just re-add it instead of moving to LED-triggers.

The mempressure notifier functionality was merged upstream to replace the low-memory-killer back in 3.10, and the Google developers did work to create a user-space low-memory killer daemon that uses it. Unfortunately, it was found not to perform suitably on real devices and the Google developers fell back to using the low-memory-killer again. Not enough debugging has been done to determine why exactly the mempressure notifiers aren't working well. So work is needed, and removing the in-kernel low-memory-killer would be premature. That said, the in-kernel low-memory-killer isn't problem-free, and has been getting bug reports and patches to change its heuristics from various different vendors. Since there is not a clear maintainer, there was a call for who might review submitted patches. Riley Andrews and John Stultz volunteered themselves.

Next up was the sync framework. As had been discussed in the Graphics microconference (Etherpad notes), sync points/fences still needed to be integrated with much of the atomic mode-setting work that was recently merged upstream. There is a need for an open-source user space before changes are merged into DRM; specifically, folks would like to see an open-source HWComposer implementation for Android that uses KMS/DRM (though there is some dispute over if that exists already or not), where that sort of sync integration would take place. So, at the moment, while it's clear it's on a number of folks' to-do lists, it's not clear that anyone is working on de-staging the code yet, so it will remain in place for now.

Finally, the ION memory allocator was discussed. As noted in the Graphics microconference, a few different proposals have been made, but there is no clear decision on what to do with them or which to prioritize. Since no real resolution was found through that discussion, folks explained there would be a mini-BOF the next day to try to hash things out. The core issue with ION is that, while it is used to solve the problem of how to allocate memory to share between different devices with different constraints, it requires that the applications using it have a detailed understanding of the constraints of the hardware, which makes it not very useful for writing applications that support different devices. There is also the problem that there aren't good examples of upstream users that need ION, but it continues to be needed for a number of out-of-tree solutions (which are difficult to merge without ION-like functionality properly upstream). New patches are being submitted against ION that add new features and try to establish things like device tree bindings, which Kroah-Hartman has been stalling for now. It shows that the longer it takes for an alternative solution to be merged, the more entrenched ION potentially becomes.

Kroah-Hartman also asked why nothing new has been added to staging, and it was explained that, while there is a fair amount of out-of-tree code used by Android, the remaining parts are not as independent as what has historically been put into staging. For the most part, what is left is stuff that modifies existing in-kernel code in a way that isn't generic enough or suited yet for upstream.

So the main outcome of the discussion was that timed_gpio would be dropped from staging, and that users should move to the LED-trigger infrastructure, or complain loudly.

USB gadgets and ConfigFS: status & future

The next topic was the "ConfigFS gadget driver" (slides [ODP]), which is a gadget driver that is controlled using ConfigFS. The Android developers are starting to use this driver as they are migrating away from the Android gadget driver, which was a forerunner to the ConfigFS gadget (see the summary from Plumbers 2013 and these slides [PDF] for further background). Andrzej Pietrasiewicz, who is the ConfigFS gadget maintainer from Samsung, outlined why runtime dynamic gadget functionality is useful: because it allows various functionality to be used in different combinations, unlike the statically configured gadgets that previously had to be defined at compile time. Pietrasiewicz then covered a bit of detail on how one configures the ConfigFS gadget to support various "functions" through the ConfigFS filesystem interface. One issue he brought up was that since ConfigFS modifications happen in multiple steps, there isn't one single moment when resources are allocated and bound in the kernel. Instead, allocations and binds are done at various different points as the gadget is configured and enabled. Pietrasiewicz sees this as somewhat problematic, and thinks that all the resources needed for a function should be generated when the directory is created in the ConfigFS mount.

At this point, Badhri Jagan Sridharan, a developer on the Android team who is working on migrating Android to use the ConfigFS gadget, agreed that resource allocation during the mkdir() operation would make more sense and require less time to switch between configurations. He had a few questions about Android's migration to ConfigFS, and asked if there were any guidelines for allocating instances versus functions and when to bind. Pietrasiewicz replied that it all depends on the function, and that there aren't any guidelines currently.

There were some further questions on details, but the session was running over time, so Sridharan and Pietrasiewicz carried on further conversation in the hallway track.

Barriers to running mainline on form-factor devices

The next session was a hurried presentation (slides [PDF]) by John Stultz from Linaro. The benefits of running mainline on off-the-shelf consumer devices (i.e. form-factor devices), while not particularly compelling to end users, are mostly for kernel and android developers. It provides a way to do continuous testing so that upstream issues can be caught early. It also brings more awareness of the unique aspects of mobile environments to upstream developers. The main barriers to doing this break down to hardware-related and software-related issues.

For hardware, the core requirements are an unlockable bootloader and access to a serial UART, which usually requires breaking the case open and soldering. Google has an interesting solution with its headphone debug UART cables on Nexus devices, but not all vendors implement it. A serial alternative mode for USB-C would be nice, but still requires standardization. Binary blobs, while not actually hardware, do effectively constrain the hardware that can be used with upstream kernels.

On the software side, the out-of-tree Android patchset has classically been a blocker but, since 3.4, enough of the Android tree has been merged upstream that limited testing can be done. Also, over the last few Android common kernels there's been steady decline in the amount of code out of tree. There are still a few substantial features out of tree but, for the most part, it's all fairly easily forward-ported so its not as much of a problem as before.

The biggest issue right now, as Tim Bird and others have been talking about, is probably lagging vendor system-on-chip (SoC) support upstream. For the Nexus device kernels that Google releases, we're looking at 1-2 million lines of out-of-tree code, and non-Google trees are apparently as bad as 3 million lines. Even so, a few vendors have been getting much better at upstreaming recently, so hopefully with this year's device releases we'll see if things have improved.

Given the above constraints, the 2013 Nexus 7 seems like a good device to target, and work is in progress. Currently the device is booting from MMC, has a working serial port, USB ConfigFS gadget support for ADB, and MTP support is there as well (with some quirks). GPIO button support for power and volume buttons as well as touch-panel input seem to be working as well. All of this is with around 32 patches, most of which are simple device-tree changes or Android build-integration changes.

Display support was hoped to have been done by now, but isn't; however, Bjorn Anderson and Rob Clark have that already working on the Sony Z3 (which is a related device). Apparently Anderson also has WiFi and the modem working on his device as well. Thanks are due to the folks who have been actually writing and upstreaming the device support required. So far, this has mostly been an integration effort, not a development one, which is due to everyone who did the hard work to make that possible.

There are still a number of areas that need work, such as display, battery charging, power management, WiFi, sensors, camera, and so on. And it's interesting that those issues are common sticking points for most SoCs, which the upstream community should take to heart. What we have upstream may be incomplete or too difficult to work with; we really need more folks working in these areas.

Benefits are already being seen from the effort, as it's helped crystallize which out-of-tree patches are most critical to get merged, and has provided a testing ground for the ConfigFS gadget transition.

Android, partitions, and customization

The next session (slides [PDF]) was by Rom Lemarchand from the Google Android team and covered some of the changes the team made for the Android One effort to ship common kernels and disk images for multiple devices. In the ideal world, there would be one user-space image per architecture, the bootloader would detect the device, populate the device-tree tables so that a common kernel with no out-of-tree patches would boot, and all would be well. But, in reality, every device has customized user space and kernels; vendor kernels have huge diffs from mainline and lots of functionality is being pushed out to proprietary user-space logic or even to trusted OS environments.

For the Android One effort, it was decided to have a single kernel and user space for every device "family", which is basically a set of devices using the same SoC with minor changes in hardware like sensors or cameras. For Android One, the /system partition should only have truly generic architecture-specific code. Since it previously contained all binaries and assets for specific devices, a common /system partition is achieved by keeping the device-specific data in separate partitions.

The /vendor partition that was introduced with the Nexus 9 provides a way for SoC-specific drivers to be included, like graphics drivers or core power-management libraries. Android One introduced a /odm partition that is meant to contain device-specific drivers like the sensor HAL, etc.

Finally, the /oem partition was introduced to store things like background graphics, ringtones, and other OEM-level customization for the device. This allows the different partitions to be updated independently (though the /oem partition is not verified and thus is not able to be updated in the current scheme), and allows each partner involved to do the customization wanted at the right level.

It was asked if the /odm partition could contain kernel modules. Lemarchand said that it could and the immediate follow-up question was who signs those modules. He clarified that while the original design manufacturer (ODM) signs the /odm partition and can do over-the-air (OTA) updates independently, any kernel modules included in that partition have to be signed by Google to be loaded since the kernel uses signed modules. Thus vendors can't create their own signed kernel modules.

Mark Gross from Intel also brought up a pain point that he has: in order to test with new kernel modules, he has to re-flash the system partition with those modules, which causes dm-verity to fail. He'd like to see a untrusted partition for kernel modules, since those are already signed. It was asked if being able to load custom modules remotely via ADB would help; Mark said that it would be interesting, but probably not helpful since the whole kernel and all the modules would need to be updated and tested together.

Running a single Android binary image on multiple devices

The next talk (slides [PDF]), which continued on the theme, was by Samuel Ortiz and covered the work done by Intel's Open Source Technology Centre on the Intel Reference Design for Android (IRDA) effort. He said that IRDA was similar to Android One, but for x86 tablets. The goal was to minimize changes to the Android Open Source Project (AOSP) code and to support fairly small hardware differences like changes to GPS, WiFi, and Bluetooth.

The first major issue the project ran into is that, with the current build system design, changing any single device results in changes in the user-space images. And there are many issues in trying to solve this. For instance, kernel modules can help, but Android's insmod lacks the module-dependency handling normally found in modprobe. There's no easy way to dynamically select different HALs, so for devices that have different types of GPSes, you have to have different HALs for each one. For subsystems like WiFi, there are different wpa_supplicant binaries needed to support different vendors. And some things require configuration values or permission files generated at build time. So the desire is to find some way for this to all be done dynamically at boot time, rather than at build time.

IRDA's approach is to use an autodetection daemon that uses ACPI tables to detect the hardware on the device, then, for that hardware, it triggers predefined actions. The project modified libhardware, so that it talks to the daemon to determine which HAL should be used for the device. For some hardware, like GPUs, the drivers need to be in specific locations, so its image includes multiple HAL drivers in different directories, then its daemon bind mounts the appropriate HAL directory into the right place in the file tree. It also includes a FUSE filesystem driver for /etc/permissions/ that generates the right XML permissions files for the device at runtime.

He ran through some examples of how this works for specific devices, and also outlined some areas where the most trouble was seen. Dynamic HAL selection is probably the biggest issue and something that would help would be to have better reference HALs in AOSP, as most vendors are shipping majorly hacked up implementations, usually based on ancient versions. The project has worked on improving Bluetooth support upstream so that the AOSP reference HAL can work more generically and there isn't the need for vendor-specific Bluetooth HALs. Solving kernel-module-dependency resolution would be nice to have as well.

Yaghmour, of Opersys, asked if this worked when devices disappeared and reappeared at runtime, or if it was mostly for boot time only. Samuel clarified that it could be used with dynamic changes at runtime, but various XML permissions files need to be removed. Yaghmour noted that parts of the framework are unhappy when devices disappear.

It was asked which of the reference HALs Ortiz saw as worst overall. To that, he replied "the binary ones" and wished that the reference HALs were more robust, allowing hardware vendors to not waste time with implementing bad custom HALs. He called out GPS, Bluetooth, NFC, and Graphics specifically as being bad and noted he couldn't point to any one AOSP reference HAL being widely used. Though Brown noted that the Audio HAL was fairly good and, while vendors do tend to copy it, it's reused for the most part, so that's progress.

Dmitry Shmidt from Google asked if the probing and autodetection impacted boot time. Ortiz replied that it varies but can be on the order of 500ms or so, which, for the Android boot time, is quite a small percentage.

Adapting Android for Ara

Karim Yaghmour was up next with his talk (slides [PDF]) about Project Ara and how it's adapting Android for the project's needs. After clarifying that he doesn't speak for anyone but himself, he began by covering the overall goals of Ara, which are to introduce more variety into the hardware ecosystem and to provide a platform for hardware developers that is something like what the mobile app platform is for software developers.

Yaghmour provided an overview of some of the most interesting technologies being used by Ara, starting with the MIPI UniPro interconnect that is used as the bus to connect the modules. He also covered the endoskeleton frame, the contactless capacitive connectors used by the modules, as well as the Electro-Permanent-Magnets (EPMs). The endoskeleton connects all the modules and basically acts as a UniPro switch, allowing for even direct module-to-module communication. The EPMs are currently used to hold the modules to the endoskeleton, but the news of the day was that the EPMs weren't working out for the project. Each module can either consume or provide charge, allowing for battery modules to be included with other module functionality, and allowing for multiple batteries to be used together. Finally, each module can have a custom printed cover, which allows for even further user-directed design.

Yaghmour noted that UniPro supports a number of standard protocols, but for Ara there was quite a lot that needed to be developed. So the project created Greybus, which provides the protocol that handles setting up the switch, module notifications, power-management, and standard class protocols for a number of different devices (camera, audio, Bluetooth, GPS, etc.). Since initially there weren't devices implementing these class protocols, it also provides "bridged PHY" native protocols like i2c, USB, UART, SDIO, and so on, which can be run over Greybus. Yaghmour also noted that, while hardware isn't out yet, there is the gbsim simulator for Greybus, which folks can check out now if they're interested.

There were some questions on how Greybus handles dynamically adding buses that cannot be probed like i2c, and to what extent the existing drivers needed to be changed. After some confusion about the question, Yaghmour said that the module provides a manifest of what it contains, and the related Greybus driver creates the platform device based on that knowledge of the module, so that the existing driver can be reused with as little change as possible. It was also noted that the plan is that the devices would use the class drivers, so the bridged PHYs are sort of a short-term solution until class-based hardware is available.

Yaghmour then talked about how Ara was extending Android to handle dynamic device changes at runtime. He clarified that, after some various proposals were run by the Android team, the Ara project settled on adding the dynamic device handling in the HAL layer, so that changes to the framework layer could be minimized. It will then work on each HAL layer to try to extend it so the framework can handle when devices are unplugged. For the most part, this means it supports the HAL interfaces for devices even if the device isn't present. So, for the camera when the camera isn't loaded, it provides a "no camera" screen to the interface. This allows iterative changes through each subsystem, which will likely have more success than trying to change all the subsystems at once.

There was a comment from Marcel Holtmann, the Bluetooth maintainer from Intel, that this seems problematic since, for some functionality like Bluetooth, the pairing is bound to the physical hardware address of the radio. Yaghmour agreed and noted that if you replace your Bluetooth module with a different module, you would have to re-pair your devices.

There was also a question of where the sensors live; Yaghmour answered that they are usually paired with other devices. An example is that the rotation sensor is included with the screen, which was useful because it allowed a known orientation in space (i.e. which way is up), whereas if it were on other modules, it could be inserted in different slots or with a different orientation, so the system would have to figure out the module orientation relative to the display to understand the sensor data.

Yaghmour then gave a brief recap of the previous sessions, noting that there are a number of different approaches being used to try to make Android more generic and to allow it to support more devices without custom builds. The first, by Lemarchand, handled supporting different devices at flashing time by using a combination of generic and hardware-specific partitioning. The second, by Ortiz, talked about supporting different devices at boot time from a single disk image, using bootloader/firmware data to do bind mounts and FUSE filesystems. And, finally, his talk covered handling different devices at runtime through extending the HAL layer to support hotplugging standard classes of devices.

Integrating kdbus in Android

The last talk (slides [PDF]) of the afternoon session was from Pierre Langlois from ARM about his summer project of trying to run Android's libbinder over kdbus. Langlois covered some potential benefits of the project, which included alleviating the concern of duplicating work and maintainership with two new interprocess communication (IPC) mechanisms being introduced into the kernel, the curiosity of whether it could work, and what each side might be able to learn from trying to run libbinder over kdbus. Being a summer project, Langlois did limit the scope of his work to just see if it was possible to make libbinder run over kdbus; security implications or performance were not a focus.

He provided a basic outline of how binder works, with a simple example of how one might add two numbers together using binder. He also covered how the remote services are named, registered, and found through a special process called the ServiceManager that all processes register with. Langlois then dived into details of the binder kernel driver, and how it maintains per-process memory pools, worker threads, reference counts, and dispatches data from one process to another. The kernel driver also has a special call to identify the ServiceManager process. All of the binder kernel interfaces are abstracted out by the libbinder library.

He then covered an overview of kdbus and what it provides, noting that kdbus keeps track of the service naming internally in a registry, which can be probed by processes to find services, and can also can provide notifications. His assessment was that it was sufficient to implement binder transactions, so Langlois implemented libkdbinder, which is a drop-in replacement for libbinder that is implemented using kdbus.

There are some conceptual differences that had to be resolved. Since kdbus provides a service-name registry, Langlois got rid of the ServiceManager component in Android and provided compatibility calls into the kdbus service-list functionality. Also, since binder normally handles thread pools to service requests, and kdbus does not manage worker threads, libkdbinder needed to spawn off and manage the threads itself. With these issues worked out, he was able to successfully parcel up binder transactions, send them with kdbus synchronous messages, and unparcel the returned results.

So far, he has a working proof of concept that passes the binder test cases found in AOSP. This progress actually surprised some of the Android developers, who quickly asked if it supported some of the more obscure binder features. Langlois noted that, while the really obscure weak-references feature, which doesn't actually seem to be used in binder, isn't implemented, most of the other functionality is there.

He then mentioned that he did try booting Android with libkdbus to see if it would work, but ran into one unexpected problem: kdbus seemed to fail when passing ashmem file descriptors over it. It seems that kdbus only allows for sealed memfd file descriptors to be sent, which is more restrictive than what Android uses for ashmem file descriptors shared over binder. Folks in the room were a little confused, as they thought it really ought to work, but there wasn't anyone who could speak authoritatively on why this issue was a problem. Langlois thought that probably kdbus needs to be extended to handle ashmem file descriptors, but had run out of time to debug the issue further. He suspects that after the ashmem issue is resolved, there will probably be other blockers that pop up, but in the end he thinks getting it working to feature parity is entirely doable.

For future work, Langlois sees reviewing the security implications as a high priority. There were also questions of allowing for more than one bus, since running binder names on a desktop kdbus might cause problems. Folks in the room seemed confident that it would be easily doable to implement a separate binder kdbus instance.

Andrews, who is one of the binder maintainers from Google, was a little skeptical that performance parity would be possible, as binder is well-optimized for its use case. One of his current tasks is reworking the binder kernel driver to break apart some of the locking that is causing lock contention on devices with a higher CPU count. One thing he did note is that in doing his rework, he's planning to greatly extend the binder test cases, so that he can be confident he doesn't break anything, and that work will likely provide better test cases for libkdbinder to be tested against.

The code isn't yet released, but Langlois and Serban Constantinescu from ARM were working to get the code released soon and possibly plan to hand the project off to Linaro. Stultz pressed that getting this implementation out and being able to provide analysis of the performance difference or any restrictions (like the ashmem issue) that kdbus has would be important. Since, if there any changes that might be needed in kdbus, it would be particularly useful to know about them soon, as it could affect the conversation with regard to kdbus getting merged.

And with that, the first session was over, and folks took a break. The second session will be summarized in an upcoming article.

[Thank you to all the presenters for their discussions, Karim Yaghmour for organizing and running the microconference, and Rom Lemarchand for helping get so many of the Google Android team to attend.]

Comments (1 posted)

Patches and updates

Kernel trees

Kamal Mostafa Linux 3.13.11-ckt26 ?

Architecture-specific

Jungseok Lee Implement IRQ stack on ARM64 ?

Julien Grall xen/arm64: Add support for 64KB page in Linux ?

Takao Indoh x86: Intel Processor Trace Logger ?

Vikas Shivappa Intel cache allocation and Hot cpu handling changes to cqm, rapl ?

Core kernel code

Parav Pandit devcg: device cgroup extension for rdma resource ?

Nikolay Borisov Containerise nproc count ?

Daniel Wagner Simple wait queue support ?

Development tools

Josh Poimboeuf Compile-time stack metadata validation ?

Device drivers

Lee Jones remoteproc: Add driver for STMicroelectronics platforms ?

Chee Nouk Phoo 0/2] Add Altera Modular ADC Driver ?

Enric Balletbo i Serra Add initial support for slimport anx78xx ?

Matt Ranostay iio: new chemical sensor framework and channel types ?

Sanchayan Maity Implement OCOTP driver for Vybrid using NVMEM ?

Vaibhav Hiremath mmc: sdhci-pxav3: Enable support for PXA1928 SDCHI controller ?

Clément Vuchener Add Corsair Vengeance K90 driver ?

Enric Balletbo i Serra Add support for tps65217 charger ?

Alex Smith mtd: nand: jz4780: Add NAND and BCH drivers ?

Bayi Cheng Mediatek SPI-NOR flash driver ?

Chunfeng Yun Mediatek xHCI support ?

Justin Chen watchdog: driver for BCM7038 and newer chips. ?

Cyrille Pitchen add driver for Atmel QSPI controller ?

Kuninori Morimoto misc: add CS2000 Fractional-N driver ?

Javi Merino Devfreq cooling device ?

Device driver infrastructure

Tomeu Vizoso On-demand device probing ?

Javi Merino Let the power allocator thermal governor run on any thermal zone ?

Lina Iyer PM / Domains: Generic PM domains for CPUs/Clusters ?

Michael J. Coss kobject: support namespace aware udev ?

Junghak Sung Refactoring Videobuf2 for common use ?

Filesystems and block I/O

Andreas Gruenbacher Richacls ?

Anna Schumaker VFS: In-kernel copy system call ?

Sheng Yong UBIFS: add ACL support ?

Memory management

Kirill A. Shutemov THP refcounting redesign ?

Ebru Akagunduz mm: make swapin readahead to gain more thp performance ?

Eric B Munson Allow user to request memory to be locked on page fault ?

Security-related

Jari Ruusu Announce loop-AES-v3.7e file/swap crypto package ?

Tycho Andersen c/r of seccomp filters via underlying eBPF ?

Miscellaneous

Hajime Tazaki an introduction of Linux library operating system (LibOS) ?

Wang Nan perf tools: filtering events using eBPF programs ?

Karel Zak util-linux v2.27 ?

David Sterba Btrfs progs release 4.2 ?

Stef Bon Announce: fuse-backup ?

Pavel Emelyanov Checkpoint-restore tool v1.7 ?

Mathieu Desnoyers Userspace RCU 0.7.15 and 0.8.8 ?

Page editor: Jonathan Corbet
Next page: Distributions>>