Kernel development
Brief items
Kernel release status
The current development kernel is 4.3-rc1, released on September 12. "I decided that I'm not interested in catering to anything that comes in tomorrow, and I might as well just close the merge window and do the -rc1 release." In the end, 10,756 non-merge changesets were pulled during this merge window.
Stable updates: 4.1.7, 3.14.52, and 3.10.88 were released on September 13.
Quotes of the week
Kernel development news
4.3 Merge window, part 3
Last week's merge window summary ended with a guess that the bulk of the changes for 4.3 had been seen by that point. By the time Linus released 4.3-rc1 and closed the merge window, 10,756 non-merge changesets had been pulled into the mainline repository; that's about 550 since last week. So the patch rate did indeed fall off as expected, but there were still some significant changes that slipped in before the window closed.
Significant user-visible changes include:
- The new file /proc/kpagecgroup can be used to determine which memory control group each page of physical memory is charged to.
- The idle-page tracking feature has been merged. This feature enables the discovery of memory pages that are not in active use; this information can be used to optimize the allocation of memory between containers or virtual machines.
- The membarrier() system call, which has been circulating since (at least) 2010, has been merged. See this commit for the latest description and man page.
- The control-group writeback improvement patches have been merged.
- New hardware support includes Microchip LAN88XX PHYs, NXP LPC18xx/43xx watchdog timers, Atmel SAMA5D4 watchdog timers, Toradex Colibri VF50 touchscreens, and Freescale i.MX6UL touchscreen controllers.
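As a rough illustration of how /proc/kpagecgroup might be consulted from user space, here is a minimal sketch. It assumes the file follows the same layout as the other /proc/kpage* files: one 64-bit value per page frame, indexed by page-frame number, with each value being the inode number of the memory control group the page is charged to. The helper names (kpagecgroup_offset(), kpagecgroup_read()) are hypothetical, and the file is normally readable only by root.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed layout: one 64-bit entry per page frame, indexed by PFN. */
static off_t kpagecgroup_offset(uint64_t pfn)
{
    return (off_t)(pfn * sizeof(uint64_t));
}

/*
 * Hypothetical helper: read the entry for one PFN from the given file
 * (normally "/proc/kpagecgroup"). Returns 0 and stores the memory
 * cgroup's inode number on success, or -1 on any error.
 */
static int kpagecgroup_read(const char *path, uint64_t pfn,
                            uint64_t *cgroup_ino)
{
    ssize_t n;
    int fd;

    assert(path != NULL && cgroup_ino != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    n = pread(fd, cgroup_ino, sizeof(*cgroup_ino), kpagecgroup_offset(pfn));
    close(fd);
    return n == (ssize_t)sizeof(*cgroup_ino) ? 0 : -1;
}
```

The inode number obtained this way could then be matched against the inode numbers of the directories in the memory controller's hierarchy to find the owning group.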
get_vaddr_frames()
With regard to changes visible to kernel developers: there has been one significant addition to the memory-management interface. Certain driver subsystems (media, for example) have long had to reach deep into the memory-management subsystem to implement high-performance I/O to user-space buffers. Some of the resulting code raised eyebrows in memory-management circles; it also stood in the way of efforts to make the mmap_sem semaphore less of a bottleneck.
Jan Kara has been working on reducing mmap_sem use for a while; that effort has extended into improving the memory-management primitives used in the driver tree. In 4.3 he has added a new set of helpers for the easy mapping of I/O buffers. This code creates a new type, struct frame_vector, to describe a mapped buffer. That structure lives in <linux/mm.h>, but it's probably best if most users treat it as if it were an opaque structure.
A frame vector can be allocated or destroyed with:
struct frame_vector *frame_vector_create(unsigned int nr_frames);
void frame_vector_destroy(struct frame_vector *vec);
Here, nr_frames is the maximum number of pages that will be mapped using this vector. The mapping itself is done with:
int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
                     bool write, bool force, struct frame_vector *vec);
The beginning virtual address of the buffer to be mapped is passed in start, and the size of the buffer, in pages, in nr_frames. If write is set, the buffer will be mapped for write access; if force is set, write access will be set up even if the buffer is mapped read-only in user space. After a successful call, vec will contain the results of the mapping, and the return value will be the number of pages actually mapped.
If the buffer lives in ordinary memory, get_vaddr_frames() will take a reference to each mapped page to keep it in RAM. That reference must be released at some point to unpin the pages; to do so, call:
void put_vaddr_frames(struct frame_vector *vec);
Note that frame_vector_destroy() does not make this call; users must take care to do it themselves.
Once upon a time (i.e. last year), this type of interface would have returned an array of struct page pointers to refer to the mapped pages. The increasing use of not quite real memory in hardware has created pressure to use page-frame numbers (PFNs) instead. As it happens, the contents of the frame vector may be either struct page pointers or PFNs, depending on the type of memory mapped. Driver-level code need not be aware of this practice, but it does have to be explicit about what it wants. To gain access to the mapped buffer (for DMA mapping, for example), use one of:
struct page **frame_vector_pages(struct frame_vector *vec);
unsigned long *frame_vector_pfns(struct frame_vector *vec);
The call to frame_vector_pages() can fail if it is not possible to represent the buffer using page structures; the error code will be returned as an ERR_PTR() value, so a macro like IS_ERR() should be used to check the returned pointer before using it. Conversion to PFNs is unconditionally successful in the current implementation.
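The calling sequence described above can be sketched as follows. Since kernel code cannot be run on its own, this is a user-space model: struct page, struct frame_vector, IS_ERR(), PTR_ERR(), and the bodies of the five frame-vector functions are simplified stand-ins written for this example (they are assumptions, not the kernel's implementation), and demo_pin_user_buffer() is a hypothetical driver-side helper. Only the prototypes and the order of calls follow the text.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* ---- User-space stand-ins (assumptions, not the kernel's code) ---- */
#define DEMO_MAX_PAGES 16
struct page { int refcount; };
static struct page demo_ram[DEMO_MAX_PAGES]; /* pretend page structures */
static int demo_live_vectors;                /* created but not destroyed */

struct frame_vector {
    unsigned int nr_allocated;               /* capacity of the vector */
    unsigned int nr_frames;                  /* frames currently pinned */
    struct page *ptrs[DEMO_MAX_PAGES];
};

#define IS_ERR(p)  ((uintptr_t)(p) >= (uintptr_t)(-4095))
#define PTR_ERR(p) ((int)(intptr_t)(p))

static struct frame_vector *frame_vector_create(unsigned int nr_frames)
{
    struct frame_vector *vec;

    if (nr_frames == 0 || nr_frames > DEMO_MAX_PAGES)
        return NULL;
    vec = calloc(1, sizeof(*vec));
    if (vec) {
        vec->nr_allocated = nr_frames;
        demo_live_vectors++;
    }
    return vec;
}

static void frame_vector_destroy(struct frame_vector *vec)
{
    assert(vec->nr_frames == 0);             /* frames must be released first */
    demo_live_vectors--;
    free(vec);
}

static int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
                            bool write, bool force, struct frame_vector *vec)
{
    unsigned int i;

    (void)start; (void)write; (void)force;
    if (nr_frames > vec->nr_allocated)
        nr_frames = vec->nr_allocated;
    for (i = 0; i < nr_frames; i++) {
        demo_ram[i].refcount++;              /* "pin" the page */
        vec->ptrs[i] = &demo_ram[i];
    }
    vec->nr_frames = nr_frames;
    return (int)nr_frames;
}

static void put_vaddr_frames(struct frame_vector *vec)
{
    unsigned int i;

    for (i = 0; i < vec->nr_frames; i++)
        vec->ptrs[i]->refcount--;            /* drop the pin */
    vec->nr_frames = 0;
}

static struct page **frame_vector_pages(struct frame_vector *vec)
{
    return vec->ptrs;
}

/* ---- The part a driver would write: pin, use, unpin, destroy ---- */
static int demo_pin_user_buffer(unsigned long start, unsigned int nr_pages)
{
    struct frame_vector *vec = frame_vector_create(nr_pages);
    struct page **pages;
    int ret;

    if (!vec)
        return -ENOMEM;
    ret = get_vaddr_frames(start, nr_pages, true, false, vec);
    if (ret < 0)
        goto out_destroy;
    if ((unsigned int)ret != nr_pages) {     /* only part of the buffer mapped */
        ret = -EFAULT;
        goto out_put;
    }
    pages = frame_vector_pages(vec);
    if (IS_ERR(pages)) {                     /* not representable as pages */
        ret = PTR_ERR(pages);
        goto out_put;
    }
    /* A real driver would build its DMA mapping from pages[] here. */
    ret = pages[0]->refcount > 0 ? 0 : -EFAULT;
out_put:
    put_vaddr_frames(vec);                   /* destroy does not do this */
out_destroy:
    frame_vector_destroy(vec);
    return ret;
}
```

The two exit labels reflect the rule stated above: put_vaddr_frames() has to be called explicitly before frame_vector_destroy(), on both the success path and the error paths.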
All of the above functions are exported to modules (with no explicit GPL limitation). Driver code has been fixed up in a number of places (example) to use the new interface; the result is a net reduction in lines of code and, hopefully, an improvement in robustness.
In summary
Meanwhile, the 4.3 development cycle has now entered the stabilization phase. For the curious, Stephen Rothwell's post-merge-window summary gives some statistics about the changes that were merged this time around. It seems that 94% of them were in the linux-next tree at the beginning of the merge window, with the DRM and networking trees being the biggest source of commits that weren't there. Some 587 commits in linux-next didn't make it into the mainline; nearly a quarter of those come from the kdbus tree, which was not proposed for merging this time around. Also not merged was OrangeFS, which ran into some trouble when the pull request went out; chances are good OrangeFS will make it in for 4.4.
Extended system call error reporting
The interface between the kernel and user space is, in places, surprising in its complexity. There are numerous tasks that involve passing detailed information about hardware configurations, process state, and more, in either direction. When something goes wrong, though, that communication channel narrows to a single integer error code, often making it difficult for developers to figure out what is going on. There have been various proposals for widening the error-reporting interface in the past; the latest proposal, from Alexander Shishkin, may not get any further than its predecessors, but it does show that there is ongoing interest in the problem.
As an example, consider the VIDIOC_S_FMT ioctl() call, provided by the media subsystem. Its job is to set the format of images returned to user space from a capture device (such as a camera). There is a mind-boggling variety of possible image formats out there, so the format description passed to the kernel from user space contains a number of complex, interrelated parameters. There are a lot of ways such a description can go wrong, and that's before the vagaries of specific driver implementations are taken into account. Should there be a problem, though, the only thing user space is likely to know is that the VIDIOC_S_FMT call returned EINVAL. The kernel, of course, knows what it was objecting to, but there is no way to communicate that knowledge to user space.
Fixing that problem is not easy; the errno mechanism is clearly inadequate, but it is set in stone by several decades of Unix tradition and cannot be changed without breaking applications. So any extended error information must be carried by a new channel that can be ignored by unaware applications. The addition of error information to the kernel must also be done carefully, so as to avoid slowing down kernel hot paths or clogging the source with an overwhelming set of error messages. Alexander's patch attempts to meet all of these criteria.
How the mechanism works is best illustrated with some examples. In his patch set, Alexander targets the perf_event_open() system call; it takes a perf_event_attr structure as a parameter. That structure has a vast and growing set of parameters describing the events to be captured and, correspondingly, there are a lot of ways in which things can go wrong.
Describing errors
The first step is to create a structure that describes an error site — a place where an error is detected and passed back to user space. That structure should include a field called site that holds an ext_err_site structure; beyond that, it can contain any information needed to fully report the error to user space. In the perf case, that structure looks like this:
#include <linux/exterr.h>
struct perf_ext_err_site {
    struct ext_err_site site;
    const char *attr_field;
};
The attr_field member is meant to hold the name of the field in struct perf_event_attr that generated the error.
Then, it is necessary to define a function that can turn any extra information in this structure into a string to be passed back to user space. The perf version is:
static char *perf_exterr_format(void *site)
{
    struct perf_ext_err_site *psite = site;
    return kasprintf(GFP_KERNEL, "\t\"attr_field\": \"%s\"\n",
                     psite->attr_field);
}
Note that this function returns a dynamically allocated string; the extended error infrastructure will free that string when it is no longer needed.
With these two pieces in place, it is possible to define an "error domain" that handles a specific class of errors — perf errors in this case:
DECLARE_EXTERR_DOMAIN(perf, perf_exterr_format);
The actual reporting of error information is done by way of a rather frightening bit of macro magic called ext_err(). Real users will almost certainly wrap it, though; this is how it is done in the perf code:
#define perf_err(__code, __attr, __msg) \
    ({ /* make sure it's a real field before stringifying it */ \
        struct perf_event_attr __x; (void)__x.__attr; \
        ext_err(perf, __code, __msg, \
                .attr_field = __stringify(__attr)); \
    })
The parameters to ext_err() are the name of the domain defined above, the (negative) error code, a message to be reported to user space, and a set of field initializers for the rest of the error-site structure. In this case, the final parameter to ext_err() sets the attr_field of the perf_ext_err_site structure to the name of the erroneous attribute. See this patch for an actual invocation of the perf_err() macro.
There are a couple of other important details. One is that the EXTERR_MODNAME symbol must be set separately before calling ext_err():
#define EXTERR_MODNAME "perf"
The other is that ext_err() returns a value, which is a modified version of the error code passed into it. This code can be thought of as an index into an array of ext_err_site structures describing every extended error known to the kernel. The normal way to return the error code to user space will then be through a line like:
return ext_err_errno(code);
The modified code from ext_err() must not be returned directly to user space, since applications will have no idea what it means. On the other hand, the original error code should not be returned without calling ext_err_errno(); that call is the one that causes the kernel to remember the extended error information. In short, there is a new task_struct field called ext_err_code; the call to ext_err_errno() causes the special error code to be placed into that field. If an ordinary (non-extended) error code is passed to ext_err_errno(), the right thing will happen, so it is safe to use in situations where supporting code might return either ordinary or extended error codes.
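To make the relationship between the two kinds of error code concrete, here is a small user-space model. It is only a sketch under assumptions: the patch's real encoding is not described here, so this model simply places extended codes below -4095 and uses the remainder as an index into a table of error sites. All model_* names, the MODEL_MAX_ERRNO value, and the two sample sites are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-in for the error-site structure described above. */
struct ext_err_site {
    int code;               /* original (negative) errno value */
    const char *msg;        /* message to be reported to user space */
};

/* Assumption: extended codes are encoded below the ordinary errno range. */
#define MODEL_MAX_ERRNO 4095

/* Hypothetical table standing in for "every extended error known". */
static const struct ext_err_site model_sites[] = {
    { -EINVAL,     "sample site 0 (made up for this sketch)" },
    { -EOPNOTSUPP, "sample site 1 (made up for this sketch)" },
};
#define MODEL_NR_SITES (sizeof(model_sites) / sizeof(model_sites[0]))

/* Stands in for the ext_err_code field in task_struct. */
static int model_task_ext_err_code;

/* Like ext_err(): returns a modified code that identifies the site. */
static int model_ext_err(size_t site_index)
{
    assert(site_index < MODEL_NR_SITES);
    return -(MODEL_MAX_ERRNO + 1 + (int)site_index);
}

static int model_is_ext_err(int code)
{
    return code < -MODEL_MAX_ERRNO;
}

/*
 * Like ext_err_errno(): remembers an extended code for later retrieval
 * and returns the plain errno that user space expects. Ordinary codes
 * pass through unchanged.
 */
static int model_ext_err_errno(int code)
{
    size_t index;

    if (!model_is_ext_err(code))
        return code;
    index = (size_t)(-code - MODEL_MAX_ERRNO - 1);
    assert(index < MODEL_NR_SITES);
    model_task_ext_err_code = code;
    return model_sites[index].code;
}
```

In this model, a caller that forgets model_ext_err_errno() would leak an out-of-range value to user space, while a caller that returns the plain errno directly would lose the site information; those are the two mistakes the text warns about.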
The user-space side
At this point, the kernel is prepared to tell user space about an extended error, but the return value from the system call can still only be the ordinary errno value that it has always been. If the application wants to know more, it can make a call like:
char message[SIZE];
int len = prctl(PR_GET_ERR_DESC, message, SIZE);
The returned description is not just an ordinary message; it is a string in the JSON format containing the file and line where the error was generated, the error code, the module name, the actual message, and any specific information added by the domain format function described above. The changes to the user-space perf tool duly include a new JSON parser to pick this message apart again. The prctl() call will clear the error information on the kernel side, so a second call will return no data.
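An application that only cares about one field might not need a full JSON parser. The sketch below shows a deliberately naive way to pull a single string value, such as the attr_field entry emitted by perf_exterr_format(), out of the description. The function name demo_json_string_field() is hypothetical, the exact shape of the kernel's output is an assumption, and the code does not handle escaped quotes or nested objects.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the string value associated with "key" from a JSON-like
 * description into out. Returns the value's length, or -1 if the key
 * is missing, the value is not a string, or out is too small.
 * Naive by design: the first occurrence of "key" wins.
 */
static int demo_json_string_field(const char *desc, const char *key,
                                  char *out, size_t outlen)
{
    char pattern[64];
    const char *p, *end;
    size_t klen, len;

    assert(desc != NULL && key != NULL && out != NULL && outlen > 0);
    klen = strlen(key);
    if (klen + 3 > sizeof(pattern))
        return -1;

    /* Build the quoted key, e.g. "attr_field" with its quotes. */
    pattern[0] = '"';
    memcpy(pattern + 1, key, klen);
    pattern[klen + 1] = '"';
    pattern[klen + 2] = '\0';

    p = strstr(desc, pattern);
    if (p == NULL)
        return -1;
    p += klen + 2;

    /* Expect optional blanks, a colon, optional blanks, a quote. */
    while (*p == ' ' || *p == '\t')
        p++;
    if (*p != ':')
        return -1;
    p++;
    while (*p == ' ' || *p == '\t')
        p++;
    if (*p != '"')
        return -1;
    p++;

    end = strchr(p, '"');
    if (end == NULL)
        return -1;
    len = (size_t)(end - p);
    if (len >= outlen)
        return -1;
    memcpy(out, p, len);
    out[len] = '\0';
    return (int)len;
}
```

Given a description containing the line produced by the format function above, a call with the key attr_field would yield the name of the offending perf_event_attr member.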
The patch set has, thus far, not seen much in the way of review comments. In the end, the error-reporting issue is one that most developers recognize, but few feel up to trying to fix. So it is hard to say whether this attempt to widen the error-reporting channel from the kernel will meet with success or not. Ancient traditions are hard to change but, every now and then, somebody succeeds.
The LPC Android microconference, part 2
The Linux Plumbers Android microconference was held in Seattle on August 20th. It included discussions of a variety of topics, many of which need to be coordinated within the Android ecosystem. The microconference was split up into two separate sessions; this summary covers the second session, which was held for three hours in the evening. Topics were toybox in Android, improving AOSP vendor trees, providing per-task quality of service, and improving big.LITTLE on Android.
The first session's summary can be found here.
Toybox in Android
Rob Landley started the second half of the microconference with some background on the toybox project, which has recently been included in the Android Open Source Project (AOSP), replacing some components of Android's toolbox. Landley described his early attempts to learn how to build a distribution from scratch, starting in 1999, which he would use to create his Aboriginal Linux project. He picked up the BusyBox tool for his project and, after improving it, became the maintainer.
It was after the GPLv3 came out that the trouble started, Landley said. Some code contributed to BusyBox was GPLv2-only, which caused trouble with relicensing. So efforts to audit the code, then remove and replace any GPLv2-only submissions, were made. He became disillusioned with what he saw as the "breaking" of "the GPL", as no longer was there just "the GPL", and it was no longer a universal receiver of code. In the past, most open-source licenses were GPL-compatible and could be up-converted to GPL as needed. With the introduction of GPLv3, along with lots of GPLv2-only code in existence, suddenly this was no longer the case. He handed over BusyBox maintainership to the best person he knew for the job, and started playing with toybox, which he wrote under what he calls a "BSD zero-clause" license—basically public-domain—which would allow the code to be compatible with any other license.
However, soon after starting toybox, he mothballed the project. This was mostly due to the fact that he liked the new BusyBox maintainer and didn't want to undermine the project by leaving and immediately starting a competitor. Also BusyBox had a ten-year head start, which made it intimidating to try to catch up.
Landley then started talking about a paper he had co-written a decade ago that described how big transitions in computing are always opportunities, since big established players are often unseated by smaller upstarts. The paper was focused on the transition to 64-bit systems and how this was an opportunity for Linux to unseat Windows.
Landley realized that the transition to mobile was a similar major transition, and Apple's iOS was likely to eventually unseat existing workstations if it were to become dominant. He saw Android as thus becoming an important vehicle for the preservation of a non-proprietary future. His history with Aboriginal Linux made him interested in trying to make Android self-hosting. Unfortunately, due to Android not accepting any GPL code in user space, BusyBox would not be able to be used, but toybox could be. Thus, he restarted development on toybox.
At this point, Karim Yaghmour had a few questions, the first being: how far is toybox currently away from BusyBox? Landley pointed to an online status page that explains (with a somewhat complicated key) what parts are left to do. There is also a roadmap, which helps show the scope of what the project is trying to achieve. That list includes: POSIX-2008, Linux Standard Base (LSB), a self-hosting environment, as well as fully replacing Android's toolbox and Tizen's core tools.
Yaghmour also asked which commands from toybox have not yet replaced the equivalents in Android's toolbox. Landley again referred people to the status page, but he also pointed out that you can look in AOSP at the toolbox Android.mk file, which will clarify which non-toybox tools are still being used. Landley noted that Elliott Hughes of the Android team has been helping with some of the testing of toybox. Hughes syncs upstream toybox to the AOSP tree every two weeks.
At this point the talk ran out of time.
Improving vendor AOSP repositories
The next talk (slides [PDF]) was a discussion led by John Stultz from Linaro on some of the issues he's seen in various vendor trees, and what might be done to improve things. Most vendors utilize a fork-and-try-to-forget model with AOSP, targeting each device with its own tree. Linaro isn't any different, really, as it maintains quite a number of separate AOSP trees for efforts like 96board support, work on Project Ara, as well as for other projects. This behavior is problematic, since it makes operating system updates more complicated, so vendor device updates suffer. That results in real security difficulties, as recently seen with the Stagefright issue.
Additionally, devices do tend to include functionality from a number of different vendors: a system-on-chip (SoC) from one vendor, Bluetooth and WiFi from another, sensors from another, etc. As noted in previous talks in the microconference, integrating vendor hardware abstraction layers (HALs) into an AOSP repository is usually a non-trivial process. HALs do not integrate into the tree in a uniform way and some vendor HALs require tweaks to the framework layer. This all complicates things further when it comes time to handle a release update, since there are now multiple parties that are being depended on for updates, which all have to be integrated together.
Another area of pain is the build system. The device.mk and BoardConfig.mk files used to describe the device usually contain a large set of global build variables, which have no expressions of inter-dependencies (for example, enabling BOARD_HAVE_BLUETOOTH_BCM won't do anything if one forgets to enable BOARD_HAVE_BLUETOOTH), and the device.mk files that list out the PRODUCT_PACKAGES to be included have a ton of duplication from device to device. In fact, it seems most vendors create mid-layer common directories and have logic to try to share some of the standard entries for different devices, just to avoid the heavy duplication required.
AOSP is not really structured to be able to host community contributions, which is another issue. Google has reasonably limited AOSP to hosting device-specific code only for Nexus devices, which the company is able to test and maintain. While this is an understandable and practical decision for Google, it keeps vendors and others in the community working in their own private trees. Since there's no reason to submit code, this results in a limited culture of review, so there isn't necessarily a community sense for what is good or bad code.
Further, there's a missing sense of best practices. This may not be true for some vendors who are closer partners with Google, but for the wider community that can't attend the private boot-camp meetings, it is. There is a lack of documentation, so things like HAL integration approaches often end up being done in a cargo-cult manner. Improving documentation and having better examples are areas that need work.
After running through the issues, the discussion was opened up to try to see what could be done to improve things, and not just from the "Google should do X" angle, but also what the community-at-large could potentially implement.
One question was "should HALs be submitted to AOSP?", but since AOSP doesn't accept non-Nexus device support, this didn't seem appropriate. Additionally, many HALs are completely proprietary, so licensing issues would prevent that from happening. There is also the fact that vendors are quite often focused on shipping products, so it's not clear that, even if the code was welcome to be submitted for review, many vendors would take the extra time required to deal with the feedback of that code review. So the angle of finding processes to make things easier for the vendors should be considered.
There was also a suggestion that the community create some space where code that couldn't go into AOSP be collected. This might be a possibility, but it would be good to avoid the "creating another fork to solve all the forks" sort of solution.
One idea for the build system is to try to reduce the amount of duplicative code in the device directories with something like Kconfig. This would help express configuration dependencies, reducing the number of options required to be specified, and ideally make it easier to build for multiple devices with only a change to the configuration file. Samuel Ortiz mentioned that Intel basically does this, though it doesn't have the dependency tracking. It uses a configuration file to define the device and some common infrastructure in the tree processes that file. Stultz noted that many vendors have something like this to make it easier to build multiple devices; it points to something that may need to be shared generically.
Dmitry Shmidt of Google's Android team asked how Linaro handles doing validation for devices it doesn't have access to. The answer was that it doesn't, and it's a problem. However, another point was raised that the Linux kernel deals with this all the time, since people don't have test machines for all architectures, and it's handled by delegating testing responsibility to architecture maintainers. It was noted that a perspective from the Chrome OS folks might be useful, as they've been able to delegate testing for a wider array of devices.
Rom Lemarchand from Google mentioned that he would like to see more vendors submit code to AOSP. But some attendees said that it is hard to get patches reviewed on Gerrit. A number of folks in the room agreed, saying they had run into similar problems. The Google developers said that it sounded like maybe the auto-adding of maintainers in Gerrit was broken. They promised to look into it and get it solved quickly.
While Google controls commit access to AOSP, its Gerrit installation allows anyone to review and comment on proposed changes. It was proposed that a group of non-Google folks could make a pointed effort to review patches submitted to Gerrit, which would help build up a better community sense of code taste and might even lead to growing external maintainers. Lemarchand mentioned that it would also be nice because it would allow the Google team to better understand whose reviews could be trusted in the community.
The discussion sort of dwindled at this point, so Stultz suggested moving on to the next talk.
Providing per-task quality of service
Next, Juri Lelli from ARM talked (slides [PDF]) about his work on the energy-aware scheduler (EAS) and the potential for use of deadline scheduling in Android. He started with an overview of the EAS, describing how its focus is on per-entity load tracking and per-entity utilization tracking, which try to size up tasks so that they can be properly placed on the right core in an asymmetric multi-processing environment. The goal is to use as little energy as possible while still getting good performance. When the system is not over-utilized, the scheduler will try to pack tasks onto a single CPU, but once a tipping point is crossed and that one CPU is overloaded, it falls back to the conventional approach of spreading the load around.
He described how EAS ties more of the CPU-power logic like dynamic voltage and frequency scaling (DVFS) together into the scheduler, by allowing the scheduler to trigger cpufreq governors directly and make decisions about the speed of each CPU that is being scheduled. This allows for the scheduler to ramp up a processor's frequency to increase capacity if it wants to add a task to that CPU. He also talked a little bit about schedtune, which provides a single sysfs knob to boost performance on a global or per-cgroup basis.
At this point it was noted that most of the EAS discussion focused on energy, while the interactive cpufreq governor in Android is mostly focused on latency, so it was asked: how does latency come into the picture and how can it be controlled? Lelli noted that the interactive cpufreq governor tends to boost frequency quickly to provide interactive latency benefits to the user. The schedtune sysfs knob allows for similar boosting, so when an event occurs Android user space could provide some short-term boosting to improve latency response. Riley Andrews from Google noted that the interactive cpufreq governor gives this benefit without user space having to do anything. Lelli pointed out, though, that the governor does so a bit blindly, whereas schedtune allows a more informed and focused response that could save power.
Andrews was also curious about all the various knobs that were available via EAS. Lelli noted that the out-of-tree HMP scheduler, used by some vendors for big.LITTLE machines, had way too many knobs and was very difficult to tune. So, with EAS, more of the logic has been kept internal in the scheduler so it is easier to get it right. Andrews also wondered about the heuristics for task placement. Lelli noted that there are heuristics and the scheduler needs to make a guess and try to compute the difference in energy used so it can try to give the best tradeoff in performance and energy usage. It was suggested that for more information, folks attend the EAS microconference session that would be going on the next day.
Lelli then switched to the deadline scheduler, which he has been maintaining with others. He mentioned some of the benefits of SCHED_DEADLINE over other policies like SCHED_FIFO, like how it avoids starvation issues that SCHED_FIFO can have. It works using a resource-reservation system that guarantees that a task will get a specific amount of CPU time in a given period. The scheduler just needs to be provided the runtime, period, and deadline values; then it will indicate if it can achieve those constraints or not.
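For readers who have not used SCHED_DEADLINE, the following sketch shows how the three values might be handed to the kernel through the sched_setattr() system call. The structure is a local mirror of the kernel's struct sched_attr, renamed to demo_sched_attr so that it cannot clash with any libc definition; the demo_* helper names are hypothetical, and the parameter check is a simplified version of the ordering rule (runtime <= deadline <= period), not the kernel's full admission test. The 10ms-in-100ms figures used in the comment are illustrative only.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define DEMO_SCHED_DEADLINE 6   /* value of SCHED_DEADLINE in the Linux UAPI */

/* Local mirror of the kernel's struct sched_attr (first version, 48 bytes). */
struct demo_sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;     /* CPU time guaranteed per period, in ns */
    uint64_t sched_deadline;    /* relative deadline, in ns */
    uint64_t sched_period;      /* reservation period, in ns */
};

/* Simplified sanity check; the kernel applies further rules of its own. */
static int demo_dl_params_ok(uint64_t runtime_ns, uint64_t deadline_ns,
                             uint64_t period_ns)
{
    return runtime_ns > 0 && runtime_ns <= deadline_ns &&
           deadline_ns <= period_ns;
}

/* Share of one CPU reserved, e.g. 10ms every 100ms gives 10 (percent). */
static unsigned int demo_dl_utilization_pct(uint64_t runtime_ns,
                                            uint64_t period_ns)
{
    assert(period_ns > 0);
    return (unsigned int)(runtime_ns * 100 / period_ns);
}

/*
 * Ask the kernel to run the calling thread under SCHED_DEADLINE.
 * Returns 0 on success and -1 on failure; the call fails if the
 * reservation cannot be admitted or the caller lacks privilege.
 */
static int demo_set_deadline(uint64_t runtime_ns, uint64_t deadline_ns,
                             uint64_t period_ns)
{
    struct demo_sched_attr attr;

    if (!demo_dl_params_ok(runtime_ns, deadline_ns, period_ns))
        return -1;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.sched_policy = DEMO_SCHED_DEADLINE;
    attr.sched_runtime = runtime_ns;
    attr.sched_deadline = deadline_ns;
    attr.sched_period = period_ns;
#ifdef SYS_sched_setattr
    return (int)syscall(SYS_sched_setattr, 0, &attr, 0);
#else
    (void)attr;
    return -1;                  /* system call not known to these headers */
#endif
}
```

A failure return from the system call is how the scheduler "indicates" that it cannot honor the requested constraints.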
He showed graphs of the performance implications of this model for tasks like movie playback and wanted to know if folks were interested in using this for Android. His thought was that SurfaceFlinger or AudioFlinger might be able to use this policy. One question that came up was how does the deadline scheduler's reservation system handle CPU frequency changes. Lelli replied that it doesn't at the moment and that there needs to be some integration of the policies, so that the deadline scheduler can specify a minimum frequency in order to guarantee that the deadlines are met.
There was also some discussion on how it would interact with SCHED_FIFO tasks. It was clarified that SCHED_DEADLINE runs at a higher priority than SCHED_FIFO, but it can be used so that you can, for example, give a guaranteed limit of 10% of runtime to SurfaceFlinger. That means you don't have to deal with starvation issues, which had been mentioned at Google I/O as a problem in making the audio pipeline run as a SCHED_FIFO task. Andrews said that he had looked at it for SurfaceFlinger earlier, but had some issues with the interface and the cpufreq issue was problematic as well. There were a few more questions, but it was suggested that they be brought up at the EAS microconference the next day.
Improving big.LITTLE on Android
Todd Kjos (who was standing in for Tim Murray) from Google's Android team then started reviewing the team's experience dealing with big.LITTLE devices (slides [PDF]). For the most part, big.LITTLE issues and the complex tuning required have been left to the vendors to sort out, as Google hasn't directly addressed it so far. However, that's starting to change. He showed a few examples of the complexities of different styles of asymmetric CPUs that they have seen, from more standard big.LITTLE pairings to more complicated ones, where there might be different sets of "little" CPUs running at different speeds, in addition to the "big" CPUs.
He showed some graphs of the power-performance curves for the different CPUs, which may have cross-over points where it becomes obvious it's worth jumping from the smaller CPUs to the bigger ones. However, he also showed charts where there might be a gap between the two curves, making tuning much more complicated, since to gain any performance at that gap, you have to make a much bigger jump in power consumption.
He described some of the changes in Android M to help. Android keeps track of which tasks are foreground and which tasks are background tasks, so it now uses cpusets to help pin all background tasks to the small CPU or CPUs.
Bobby Batacharia from ARM asked if all asynchronous tasks are considered background tasks. Kjos and Andrews responded that not all are; it depends on which tasks they're interacting or running with. Mark Gross then asked how much work is done in background tasks. The Android developers said not much, but it can be intermittent, though they didn't have hard numbers.
Andrews then mentioned that the team really does want to better enforce the notion of foreground and background tasks. Even when background tasks are limited by cpusets and lower priorities, some still cause scheduling interference with foreground tasks. Stultz asked if pushing foreground tasks to SCHED_RR (the round-robin realtime scheduling class) would solve this, but Andrews noted that the team was really avoiding making tasks with non-deterministic runtimes run as round-robin.
Stultz then asked how the cpusets interacted with Android's use of cgroups. Kjos clarified that the two work together. Android still uses cgroups to regulate task CPU usage, but the cpusets help pin tasks to processors, so the tasks don't otherwise fan out to the big CPUs. Batacharia noted that EAS should help with that problem.
As the session wrapped up, Kjos indicated that these types of systems will be an area of greater focus for his team in the coming year.
At that point, after over six hours, the microconference came to an end, and everyone quickly left for dinner and drinks.
[Thank you to all the presenters for their discussions, Karim Yaghmour for organizing and running the conference, and Rom Lemarchand for helping get so many of the Google Android team to attend.]
Page editor: Jonathan Corbet
