
Leading items

Welcome to the LWN.net Weekly Edition for January 10, 2019

This edition contains the following feature content:

  • What should be in the Python standard library?: a python-dev discussion revisits the "batteries included" philosophy.
  • A new free-software forge: sr.ht: a look at Drew DeVault's email-centric, fully free software forge.
  • The rest of the 5.0 merge window: the final changes pulled for 5.0, including new mincore() semantics.
  • A setback for fs-verity: objections to its API may keep fs-verity out of the kernel for this cycle.
  • Pressure stall monitors: a proposed mechanism for low-latency notification of memory pressure.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

What should be in the Python standard library?

By Jake Edge
January 9, 2019

Python has always touted itself as a "batteries included" language; its standard library contains lots of useful modules, often more than enough to solve many types of problems quickly. From time to time, though, some have started to rethink that philosophy, to reduce or restructure the standard library, for a variety of reasons. A discussion at the end of November on the python-dev mailing list revived that debate to some extent.

Jonathan Underwood raised the issue, likely unknowingly, when he asked about possibly adding some LZ4 compression library bindings to the standard library. As the project page indicates, it fits in well with the other compression modules already in the standard library. Responses were generally favorable or neutral, though some, like Brett Cannon, wondered if it made sense to broaden the scope a bit to create something similar to hashlib but for compression algorithms. Gregory P. Smith had a different take, however:

I don't think adding lz4 to the stdlib is worthwhile. It isn't required for core functionality as zlib is (lowest common denominator zip support). I'd argue that bz2 doesn't even belong in the stdlib, but we shouldn't go removing things. PyPI makes getting more algorithms easy.

If anything, it'd be nice to standardize on some stdlib namespaces that others could plug their modules into. Create a compress in the stdlib with zlib and bz2 in it, and a way for extension modules to add themselves in a managed manner instead of requiring a top level name? Opening up a designated namespace to third party modules is not something we've done as a project in the past though. It requires care. I haven't thought that through.
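
The existing standard-library compressors already share a loosely parallel one-shot interface, which is part of what makes a hashlib-style "compress" namespace sound plausible. A minimal sketch, using only modules that ship with Python today:

    import bz2
    import lzma
    import zlib

    data = b"LWN.net Weekly Edition " * 100

    # zlib, bz2, and lzma all offer one-shot compress()/decompress() calls
    # with the same basic shape, much as hashlib unifies hash algorithms.
    for mod in (zlib, bz2, lzma):
        packed = mod.compress(data)
        assert mod.decompress(packed) == data
        print(f"{mod.__name__}: {len(data)} -> {len(packed)} bytes")

The third-party python-lz4 bindings under discussion provide a similar one-shot interface in their lz4.frame module, but only after a trip to PyPI.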

Steven D'Aprano objected to Smith's assertion about the Python Package Index (PyPI): "PyPI makes getting more algorithms easy for *SOME* people." He noted that in many environments (e.g. schools, companies) users cannot install additional software on the computers they are using, so PyPI is not the panacea it is sometimes characterized as.

That led Cannon to suggest discussing the standard library and its role: "We have never really had a discussion about how we want to guide the stdlib going forward (e.g. how much does PyPI influence things, focus/theme, etc.)." Paul Moore wasn't sure that discussing the matter would really resolve anything, though:

I'm not sure a formal discussion on this matter will help much - my feeling is that most people have relatively fixed views on how they would like things to go (large stdlib/batteries included vs external modules/PyPI/slim stdlib). The "problem" isn't so much with people having different views (as a group, we're pretty good at achieving workable compromises in the face of differing views) as it is about people forgetting that their experience isn't the only reality, which causes unnecessary frustration in discussions. That's more of a people problem than a technical one.

A larger standard library would help those without access to PyPI, Antoine Pitrou argued, while a smaller one does not provide huge benefits: "Python doesn't become magically faster or more powerful by including less in its standard distribution: the best it does is make the distribution slightly smaller." But there are definite downsides to having a large standard library, Benjamin Peterson said:

These include:
  • The [development] of stdlib modules slows to the rate of the Python release schedule.
  • stdlib modules become a permanent maintenance burden to CPython core developers.
  • The blessed status of stdlib modules means that users might use a substandard stdlib module when a better third-party alternative exists.

Steve Dower would rather see a smaller standard library with some kind of "standard distribution" of PyPI modules that is curated by the core developers. Later in the thread, he listed numerous different Python distributions as examples of what he meant, but that just highlighted another problem, Moore said: which of those should he recommend to his users? Right now, the standard library provides the base that a Python script can rely on:

Every single one of those distributions includes the stdlib. If we remove the stdlib, what will end up as the lowest common denominator functionality that all Python scripts can assume? Obviously at least initially, inertia will mean the stdlib will still be present, but how long will it be before someone removes urllib in favour of the (better, but with an incompatible API) requests library? And how then can a "generic" Python script get a resource from the web?
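
Moore's point about a baseline is easy to demonstrate; fetching a resource from the web with nothing but the standard library works on any stock Python 3 installation, with no installation step required. A minimal example:

    from urllib.request import urlopen

    # No "pip install" needed; urllib.request ships with Python itself.
    with urlopen("https://lwn.net/") as response:
        body = response.read()
    print(len(body), "bytes fetched")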

Moore acknowledged that maintaining modules in the standard library has a "significant cost" but wondered if moving to the distribution model was simply shifting those costs to users—without users gaining much from it. Nathaniel Smith looked at the list of distributions and came to a different conclusion: the "single-box-of-batteries" model is not really solving the problems it needs to solve.

If Python core wants to be in the business of providing a single-box-of-batteries that solves Paul's problem, then we need to rethink how the stdlib works. Or, we could decide we want to leave that to the distros that are better at it, and focus on our core strengths like the language and interpreter. But if the stdlib isn't a single-box-of-batteries, then what is it?

It's really hard to tell whether specific packages would be good or bad additions to the stdlib, when we don't even know what the stdlib is supposed to be.

But Moore found that to be overstated somewhat. For him (and presumably others), the standard library is what you can expect to find when you have Python installed. That means that various things like StackOverflow answers, tutorials, books, and so on can rely upon those pieces being present, "much like you'd expect every Linux distribution to include grep". In addition, the "batteries included" attribute is likely to have been part of what helped Python grow into one of the most popular languages, D'Aprano said. "The current model for the stdlib seems to be working well, and we mess with it at our peril."

Nathaniel Smith sees some advantages to the "standard distribution" model, though he is not sure that it would really be the best option. "But what I like about it is that it could potentially reduce the conflict between what our different user groups need, instead of playing zero-sum tug-of-war every time this comes up." Others don't see it that way, though; "not every need can be solved by the stdlib", as Pitrou put it. He continued:

So, yes, there's a discussion for each concretely proposed package about whether it's sufficiently useful (and stable etc.) to be put in the stdlib. Every time it's a balancing act, and obviously it's an imperfect decision. That doesn't mean it cannot be done.

Moore concurred: "In exploring alternatives, let's not lose sight of the fact that the stdlib has been a huge success, so we know we *can* deliver an extremely successful distribution based on that model, no matter how much it might trigger regular debates :-)" In any case, as he pointed out, a more concrete proposal (in the form of a PEP) is going to be needed before any real progress can be made. Dower floated some ideas about what a distribution might look like along the way, but, without something like a PEP to discuss, participants are often talking past each other based on their assumptions.

The topic has come up before on the Python mailing lists and at Python Language Summits. In 2015, there was a discussion at the summit on adding the popular Requests module to the standard library. Participants recognized that there were significant barriers—development pace, certificate handling, no asyncio support—to moving it into the standard library. In the end, it made sense for Requests to stay out. At the 2018 summit, Christian Heimes brought up a number of batteries that should perhaps be removed from the set, though the effort to create a PEP listing them seems to have stalled.

No firm conclusions were drawn in the discussion, but part of the underlying problem seems to be a lack of clarity on what the purpose of the standard library is. At the 2015 summit, Cannon suggested an informational PEP be drafted to solidify that; until that happens, there will be wildly differing views on what role the standard library serves. At the moment, though, there is no process to accept or reject a PEP even if one were on offer; that will have to await the new Python Steering Council, which will be elected in early February. One of the first orders of business for that group will likely be to address the PEP process.

As far as adding LZ4 goes, the overall feeling from the thread is that it would be useful to have it in the standard library—at least for those not looking to change the standard library model. Adding LZ4 also requires a PEP, however, so that process may be stalled by the governance change, as well.

Comments (20 posted)

A new free-software forge: sr.ht

By Jake Edge
January 8, 2019

Many projects have adopted the "GitHub style" of development over the last few years, though, of course, there are some high-profile exceptions that still use patches and mailing lists. Many projects are leery of putting all of their project metadata into a proprietary service, with limited means of usefully retrieving it should that be necessary, which is why GitLab (which is at least "open core") has been gaining some traction. A recently announced effort looks to kind of bridge the gap; Drew DeVault's sr.ht ("the hacker's forge") combines elements of both styles of development in a "100% free and open source software forge". It looks to be an ambitious project, but it may also suffer from a lack of "social network" effects, which is part of what sustains GitHub as the forge of choice today, it seems.

The announcement blog post is replete with superlatives about sr.ht, which is "pronounced 'sir hat', or any other way you want", but it is a bit unclear whether the project quite lives up to all of that. It combines many of the features seen at sites like GitHub and GitLab—Git hosting, bug tracking, continuous integration (CI), mailing list management, wikis—but does so in a way that "embraces and improves upon the email-based workflow favored by git itself, along with many of the more hacker-oriented projects around the net". The intent is that each of the separate services integrate well with both sr.ht and with the external ecosystem so that projects can use it piecemeal.

There are two sides to the sr.ht coin at this point; interested users can either host their own instance or use the hosted version. For now, the hosted version is free to use, since it is still "alpha", but eventually users will need to sign up for one of the plans, which range from $2 to $10 per month, to stay on the hosted service. There are instructions for getting sr.ht to run on other servers; it uses nginx, PostgreSQL, Redis, and Python 3, along with a mail server and a cron daemon.

While, overall, the documentation is rather terse and a bit scattered, it is not difficult to get started using the service by following the tutorial. Logging in allows one to create a Git repository; adding an SSH public key to the account then allows pushing an existing repository up to the system. From there, it can be browsed, as shown in the core sr.ht repository, cloned by others, and so on.

[sr.ht-dev mailing list]

As mentioned, sr.ht has not taken the approach of being yet another GitHub clone. Instead, it is geared toward a mailing-list-centric approach, possibly using the sr.ht mailing list component. The sr.ht-dev mailing list (seen at right) provides an example of the user interface for that component. Unlike some other forges or mailing-list replacements, it is not JavaScript-heavy—in fact, sr.ht uses no JavaScript at all, so pages are small (less than 10KB on average) and load quickly.

There is a guide to the preferred development and collaboration style for sr.ht. It is based around git send-email to a mailing list, with copies to potential reviewers, much as Linux kernel development is done. Instead of forking a repository on the server, as is done on GitHub and others, contributors make a local clone, commit their changes there, and then post them to the list for review. Once a patch is ready for merging, maintainers can apply it using git am. This is, of course, quite different from the web-centric "pull request" model used by GitHub and others.

Wikis for sr.ht can be created using the man component. Wikis are simply a Git repository of Markdown files that get converted to HTML and served when they are visited. In addition, any sr.ht Git repository can have a top-level README.md, which will be shown when the repository is browsed and could provide a link to a project-specific wiki.

The build and CI component, builds.sr.ht, is what DeVault calls "the flagship product from sr.ht". His announcement notes that he has been working with both Linux and non-Linux (e.g. BSD, Hurd) distributions to have them start using it because "it's the only platform which can scale to the automation needs of an entire Linux distribution". He also says that smaller users are switching away from Travis CI and Jenkins to builds.sr.ht.

On builds.sr.ht, simple YAML-based build manifests, similar to those you see on other platforms, are used to describe your builds. You can submit these through the web, the API, or various integrations. Within seconds, a virtual machine is booted with KVM, your build environment is sent to it, and your scripts start running. A diverse set of base images are supported on a variety of architectures, soon to include the first hardware-backed RISC-V cycles available to the general public. builds.sr.ht is used to automate everything from the deployment of this Jekyll-based blog, testing GitHub pull requests for sway, building and testing packages for postmarketOS, and deploying complex applications like builds.sr.ht itself. Our base images build, test, and deploy themselves every day.

The build manifests specify more than just how to build the project; "test" tasks can be specified as well. The manifests also specify the platform (e.g. Alpine Linux, FreeBSD) that should be used for the build and test tasks. Build manifests can be placed in particular locations (.build.yml, .builds/*.yml) in a Git repository in order to run them automatically when new code is pushed to the repository. More information about builds.sr.ht can be found in the tutorial, manual, and API reference.

There is also a bug/issue tracking component called "todo". Its user manual is particularly brief as of this writing ("TODO: write these docs"). There are other places one will run into missing documentation pages, perhaps most critically for the code review page that is referred to in the lists.sr.ht documentation for those new to mailing lists. One would guess those holes will be filled in before too long.

The project is written in Python 3 and licensed under the Affero GPLv3. As noted, it is an ambitious project, but one has to wonder whether the mailing-list-centric workflow will survive long term. The instructions on how to set up mail clients and the descriptions of proper mailing-list etiquette may not sit well with newer developers. Email is painful to set up and use these days; many are finding the alternatives far more attractive.

Ultimately, what a project like sr.ht needs in order to fill out its feature base, grow, and thrive is new projects. There is an existing stable of projects that are run in a way that is compatible with sr.ht, but not very many new projects are going that route—for good or ill. In addition, the social effects of GitHub (and, to a lesser extent, GitLab, at least in the free-software world) are a major piece of what makes that model so successful; it is hard to see sr.ht replicating that to any significant degree. It is an interesting project, though, and one that deserves well-wishes; for compatible projects looking for a home, it is certainly worth a look.

Comments (40 posted)

The rest of the 5.0 merge window

By Jonathan Corbet
January 7, 2019
Linus Torvalds released 5.0-rc1 on January 6, closing the merge window for this development cycle and confirming that the next release will indeed be called "5.0". At that point, 10,843 non-merge change sets had been pulled into the mainline, about 2,100 since last week's summary was written. Those 2,100 patches included a number of significant changes, though, including some new system-call semantics that may yet prove to create problems for existing user-space code.

The most significant changes merged in the last week include:

Architecture-specific

  • The C-SKY architecture has gained support for CPU hotplugging, ftrace, and perf.

Core kernel

  • There is a new "dynamic events" interface to the tracing subsystem. It unifies the three distinct interfaces (for kprobes, uprobes, and synthetic events) into a single control file. See this patch posting for a brief overview of how this interface works.

Hardware support

  • Miscellaneous: NVIDIA Tegra20 external memory controllers, Qualcomm PM8916 watchdog timers, TQ-Systems TQMX86 watchdog timers, MediaTek Command-Queue DMA controllers, UniPhier MIO DMA controllers, Raspberry Pi touchscreens, Amlogic Meson PCIe host controllers, and Socionext UniPhier PCIe controllers.
  • Pin control: NXP IMX8QXP pin controllers, Mediatek MT6797 and MT7629 pin controllers, Actions Semi S700 pin controllers, and Renesas RZ/A2 GPIO and pin controllers.
  • Support for high-resolution mouse scroll wheels has been significantly improved.

Security

  • A small piece of the secure-boot lockdown patch set has landed in the form of additional control over the kexec_file_load() system call. There is a new keyring (called .platform) for keys provided by the platform; it cannot be updated by a running system. Keys in this ring can be used to control which images may be run via kexec_file_load(). It has also become possible for security modules to prevent calls to kexec_load(), which cannot be verified in the same manner.
  • The secure computing (seccomp) mechanism can now defer policy decisions to user space. See this new documentation for details on the final version of the API.
  • The fscrypt filesystem encryption subsystem has gained support for the Adiantum encryption mode (which was added earlier in the merge window).
  • The semantics of the mincore() system call have changed; see below for details.

Internal kernel

  • The venerable access_ok() function, which verifies that an address lies within the user-space region, has lost its first argument. This argument was either VERIFY_READ or VERIFY_WRITE depending on the type of access, but no implementation of access_ok() actually used that information. The new prototype is:

        int access_ok(void *address, int len);
    

    The patch implementing this change ended up modifying over 600 files. There have also been several follow-up patches fixing various issues created by this change.

Changing mincore()

The mincore() system call is used to determine which pages in a virtual address-space range are currently resident in the page cache; the idea is to allow an application to learn which of its pages can be accessed without incurring page faults. As Torvalds notes in this commit, the intended semantics of this call have always been "somewhat unclear", but its behavior all along has been to indicate which pages are resident in the cache, regardless of whether the calling process has ever tried to access those pages. In other words, mincore() would reveal the presence of pages faulted in by other processes running in the system.

Naturally, it turns out that if you can observe aspects of the system state that are the result of other processes' activity, you can use that observation to extract information that should be hidden. Daniel Gruss et al. have recently released a paper [PDF] showing how mincore() can be exploited in just this manner. In response, Jiri Kosina posted a patch allowing system administrators to turn mincore() into a privileged system call by way of a sysctl knob, but Torvalds wasn't pleased with that approach. He responded with a patch restricting the information returned by mincore() to anonymous pages and a small subset of file pages.

After Jann Horn pointed out that restricting the query to the calling process's page tables reduces the attack surface considerably, though, Torvalds decided to change his approach. As a result, the patch that was committed adds no new knobs, but does unconditionally restrict mincore() to pages that are actually mapped by the calling process — pages that said process has accessed at some point. That makes it much harder to use mincore() to observe what other processes are doing; as Torvalds pointed out, though, such observation is still theoretically possible, but harder.

So the easy attack is closed, but that additional security may come at the cost of creating problems for user space. As Torvalds noted in the changelog:

NOTE! This is a real semantic change, and it is for example known to change the output of "fincore", since that program literally does a mmap without populating it, and then doing "mincore()" on that mapping that doesn't actually have any pages in it.

I'm hoping that nobody actually has any workflow that cares, and the info leak is real.

If the change breaks code in the wild, it may have to be reverted and some other solution found; for this reason, this patch has not been marked for inclusion into the stable kernels. For those out there who have code that uses mincore(), now would be a good time to test the new semantics to ensure that things still work as expected.
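
The behavior is easy to observe from user space. The sketch below is a rough Python equivalent of what fincore does (calling mincore() through ctypes, and assuming Linux with glibc): it maps a file without touching it and asks which pages are resident. Under the old semantics it reports whatever happens to be in the page cache; under the new semantics an untouched mapping should show few or no resident pages.

    import ctypes
    import ctypes.util
    import mmap
    import os
    import sys

    # Rough illustration only; the default file name is just an example.
    path = sys.argv[1] if len(sys.argv) > 1 else "/etc/os-release"

    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    libc.mincore.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                             ctypes.POINTER(ctypes.c_ubyte)]
    page_size = os.sysconf("SC_PAGE_SIZE")

    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        # Map the file but do not fault any of its pages in.
        m = mmap.mmap(f.fileno(), size, access=mmap.ACCESS_COPY)
        npages = (size + page_size - 1) // page_size
        vec = (ctypes.c_ubyte * npages)()
        buf = (ctypes.c_char * size).from_buffer(m)
        if libc.mincore(ctypes.addressof(buf), size, vec) != 0:
            raise OSError(ctypes.get_errno(), "mincore() failed")
        print(sum(b & 1 for b in vec), "of", npages, "pages reported resident")
        del buf      # drop the exported buffer so the mapping can be closed
        m.close()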

A couple of significant things were not merged before the merge window closed, including the controversial fs-verity patch set. Also missing again is the new filesystem mounting API, though some of the precursor patches did go in toward the end of the merge window. Unless something surprising happens, the feature set for this cycle is complete and the 5.0 kernel is now in the stabilization phase, with a final release expected in late February.

Comments (2 posted)

A setback for fs-verity

By Jonathan Corbet
January 3, 2019
The fs-verity mechanism, created to protect files on Android devices from hostile modification by attackers, seemed to be on track for inclusion into the mainline kernel during the current merge window when the patch set was posted at the beginning of November. Indeed, it wasn't until mid-December that some other developers started to raise objections. The resulting conversation has revealed a deep difference of opinion regarding what makes a good filesystem-related API and may have implications for how similar features are implemented in the future.

The core idea behind fs-verity is the use of a Merkle tree to record a hash value associated with every block in a file. Whenever data from a protected file is read, the kernel first verifies the relevant block(s) against the hashes, and only allows the operation to proceed if there is a match. An attacker may find a way to change a critical file, but there is no way to change the Merkle tree after its creation, so any changes made would be immediately detected. In this way, it is hoped, Android systems can be protected against certain kinds of persistent malware attacks.
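
For readers who have not encountered the data structure, the following toy sketch shows the general idea (it does not reproduce fs-verity's actual hash format or on-disk layout): each fixed-size block of the file is hashed, then pairs of hashes are hashed together repeatedly until a single root hash remains.

    import hashlib

    BLOCK = 4096   # block size for this illustration only

    def merkle_root(path):
        """Hash each block of a file, then hash pairs of hashes up to a root."""
        with open(path, "rb") as f:
            level = [hashlib.sha256(block).digest()
                     for block in iter(lambda: f.read(BLOCK), b"")]
        if not level:
            return hashlib.sha256(b"").digest()
        while len(level) > 1:
            if len(level) % 2:                 # duplicate the last hash if odd
                level.append(level[-1])
            level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                     for i in range(0, len(level), 2)]
        return level[0]

Verifying any single block then requires only the hashes along its path to the root, so a read can be checked without rehashing the whole file, and no block can be changed without changing the root hash.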

There is no opposition to the idea of adding functionality to the kernel to detect hostile modifications to files. It turns out, though, that there is indeed some opposition to how this functionality has been implemented in the current patch set. See the above-linked article and this documentation patch for details of how fs-verity is meant to work. In short, user space is responsible for the creation of the Merkle tree, which must be surrounded by header structures and carefully placed at the beginning of a block after the end of the file data. An ioctl() call tells the kernel that fs-verity is to be invoked on the file; after that, the location of the end of the file (from a user-space point of view) is changed to hide the Merkle tree from user space, and the file itself becomes read-only.

Christoph Hellwig was the first to oppose the work, less than two weeks before the opening of the merge window. The storage of the Merkle tree inline was, he said, "simply not acceptable" and the interface should not require a specific way of storing this data. He later suggested that the hash data should be passed separately to the ioctl() call, rather than being placed after the file data. Darrick Wong suggested a similar interface, noting that it would give the filesystem a lot of flexibility in terms of how the hash data would be stored.

Dave Chinner complained that storing the Merkle tree after the end of the file was incompatible with how some filesystems (XFS in particular) use that space. He described the approach as being "gross", arguing that it "bleeds implementation details all over the API" and creates problems far beyond the filesystems that actually implement fs-verity:

That's the problem here - fsverity completely redefines the layout of user data files for everyone, not just fsverity, and not just the filesystems that implement fsverity. You've taken an ext4 fsverity implementation feature and promoted it to being a linux-wide file data layout standard that is encoded into the kernel/user ABI/API forever more.

Chinner, too, argued that the Merkle-tree data should be provided separately to the kernel, rather than being stored in the file itself using a specific format. Filesystem implementations could still put the data after the end of the existing data, but that is a detail that should, according to Chinner, be hidden from user space.

Eric Biggers, the developer of fs-verity, responded that, while the API requires user space to place the Merkle tree after the end of user data, there is no actual need for filesystems to keep it there:

As explained in the documentation, the core code uses the "metadata after EOF" format for the API, but not necessarily the on-disk format. I.e., FS_IOC_ENABLE_VERITY requires it, but during the ioctl the filesystem can choose to move the metadata into a different location, such as a file stream.

He also said that passing the Merkle tree in as a memory buffer is problematic, since it could be too large to fit into memory on a small system. (The size of this data also prevents it from being stored as an extended attribute as some have suggested.) Generating the hash data in the kernel was also considered, Biggers said, but it was concluded that this task was better handled in user space.

Ted Ts'o claimed repeatedly that there would be no value to be had by changing the API for creating protected files; he described the complaints as "really more of a philosophical objection than anything else". The requested API, he said, could be added later (in addition to the proposed API, which would have to be maintained indefinitely) if it turned out to be necessary. After the discussion continued for a while, he escalated the discussion to Linus Torvalds, asking for a decision:

Linus --- we're going round and round, and I don't think this is really a technical dispute at this point, but rather an aesthetics one. Will you be willing to accept my pull request for a feature which is being shipped on millions of Android phones, has been out for review for months, and for which, if we *really* need to add uselessly complicated interface later, we can do that?

[Correction: I've been reminded that there was an extensive discussion of this work in early 2018 where many of the same objections were raised.]

Complaining that the code had been out for review makes some sense; it is true that the objections surfaced at something close to the last minute. But that often happens in kernel development; the imminent merging of controversial code can concentrate developers' minds in that direction. Arguing that the API is already being shipped is definitely not a way to win favor. That notwithstanding, Ts'o had clearly hoped for a ruling from Torvalds that the current API was good enough and that the code could be merged.

What came back might well have failed to please anybody in the discussion, though. It turns out that Torvalds has no real objection to the model of storing the hash data at the end of the file itself:

So honestly, I personally *like* the model of "the file contains its own validation data" model. I think that's the right model, so that you can then basically just do "enable verification on this file, and verify that the root hash is this".

So that part I like. I think the people who argue for "let's have a separate interface that writes the merkle tree data" are completely wrong.

From there, though, he made it clear that he was not happy with the current implementation. This model, he said, should be independent of any specific filesystem, so it should be entirely implemented in the virtual filesystem layer. At that point, filesystems like XFS would never even see the fs-verity layer, so its implementation could not be a problem for them. A generic implementation would require no filesystem-specific code and would just work universally. He also disliked the trick that hides the Merkle tree after the fs-verity mode has been set; the validation data for the file should just be a part of the file itself, he said.

As Ts'o pointed out, keeping the hash data visible in the file would create confusion for higher-level software that has its own ideas about the format of any given file. He also provided some reasons for why he thinks filesystems need to be aware of fs-verity; they include ensuring that the right thing happens if a filesystem containing protected files is mounted by an older version of the filesystem code. Making fs-verity fully generic would, he said, have forced low-level API changes that would have affected "dozens of filesystems", a cost that he doesn't think is justified by the benefits.

The last message from Ts'o was sent on December 22; Torvalds has not responded to it. No pull request for fs-verity has been sent, however, and it is getting late in the merge window for such a thing to show up. [Correction: a pull request was sent, copied only to the linux-fscrypt mailing list; it has not received a response as of this writing.] It seems likely that fs-verity is going to have to skip this development cycle while the patches are reworked to address some of the objections that have been raised — those from Torvalds, at least. Even then, the work might be controversial; it is rare for the kernel to interpret the contents of files, rather than just serving as a container for them, and some developers are likely to dislike an implementation that depends on that sort of interpretation. But if Torvalds remains in favor of such an approach, it is likely to find its way into the kernel eventually.

Comments (50 posted)

Pressure stall monitors

By Jonathan Corbet
January 4, 2019
One of the useful features added during the 4.20 development cycle was the availability of pressure-stall information, which provides visibility into how resource-constrained the system is. Interest in using this information has spread beyond the data-center environment where it was first implemented, but it turns out that there are some shortcomings in the current interface that affect other use cases. Suren Baghdasaryan has posted a patch set aimed at making pressure-stall information more useful for the Android use case — and, most likely, for many other use cases as well.

As a reminder, the idea behind the pressure-stall mechanism is to track the amount of time that processes are unable to execute because they are waiting for resources (for CPU time, memory, and I/O bandwidth in particular). For example, reading /proc/pressure/memory will yield output like:

    some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
    full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258

This output says that at least one process has been blocked waiting for memory 70.24% of the time over the last ten seconds, or 68.52% of the time over the last minute. In the last ten seconds, all processes have been stalled 57.59% of the time, indicating a system that is seriously short of memory. An orchestration system monitoring this system would see that over half the CPU time is going to waste because the demands on memory are too high; corrective action is probably indicated.
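
Reading and parsing those numbers from a program is straightforward; a small helper along these lines (assuming a 4.20 or later kernel with pressure-stall accounting enabled) is all that a monitoring tool needs:

    def read_pressure(resource="memory"):
        """Return the pressure-stall data for cpu, memory, or io as a dict."""
        info = {}
        with open(f"/proc/pressure/{resource}") as f:
            for line in f:
                kind, *fields = line.split()
                info[kind] = {key: float(value) for key, value
                              in (field.split("=") for field in fields)}
        return info

    memory = read_pressure("memory")
    print("some avg10:", memory["some"]["avg10"])
    print("full avg10:", memory["full"]["avg10"])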

The Android runtime system also tries to manage the set of running processes to make the best use of the hardware while providing acceptable response times to the user. When memory gets tight, for example, background processes may be killed to ensure that the application the user is engaging with at the moment has the resources it needs to run quickly. The pressure-stall information has some obvious utility when it comes to this kind of automated resource management: it provides exactly the kind of information needed to determine whether the system's response time is being affected by a shortage of memory.

The problem, from the Android point of view, is that the information provided is too little and too late. The highest-resolution information available is aggregated over ten seconds; that is entirely adequate for most data-center settings, but it's far too slow for a device that is interacting directly with users. If it takes ten seconds to learn that the device is getting sluggish, the user is likely to be getting grumpy by the time any corrective action is taken. Such users might well conclude that they are better off not staring into their phone all day, and that would clearly be bad for the industry as a whole.

The answer to this problem is to extend the pressure-stall mechanism to allow for high-frequency monitoring of stall data. With the patch set applied, an interested application can open /proc/pressure/memory for write access, then write a line containing three pieces of information:

    type stall-trigger time-window

The type value is either some (indicating that information about any stalled process is wanted) or full (limiting the information to full-system stalls where no process can run). stall-trigger indicates (in microseconds) the stall time that will trigger an event, and time-window is the time period over which that stall time happens. So, for example, writing:

    full 100000 1000000

will cause the monitor to trigger when the system stalls for a minimum of 100ms over any 1s period. The minimum time-window is 500ms, while the maximum is 10s. The stall-trigger can also be expressed as a percentage value; "10%" asks for a stall time that is 10% of the given time window.

Having requested a stall notification, the application can then pass the file descriptor to poll(). An exceptional condition (POLLPRI) event will be returned whenever a notification is generated. A monitoring system can thus be notified within a half-second of the system starting to become unresponsive and act to address the situation. There can be multiple processes monitoring the same stall information with different triggers and time windows. As is the case with the current pressure stall information, the new mechanism is aware of control groups; opening the relevant files within a memory control-group hierarchy will provide information on the members of that group only.
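
In Python terms, the monitoring loop described above might look something like the following sketch; it assumes a kernel carrying these (not yet merged) patches, and the open mode and trigger syntax could still change before anything lands.

    import os
    import select

    # Ask for a notification when the system is fully stalled for at least
    # 100ms within any one-second window (values are in microseconds).
    fd = os.open("/proc/pressure/memory", os.O_RDWR | os.O_NONBLOCK)
    os.write(fd, b"full 100000 1000000")

    poller = select.poll()
    poller.register(fd, select.POLLPRI)

    while True:
        for _, event in poller.poll():       # blocks until a trigger fires
            if event & select.POLLPRI:
                print("memory pressure threshold crossed; shed some load")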

The actual tracking of stall times has been kept simple to avoid adding to the load on the system. For each monitor, the accumulated stall time is checked ten times for each time window. If the current window is 50% past, the calculated stall value will be the time accumulated so far in this window, plus 50% of the total from the previous window. This mechanism assumes that the situation will not change hugely from one window to the next; the benefit is that it only has to store a single past value for each monitor. The monitoring is turned off entirely if no stall events are occurring, so its overhead should be zero on a lightly loaded system.

The end result, Baghdasaryan says, is good:

With these patches applied, Android can monitor for, and ward off, mounting memory shortages before they cause problems for the user. For example, using memory stall monitors in userspace low memory killer daemon (lmkd) we can detect mounting pressure and kill less important processes before [the] device becomes visibly sluggish.

The functionality provided by this patch set seems clearly worthwhile, but the code itself is going to need a bit of work yet. The biggest complaint came from Peter Zijlstra, who doesn't like the elimination of the "idle mode" that stops data collection entirely when little is going on. Keeping the collection running will prevent the system from going into its deepest idle states, which will not be good for power consumption. Some sort of solution to that problem will need to be found before this code can go upstream.

There were also some comments on the string-parsing code added by the patch set; it may be simplified by eliminating the percentage option described above. Beyond that, it seems clear that this is a welcome addition to the system's load-monitoring functionality. Chances are it will find its way upstream before too long. How long it will be stalled before finding its way into production handsets is rather less clear, of course.

Comments (1 posted)

Page editor: Jonathan Corbet


Copyright © 2019, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds