
Leading items

Welcome to the LWN.net Weekly Edition for May 3, 2018

This edition contains the following feature content, which is dominated by coverage of the Linux Storage, Filesystem, and Memory-Management Summit this week. LSFMM 2018 coverage is still in progress; watch the LSFMM 2018 page for updates.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Containers and license compliance

By Jake Edge
May 2, 2018

LLW

Containers are, of course, all the rage these days; in fact, during his 2018 Legal and Licensing Workshop (LLW) talk, Dirk Hohndel said with a grin that he hears "containers may take off". But, while containers are easy to set up and use, license compliance for containers is "incredibly hard". He has been spending "way too much time" thinking about container compliance recently and, beyond the standard "let's go shopping" solution to hard problems, has come up with some ideas. Hohndel is a longtime member of the FOSS community who is now the chief open source officer at VMware—a company that ships some container images.

He said that he would be using Docker in his examples, but he is not picking on Docker; it is simply a well-known container-management system. His talk is targeting those who want to ship an actual container image, rather than simply a Dockerfile that a customer would build into an image. He has heard of some companies trying to avoid "distributing" free and open-source software that way, but he is rather skeptical of that approach.

Docker "hello, world"

So he looked at the Docker equivalent of "hello, world"; he used Debian as the base and had it run the echo command for the string "Hello LLW2018". Running it in Docker gave the string as expected, but digging around under the hood was rather eye-opening. In order to make that run, the image contained 81 separate packages, "just to say 'hi'". It contains Bash, forty different libraries of various kinds including some for C++, and so on, he said. Beyond that, there is support for SELinux and audit, so the container must be "extremely secure in how it prints 'hello world'".

[Dirk Hohndel]

In reality, most containers are far more complex, of course. For example, it is fairly common for Dockerfiles to wget a binary of gosu ("Simple Go-based setuid+setgid+setgroups+exec") to install it. This is bad from a security perspective, but worse from a compliance perspective, Hohndel said.

People do "incredibly dumb stuff" in their Dockerfiles, including adding new repositories with higher priorities than the standard distribution repositories, then doing an update. That means the standard packages might be replaced with others from elsewhere. Once again, that is a security nightmare, but it may also mean that there is no source code available and/or that the license information is missing. This is not something he made up, he said, if you look at the Docker repositories, you will see this kind of thing all over; many will just copy their Dockerfiles from elsewhere.

Even the standard practices are somewhat questionable. Specifying "debian:stable" as the base could change what gets built between two runs. Updating to the latest packages (e.g. using "apt-get update") is good for the security of the system, but it means that you may get different package versions every time you rebuild. Information on versions can be extracted from the package database on most builds, though there are "pico containers" that remove that database in order to save space—making it impossible to know what is present in the image.

It gets worse

But it gets even worse, Hohndel said. Most people start with a Dockerfile they just find somewhere. If you look at the Dockerfile for Elasticsearch, for example, it installs gosu and uses the Dockerfile for OpenJDK 8, which in turn uses other Dockerfiles. One of those is for Debian "stretch", which also updates all of the packages.

There is a "rabbit hole" that you need to follow, Dockerfile to Dockerfile, to figure out what you are actually shipping. He has done a search of official Docker images and did not find a single one that follows compliance best practices. All of the Dockerfiles grab other Dockerfiles—on and on.

No one wants to hear about these problems, Hohndel said; he has tried. He is a big fan of free software, but not really a fan of enforcement; he would rather simply fix the problems. But in order to fix them, people have to understand and care about compliance. He has been to KubeCon, and will be going again soon, trying to educate folks about these problems. At one of the talks, he asked how many copyleft packages were in a particular Docker image, but he just got blank stares.

In the container image for an uncomplicated three-tier application, he counted 650 packages. The problem is only getting worse, he said. It is "incredibly hard" to get compliance right if it is done at build time, but it is "pretty much impossible" to do after that point. It is important to get people to understand that the complexity of what they are shipping in containers is much greater than what a few simple commands might indicate.

The problems with container images are many. It is hard to figure out which packages are included in the build. The version and which patches are applied are also difficult to determine. Beyond that, the licenses under which those packages are distributed are not obvious. He has seen containers that try to save space by statically linking various pieces that may not be linkable based on their licenses.

The tooling that the industry has developed makes it quick and easy to throw together an image. But it also, "hopefully unintentionally", makes it easy to create a "total compliance nightmare", Hohndel said.

What should be done

Telling people to stop shipping containers is not going to work, so another approach is needed. Containers need to be built starting from a base that has known-good package versions, corresponding source code, and licenses. The anti-pattern of installing stuff from random internet locations needs to be avoided. And software developers need to be trained about the pitfalls of the container build systems; that should not be hard, but it is.

Any layers that will be added on top of the base need to be tracked as well. The versions, source location, and licenses should all be stored and a source-code management system should be used to track the information over time. One way to do so is to annotate the Dockerfiles with the meta information about the packages, though creating these annotations is hard, he said.

VMware has started the Tern project to help automate the creation of a bill of materials (BOM) for a container image. It will determine what packages are present in the image from the Dockerfile, but it also understands some of the commands that are used in Dockerfiles to retrieve and install packages, so it can track those too. It is a work in progress, Hohndel said, but may be helpful for container compliance.

[I would like to thank the LLW Platinum sponsors, Intel, the Linux Foundation, and Red Hat, for their travel assistance support to Barcelona for the conference.]

Comments (18 posted)

Willy's memory-management to-do list

By Jonathan Corbet
April 30, 2018

LSFMM
Matthew "Willy" Wilcox has been doing a fair amount of work in the memory-management area recently. He showed up at the 2018 Linux Storage, Filesystem, and Memory-Management Summit with a list of discussion topics related to that work; it was enough to fill a plenary session with some spillover into the memory-management track the next day. Some of his topics were fairly straightforward; others look to be somewhat more involved.

He started the plenary session by noting that the "vm_fault_t controversy" turned out to be rather more involved than he had expected; he seemed to be referring to a disconnected series of patches (example) creating a new vm_fault_t typedef for page-fault handlers. He has been busy trying to run the resulting changes through the relevant maintainers, but it has been some work; he didn't realize, he said, that the filesystem developers would be so "belligerent" about wanting to see the full series — which doesn't exist yet. In any case, he said, this is a boring topic; the room seemed to agree, so he moved on.

He then put up an example of code performing a memory allocation, and pointed out that it contained several bugs, including a missing overflow check and a lack of type checking. Bugs like this are fairly common in the kernel. He proposes to handle that use case with a new helper called kvmalloc_struct(), and is looking for feedback. The room didn't seem to find this topic to be worth arguing about either; Ted Ts'o finally suggested that Wilcox should "paint it blue".
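
His slide was not reproduced here, but the buggy pattern is presumably the familiar one sketched below; the kvmalloc_struct() line is only an illustration of how such a helper might be called, since the proposed interface was not spelled out in the session.

    struct widget_table {
        unsigned int count;
        struct widget entries[];    /* flexible array member */
    };

    struct widget_table *tbl;

    /* Error-prone: the multiplication can silently overflow, and nothing
       ties the size calculation to the element type of entries[]. */
    tbl = kvmalloc(sizeof(*tbl) + count * sizeof(struct widget), GFP_KERNEL);

    /* Hypothetical call to the proposed helper, which would derive the
       sizes from the types involved and check the arithmetic for overflow. */
    tbl = kvmalloc_struct(tbl, entries, count, GFP_KERNEL);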

He then called for the addition of malloc() and free() to the kernel API. A call to malloc() would turn into a kvmalloc() call with the GFP_KERNEL flags. His purpose is to make it easier for new developers to write drivers by providing something that is more similar to the normal C API. There did not seem to be a lot of support for this idea from the group, though.
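
A minimal sketch of what such wrappers might look like, assuming (as he described) that they simply delegate to kvmalloc() and kvfree():

    static inline void *malloc(size_t size)
    {
        return kvmalloc(size, GFP_KERNEL);
    }

    static inline void free(void *ptr)
    {
        kvfree(ptr);
    }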

If an application uses mmap() to map the same page four-billion times, the map count in the kernel will overflow, with all of the undesirable effects that come from a reference-count overflow. Getting to this point is not easy; one needs a machine with 30GB of RAM to be able to do it. He has posted a fix for the problem; it simply kills any process that has tried to map the same page more than 5,000 times. Andrew Morton suggested that the alternative is to just leak the page.

There are two ways to get huge pages in user space (hugetlbfs and transparent huge pages), and they use the page cache differently; Wilcox would like to unify them. Hugetlbfs indexes pages in multiples of 2MB, while transparent huge pages use a normal 4KB offset. He would like to make hugetlbfs less special by using 4KB offsets there too. The only problem is a big performance hit, since there are many more entries in the radix tree; that makes this approach unworkable. So a solution he intends to pursue instead is to change the transparent huge pages implementation to use the multi-size features of his XArray mechanism, making it more closely match hugetlbfs.

Then, he would like to enhance the page cache to allow the use of block sizes that are bigger than the system page size. He thinks it can be done without requiring higher-order allocations, which has been a sticking point in the past. In short, the memory-management subsystem would inform the filesystem when a page fault has occurred and ask the filesystem to take care of populating the page cache with the needed pages. The filesystem can do that with normal 4KB pages; better performance will be had if it attempts a larger allocation first.

Dave Chinner pointed out that there were working patches for larger block sizes in 2007; they used compound pages, and were not accepted due to the high-order allocation issues. We have been here before, he said, and know how it works. Have high-order allocations been fixed in the meantime? Wilcox answered that the difference this time around is the fallback path that is implemented within the filesystems. Chinner worried that this idea didn't sound reliable; in particular, there could be problems (as usual) with truncate(). Wilcox answered that much of the work could be done once in the virtual filesystem layer and, hopefully, made to work reliably.

He also briefly mentioned the idea of teaching the page cache about holes (ranges with no blocks allocated) in files. Currently those are represented by zero-filled pages in the cache if need be. Replacing those with "zero entries" could save a significant amount of memory; an actual page would only need to be allocated in the event of a write operation.

There was also a brief discussion of "PFN entries" in the page cache. Currently, page-cache entries include a pointer to the page structure representing the page in memory. That structure will point to a specific mapping (such as the file containing the page). If you want to share pages in memory that are shared on disk (in a couple of reflinked files, for example), the same page will have different mappings depending on where it came from. In that case, putting that pointer in the cache is going to lead to trouble, so he proposes using the page-frame number instead. There would still be page structures backing the whole thing up, but there would be an extra level of indirection to access them.

Finally, he said that he would like to get rid of the GFP_NOFS allocation flag, which tells the system that it cannot call into filesystem code to free memory. Instead, "scoped allocation", which simply tracks when filesystem code is holding a lock (and thus cannot be called back into) should be used. XFS is the closest to having implemented scoped allocation, he said, but there are still places where GFP_NOFS is used. This work is not currently making good progress.
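
The scoped approach builds on the memalloc_nofs_save() and memalloc_nofs_restore() helpers that already exist in the kernel; a rough sketch of how a filesystem would use them instead of sprinkling GFP_NOFS around looks like this:

    unsigned int nofs_flags;

    nofs_flags = memalloc_nofs_save();    /* e.g. on entering a transaction */

    /* Any allocation in this scope is implicitly treated as GFP_NOFS,
       so callers can simply ask for GFP_KERNEL. */
    buf = kmalloc(size, GFP_KERNEL);

    memalloc_nofs_restore(nofs_flags);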

Ts'o said that some good documentation would help; he has been trying to push this work forward, but has run into some questions. Chinner warned that this work has to be done carefully; GFP_NOFS is often used to silence the lockdep checker rather than out of a real need to avoid filesystem calls. He suggested adding a GFP_NOLOCKDEP flag for that purpose. Meanwhile, these call sites are hard to identify, since they are almost never documented as such.

The plenary session came to an end at this point, but Wilcox had not yet run out of ideas to run by the development community.

Cleaning up struct page

The page structure is one of the most complicated in the kernel; the curious are encouraged to have a look at its definition. Each page structure tracks one page of physical memory; it is used differently depending on how the page itself is used. As a result of varying needs and of the need to keep the structure small (current systems have millions of them), struct page has become a difficult-to-follow mixture of structures, unions, and #ifdefs. Few developers dare to try to make changes there.

Wilcox posted a set of diagrams showing how the various fields of struct page are used now. When a kernel subsystem allocates a page, he said, it also gets access to the page structure to keep track of it. But that access is not exclusive. The refcount field can go up and down at any time, even if the allocating subsystem thinks it has exclusive access to the page. If the page is mapped into user space, the mapping field will be used for reverse mapping. Various flags have special meanings, and so on.

In the end, a lot of users simply don't bother trying to store information in struct page even though there is space available there; it simply looks too complicated, and it is not at all clear which fields are safe to modify. That is, he said, "a shame".

To make kernel development less shameful, he is proposing a reorganization of the page structure to make it more comprehensible. The fields that are safe to touch have been moved together, resulting in five contiguous words of available memory. The complex arrangement of structs and unions has been replaced with a single set of unions, each containing a set of structs or simple types.

There was some discussion about the details of specific fields, and it was established that drivers could safely use the mem_cgroup field. In general, though, everybody seemed to feel that the proposal was a major improvement that made struct page much easier to understand. Wilcox promised that a patch set making these changes would be forthcoming soon.

Comments (12 posted)

Repurposing page->mapping

By Jonathan Corbet
April 26, 2018

LSFMM
The page structure is one of the most complex in the kernel due to the need to cram the maximum amount of information into as little space as possible. Each field is so heavily overloaded that developers prefer to avoid making changes to struct page if they can avoid it. That didn't deter Jérôme Glisse from proposing a significant change during two plenary sessions at the 2018 Linux Storage, Filesystem, and Memory-Management Summit, though. There are some interesting benefits on offer, but getting there will not be a simple task.

The mapping field of struct page describes where the page came from. For page-cache pages, mapping points to the address_space structure identifying the file the page belongs to; anonymous pages use mapping to map back to the anon_vma structure. For pages used by the kernel itself, the mapping field can be used by the slab allocator. Like everything else in this structure, mapping is a complicated field with a number of different interpretations depending on how the page is being used at any given time.

Glisse has his own designs on that field, but first he must find a way to eliminate its current uses. Most of the time, code that is working with a page structure has found that structure by way of a virtual memory area (VMA) or a file structure; in either case, the mapping information can be obtained from those structures. In the contexts where that information is available, there is no need to store it in the page structure itself; it can be replaced by changing interfaces to pass the mapping information down the call chain. Doing so allows him to eliminate most uses of mapping and use that space for other purposes.

In particular, he is looking at using that field to attach a structure for threads that are waiting on the page. Currently, waiting on specific pages is done with a set of 256 shared wait queues; replacing those queues would make the wakeup process faster in cases where the queues get long. In the normal case, when nobody is waiting on a page, the mapping field would point to a structure like:

    struct page_mapping {
        struct address_space *mapping;
        unsigned long flags;
    };

Essentially, this mechanism is adding a layer of indirection for access to the mapping information, and adding some flags for good measure. When somebody needs to wait on that page, though, this structure would be replaced with:

    struct page_wait {
        struct wait_queue_head wq[PAGE_WRITABLE_BITS];
        struct wait_queue_entry entry;
        struct page_mapping base;
        spinlock_t lock;
        bool own;
    };

When this substitution is done, the pointer to the page_wait structure has its least significant bit set to flag the change. Any code needing the old mapping field would need to notice that change and follow the pointers through another level of indirection to get that information. This situation would persist until the last waiter is removed; at that point, the pointer to the page_mapping structure would be restored.
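
A hypothetical helper shows how code reading the field would cope with that overloading; the names here are illustrative, since Glisse's actual patches were not shown:

    static struct address_space *get_mapping(struct page *page)
    {
        unsigned long val = (unsigned long)page->mapping;

        if (val & 1) {
            /* Low bit set: a page_wait structure is attached. */
            struct page_wait *pw = (struct page_wait *)(val & ~1UL);

            return pw->base.mapping;
        }
        return ((struct page_mapping *)val)->mapping;
    }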

Hugh Dickins jumped in to ask Glisse what problem he was trying to solve with this change. Glisse responded that the point was to address the length of the shared wait queues, which have been growing over time. Dickins said that the problem could be solved more simply by just adding more queues. Kent Overstreet said that perhaps rhashtables could be used for this purpose instead. He is fully in favor of eliminating the mapping field, he said, but it would be a shame to replace that field with something else. But Glisse said his goal is not to fully remove the field; he really just wants to make it easier to attach structures to the page structure.

Matthew Wilcox suggested that perhaps the private field could be used for this kind of structure attachment, but it seems that there are already too many other uses of that field. Chris Mason asserted forcefully (and humorously) that "private is mine!".

Dickins said that there could be some value in generalizing the mapping field, but that using it for waiting on pages in particular is "a bit peculiar". Glisse responded that this functionality comes "almost for free" with little code required. But Dickins insisted that mapping says a lot about the identity of any given page; if it is replaced with something else, the page loses that identity information. He described this mechanism as an "odd misuse" to add some transient information.

Rik van Riel asked about how synchronization between page-lookup and truncate operations is handled in this scheme. Truncation needs mapping, Glisse said, but it's one of only a few places in the kernel that do. Making things work is simply a matter of adding awareness of the new scheme in those places. Dave Chinner said that XFS checks for a null mapping value to handle truncation; that will break in the new scheme. Glisse suggested putting in a new helper, but Overstreet said that, instead, the time has come to find a better way to handle locking around truncate operations.

Dan Williams repeated the question of why this pointer was really needed; this time Glisse said that he needs it for page write protection, a topic slated to be discussed on the following day. He needs to set a pointer inside of struct page, and mapping happens to be the easiest one to grab.

Glisse concluded the session by acknowledging that most developers seemed to feel that this change "looks ugly". He will post the patches in the near future anyway, though, and see what the reaction is at that time.

Generic write protection

The group was not yet done with mapping, though; Glisse led another session on the topic in a plenary session the following day, where he delved deeper into the motivation for this work. There are a number of situations where nominally writable memory must be globally write-protected for a time. One possible use case is "kernel duplicated memory", where memory pages are duplicated across the system for performance. In a system with multiple GPUs, it can be worthwhile to have duplicate copies of input data in multiple pages so that each GPU can access that data with its full available bandwidth. Another is PCIe atomic transactions — 256-byte transactions that must wait for an acknowledgment from the controller, which can be a slow process. It can be made quite a bit faster, though, if the memory is write-protected on the host; the GPU (or other remote processor) can then do its work without using slow atomic operations.

Also, in general, he wanted to minimize the reliance on the mapping field and open up some space within the page structure for other uses.

Getting there requires putting together a comprehensive picture of where mapping is used now and which alternatives may exist. For example, file-related system calls use it, but they also are given the file (and thus the mapping) as a parameter, so they don't need to obtain it from the page structure. Similarly, memory-related system calls have access to the virtual memory area (VMA) being operated on; the mapping can be found there. A bit trickier might be getting at the mapping information from the BIO structures used for block I/O. Glisse said that he expects the relationship between a BIO and the mapping to be unchanging, but a BIO can actually contain pages from multiple files at times. It looks like the problem can be solved, but it may involve storing the mapping information in the BIO structure directly.

Once most users of mapping have been redirected, that field can be replaced with a pointer to a new structure, adding a level of indirection. The kernel same-page merging (KSM) mechanism does that already, as it turns out. So the first step might be to have mapping point to a page_attachment structure (different from the page_mapping structure shown the day before) that would replace the KSM mechanism and make it more generic.

Boaz Harrosh repeated the question from the day before about why mapping is being targeted rather than private. Glisse responded that private is used in "funny ways"; it's not always easy to know what is going on. It is much easier to just remove the users of mapping.

Dickins agreed that mapping is the right field to target. But he suggested that, if it is used in so few places, why not just put the replacement structure in a stable location rather than attaching it to the page structure? The answer seems to be that it is never clear how long a particular structure will end up being attached to the mapping field. But Dickins insisted that it could make sense to have mapping point to an always-useful structure. He (again) said that mapping tracks the identity of a page, so it should be replaced by a structure that still handles that function.

When the anon_vma mechanism was added (to allow the mapping of anonymous pages back to the tasks that reference them), he continued, the use of the mapping field was "fudged" to accommodate it. KSM then fudged it some more. Developers have always wanted to avoid changing every filesystem in the kernel, so they have taken the easy route out. But, he said, if Glisse is willing to do the work to change this field entirely, he should get rid of the "peculiarity" surrounding it.

David Howells asked if, instead, the address_space structure could be eliminated entirely, with the necessary information being put back into the inode structure. Glisse said that he would like to see that happen, but that the prospect of doing it is daunting. His plan is to create the feature he wants first, and maybe look at bigger tasks like this later.

The path forward

Glisse laid out his plan for upstreaming this work; it is designed to maximize the confidence in the changes before depending on them. The first step would be to replace every reference to mapping with a call to a helper function. Then the low-level functions that use mapping will be modified to take that information as a parameter. This change will be done with a tool like Coccinelle, and the value of the parameter would be NULL at the outset; the code would continue to use the mapping field and would ignore the new parameter.

In a subsequent development cycle, filesystems would be changed to pass the mapping information down to where it's needed; the memory-management system calls would see similar changes. At some point in this process, the KSM mechanism would be converted to the more generic approach.

Once this work is complete, it should be possible to avoid using the mapping in almost all situations. But, for a couple of releases, the code would continue to use mapping while checking to ensure that it matches the value passed down the call stack. This should build confidence that the conversion has been done properly — or point out the places where it has not. Once that confidence reaches a sufficiently high level, the final step could be taken and the mapping field would no longer be used.

Harrosh said that there may well be places where the two mapping values do not match. That could be the result of bugs, or places where, for whatever reason, the real mapping is not the one stored in the page structure. But Glisse plans to put tests in the places where the mapping field is actually used and, thus, is expected to be valid.

Dickins worried that this work sounds like a large amount of churn; he said he would need to know more about the benefits that will result. It would, he said, require "a resounding 'yes'" from the filesystem developers. Mason said that making it easier to share pages between files is "a big deal", so this is an important feature for him. As the subsequent discussion showed, though, there are a number of challenges to be overcome before it becomes possible to share pages between files; that topic was eventually set aside as something for the filesystem developers to work out at some other time.

Dickins also suggested that, if the real objective is to globally write-protect pages, a new page flag might be a better solution to the problem. Williams, instead, said that the real problem is user space writing to pages when they need to be exclusively owned by a device like the GPU. This solution might be seen as surrendering to bad behavior from user space, when the right thing to do is to push back and say "don't do that". Glisse replied that changing user space is not possible; it may be a ten-year-old program using a library that has been updated to accelerate operations with the GPU. A solution to that problem, Williams said, is to require applications to migrate to a newer library if they want the newer performance options.

Josef Bacik said that the proposed mechanism solved a mapping-related problem that he had run into, so he would be happy to see it go in. Overstreet agreed that a number of use cases would benefit from the change. Dave Hansen said that the benefits could go beyond GPUs to other devices that have their own memory. Dickins said that he had no reservations about the objective, but he still wasn't sure that changing mapping was the right way to get there. But he got the sense that the filesystem people were glad to have a sucker who is willing to do this work, and that they would prefer that the memory-management people not obstruct it.

Wilcox noted that some of these problems have been solved before; SGI was doing binary text replication years ago. That prompted Johannes Weiner to observe that, if the group has begun trading SGI anecdotes, it must be time to wrap up the discussion.

Comments (3 posted)

Heterogeneous memory management and MMU notifiers

By Jonathan Corbet
April 27, 2018

LSFMM
Heterogeneous memory management (HMM) is a relatively new kernel subsystem that allows the system to manage peripherals (such as graphics processors) that have their own memory-management units. In two sessions during the memory-management track of the 2018 Linux Storage, Filesystem, and Memory-Management Summit, HMM creator Jérôme Glisse provided an update on the status of this subsystem and where it is going, along with a more detailed look at the memory-management unit (MMU) notifiers mechanism on which it depends.

An HMM update

Glisse started by noting that an RFC patch adding HMM support to the Nouveau driver (for NVIDIA graphics processors) has been posted, with a second version coming soon. He is hoping to convert more GPU drivers to HMM in the near future; it is a better solution, he said, than using get_user_pages() to pin user-space pages into memory. Beyond the advantage of not pinning pages, HMM can help improve GPU performance by allowing the CPU to create a complex data structure using pointers that are valid in both CPU and GPU space. The CPU can then pass this structure to the GPU without having to recreate it using GPU-space pointers.

There are also some vendors looking into using HMM to manage device-private memory; AMD is likely to use HMM with its next-generation hardware.

Glisse had a question for the group: he noticed that core dumps do not take the mmap_sem lock before walking through (and dumping) user-space memory. That can lead to surprises when HMM is in use, since GPU threads could be accessing this memory while it is being dumped. Is this the expected behavior? Hugh Dickins said that he had run across that behavior recently, and was "horrified" to see it. Kirill Shutemov, though, said that the expectation is that, by the time it comes to dumping core, there will be no other users of the address space. Process exit might have similar issues, he thought, but Glisse said that the MMU notifier calls will have caused any HMM devices to back off earlier in the exit path.

Dickins said that everybody hates mmap_sem and is looking forward to removing it. That leads to a tendency to avoid taking it in situations where it is thought that there will be no concurrent accesses anyway. But those are just the situations where there is no overhead to taking mmap_sem in the first place. Memory-management developers are just being silly by trying to avoid taking it, he said; it just causes surprises. That part of the discussion ended with Glisse saying he would post a patch adding mmap_sem to the core-dump path.

Glisse went on to say that he has a set of MMU-notifier patches pending. In the process of writing them, he has noticed that a number of users of get_user_pages() are broken. In particular, these users assume that pages are fixed in place when they can, in fact, be swapped out from underneath the device that is accessing them. Files that are being truncated are the most common case. If a file is truncated then extended again, a driver using get_user_pages() will continue using the old pages, while HMM drivers using MMU notifiers will do the right thing.

Matthew Wilcox responded that nobody expects DMA to pages that have been truncated to work, so there is no need to fix this behavior unless it is a security problem. Dan Williams agreed, saying that changing all of those drivers to fix something that isn't really a problem does not seem worthwhile. But Glisse said that each driver does things differently, and moving them to common code results in removing hundreds of lines from each driver, so he is likely to persist.

As time ran out, a couple of other topics were raised briefly. Glisse would like to see a new DMA API that is designed to share mappings between multiple I/O memory-management units. There are also some issues around migrating file-backed pages to a device. This activity must be coordinated with filesystems, and with the writeback activity in particular, since these pages will become inaccessible to the GPU while the device is working with them.

MMU notifiers

When the kernel makes changes to a process's address space, it is able to keep the system's MMU in sync with those changes at the same time. Things get trickier, though, if one or more peripheral devices also have MMUs that must be managed. The answer is the "MMU notifier" mechanism, which allows the HMM code to get a callback from the memory-management subsystem when important changes are made. But it turns out that some changes are more important than others, so Glisse would like to adjust the MMU-notifier API so that the HMM code can tell the difference.

An MMU notifier is called whenever one or more pages in the address space of interest are invalidated, but the notifier is not given any information about why this invalidation is happening. To rectify that, he suggested adding a new argument that would give the invalidation reason. If the memory is being unmapped entirely, for example, HMM will respond by freeing all of its data structures associated with the memory range. If the memory protections are being changed, instead, there are no changes to the physical pages, and thus no caching issues to deal with. If a page's physical address changes, mappings in the peripheral device must change as well. Some changes, such as clearing the soft-dirty bit, only affect the host and don't require any response from HMM at all. And so on.
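
The new argument might look something like the enumeration below; the names are illustrative rather than taken from the actual patches.

    enum mmu_notifier_event {
        MMU_NOTIFY_UNMAP,         /* range going away; free device structures */
        MMU_NOTIFY_PROTECTION,    /* permissions change, same physical pages */
        MMU_NOTIFY_MIGRATE,       /* a page's physical address is changing */
        MMU_NOTIFY_SOFT_DIRTY,    /* host-only bookkeeping; no device action */
    };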

In response to a question from Michal Hocko, Glisse said that the MMU notifiers only fill an advisory role at the moment; they provide information that the HMM subsystem can use. In the future, the notifiers might take significant actions, such as preempting the device, when notifications arrive.

Dave Hansen worried that adding complexity to the notifiers will make it harder for developers to know what to do when they make core memory-management changes. Under this scheme, it would become harder to know which argument to pass to notifiers in new code; the fear of breaking something in HMM would always be there. That would make the memory-management code harder to maintain. Glisse said that the safe option would be to indicate a full unmap when in doubt; it would lead to the worst performance, but the results would be correct.

Hansen suggested splitting the suggested event types into a lower-level event mask describing what is actually going on. One bit would indicate a change to a page's physical address, another that the virtual-memory area behind the page is being removed, another a software-state change, and so on. Glisse said that this approach would be workable, though it would provide more information than he really needs now. The conversation went in circles for some time on what the specific bits might be with no clear conclusion. It seems fairly clear that this patch will use the bitmask approach, though, when it is posted in the future.

As things wound down, Andrew Morton noted that there had been some disagreements in the past over whether MMU notifiers are allowed to sleep. He asked: has all of that been sorted out? Glisse responded that all notifiers are allowed to sleep at this point.

Comments (2 posted)

Exposing storage devices as memory

By Jonathan Corbet
April 27, 2018

LSFMM
Storage devices are in a period of extensive change. As they get faster and become byte-addressable by the CPU, they tend to look increasingly like ordinary memory. But they aren't memory, so it still isn't clear what the best model for accessing them should be. Adam Manzanares led a session during the memory-management track of the 2018 Linux Storage, Filesystem, and Memory-Management Summit, where his proposal of a new access mechanism ran into some skepticism.

As things stand now, he said, there are two ways of exposing a storage device as memory: adding it as a swap device, or allowing applications to directly map the device into their address space. But, as memory-like as these devices are, they still do not have the same access times as dynamic RAM and, despite improvements, they still do not have the same write endurance. So he is not sure that making them look like ordinary memory is the correct model.

He has a different idea: extend the virtual-memory approach to provide access to memory-like storage devices. Each device would have some ordinary RAM allocated for it; that memory would be used as a cache for data stored in the device itself. The mechanism is represented in the kernel by a simple character device (/dev/umap) that would allow an application to map a portion of the storage device's space into its address space. The primary purpose of /dev/umap is to catch page faults; it can respond to them by moving data between the RAM cache and the storage device.
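
From user space, the interface would presumably look like an ordinary mapping of the character device. The sketch below is built on that assumption (region_len and region_offset are placeholders); the actual API details were not presented.

    #include <fcntl.h>
    #include <sys/mman.h>

    int fd = open("/dev/umap", O_RDWR);
    void *data = mmap(NULL, region_len, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, region_offset);
    /* Faults on this range would then be handled by the umap driver,
       which moves data between the RAM cache and the storage device. */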

The developers in the room thought that this arrangement looked a lot like using the storage device as a swap area; Manzanares responded that he didn't want to use the existing swap mechanism because it depends on the block layer, which he thinks is expensive and wants to avoid. Just using the page cache also lacks appeal, he said; it is too memory-intensive and does not give applications enough control. As it is, he said, /dev/umap outperforms both the memory-mapping and swapping alternatives.

Dan Williams suggested that, rather than creating a completely new mechanism for accessing this memory, it would be better to start with the existing filesystem APIs and the page cache and, if necessary, trim them down. Dave Hansen said that the kernel has a hard time just managing the memory types it supports now; adding another is unlikely to make things better. The discussion circled around these points for a while but, as the session came to a close, it was clear that /dev/umap would have a hard time getting upstream in its current form.

Comments (3 posted)

Rethinking NUMA

By Jonathan Corbet
April 27, 2018

LSFMM
The non-uniform memory architecture (NUMA) was designed around the idea that there are two types of memory on complex systems: local (faster) and remote (slower). During the memory-management track of the 2018 Linux Storage, Filesystem, and Memory-Management Summit, Anshuman Khandual asserted that the situation has since become rather more complicated. Perhaps, he said, the time has come to rethink how we view NUMA systems.

On upcoming hardware, Khandual said, there are memory interfaces that can deal with numerous types of memory, all of which ends up looking like DRAM. Memory can vary in parameters like bandwidth, persistence, latency, and power consumption. Applications may want to take advantage of this diversity by, for example, putting an important but rarely accessed data structure in low-power memory, reserving the faster (and more power-hungry) memory for data that must be closer to hand. A lot of this kind of control can be achieved now with mmap(), but that approach leaves no room for integration with the memory-management subsystem. For example, there is no migration of pages between different memory types in response to the system workload. Things would work better if the kernel had a better understanding of memory attributes.

There is, he said, an existing solution using the current NUMA abstraction: create new nodes to hold slower memory and set the distance value accordingly. Applications can then map memory into those zones if they want, but pages should not end up there by default. Special memory should not be used like normal memory. Matthew Wilcox replied that this might not always be the case; if the alternative is to put the system into an out-of-memory panic, it might be better to allocate pages from a slow zone. That is what happens now, Khandual said; with sufficient memory pressure, pages will be placed in the special zones — but, depending on the nature of those zones, that might not be desirable.
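
Explicitly asking for memory from one of those special nodes is already possible with the existing NUMA API; a minimal user-space sketch using mmap() and mbind(), assuming the slow memory shows up as node 2:

    #include <sys/mman.h>
    #include <numaif.h>

    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    unsigned long nodemask = 1UL << 2;    /* the hypothetical slow node */

    /* Bind the range so that its pages are allocated from the slow node. */
    mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0);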

Khandual suggested keeping more attribute information inside NUMA zones, perhaps tagging the memory with different zone or migration types. That would help to prevent implicit allocations in those zones. Beyond unwanted spillover, the simple fact is that node distance is not enough to capture the differences between different types of memory. For example, Jérôme Glisse said, a system with two GPUs may have a faster link between them; memory allocations on one GPU should fall back to the other if need be, but there is no way to express that in the kernel now.

If some way is found to encapsulate memory attributes into NUMA nodes, there still needs to be a way to get that information out to user space so applications can make use of it. There was talk of a new sysfs interface, but there were also worries that it could grow too large on a system with a lot of nodes. Perhaps what is needed, Khandual suggested, is a new API to request memory with specific attributes.

That suggestion concerned Dave Hansen, who said that this kind of API would require a lot of thought and is fraught with pitfalls. The original plans for NUMA support included a lot of options, but most of them turned out not to be needed in the real world. We are, he said, terrible at designing interfaces in general; there is no way that we would get it right when the hardware we are designing for is not even available yet. Instead, he said, the thing to do is to find the places where the current NUMA interface isn't working now, then build a case for small additions to the API when they make sense. But, to the extent that it is possible, it would be better to rely on the existing APIs for now.

The session concluded with a warning to Khandual that, in typical memory-management fashion, he would be invited to the next five annual LSFMM events to give reports on how the work is progressing.

Comments (2 posted)

The memory-management development process

By Jonathan Corbet
April 27, 2018

LSFMM
The memory-management subsystem is maintained by a small but dedicated group of developers. How healthy is that development community? Michal Hocko raised that question during the memory-management track at the 2018 Linux Storage, Filesystem, and Memory-Management Summit. Hocko is worried, but it appears that his concerns are not universally felt.

Hocko started by saying that he wanted a continuation of the development-process discussion held at the 2017 summit. He continues to be concerned about the amount of review that memory-management patches are receiving; by his count, about half of the patches being merged have not been seriously reviewed. He wondered whether the process is truly healthy, or whether it is putting too much load onto Andrew Morton to review everything. Perhaps, he suggested, it is time to move to a more hierarchical maintenance scheme where developers would take responsibility for specific parts of the memory-management problem and free up Morton's time for more high-level concerns.

Morton, though, replied that the community is doing well overall. The review rate for memory-management patches in the last merge window was, he said, 100%. He did confess, though, that this rate was achieved only because he started adding Reviewed-by tags of his own for the first time. The group could, he said, require Reviewed-by tags on everything over the long term but, in his experience, the presence of that tag is not a reliable indicator that a quality review has actually been done.

In any case, he said, memory management is doing better than many other parts of the kernel.

In the 2017 discussion, Morton said, he had told the group that any developer should inform him if they want to review a patch but don't have the time to do it right away. He would then hold the patch until the review could be done. But absolutely nobody has taken him up on that offer. There are, he said, about 120 memory-management patches going into the mainline in each development cycle; that is not enough to be worth the trouble to split up among multiple maintainers. Memory-management is doing well because its developers are generalists; splitting things up could instead encourage the creation of silos.

Hocko complained about Morton's longstanding practice of accumulating fixes to patches rather than replacing broken patches with better alternatives. See, for example, the March 28 mmotm tree, which includes these patches:

    mm-initialize-pages-on-demand-during-boot.patch
    mm-initialize-pages-on-demand-during-boot-fix-3.patch
    mm-initialize-pages-on-demand-during-boot-fix-4.patch
    mm-initialize-pages-on-demand-during-boot-fix-4-fix.patch
    mm-initialize-pages-on-demand-during-boot-v5.patch
    mm-initialize-pages-on-demand-during-boot-v5-fix.patch
    mm-initialize-pages-on-demand-during-boot-v6.patch
    mm-initialize-pages-on-demand-during-boot-v6-checkpatch-fixes.patch

When faced with an accumulation of patches like that, Hocko said, it is easy to lose track of the overall picture. Morton replied that if he were using Git (as many have often suggested) those fixes would be there forever. The answer to that was that he could rebase things, or just wait until patches are in better shape before accepting them.

On that last point, Hocko said that there is usually no rush to get patches into the -mm tree, so it might not hurt to wait until patches improve. But Hugh Dickins said that acceptance into -mm is when patches really get tested. Morton said that he sees his role as being to help developers get patches into the mainline; if a patch has benefits, he feels he should bring it in, even if it's still shaky. He has never agreed with the "a maintainer's job is to say 'no'" philosophy. Dickins said that was useful, but there is also a need for somebody to say "no" sometimes. Morton is good, he said to laughter, while Hocko is evil (he was referring to this recent exchange on the list).

Returning to the "-fix" patches, Morton asked whether he should just merge them into the patches they are fixing. Johannes Weiner replied that it depends on the situation. For things like typo fixes, they might as well just be folded into the original. Complex fixes, though, should probably be kept separate.

Morton brought up the "HMM problem", referring to the heterogeneous memory-management patches which went through 24 revisions on the list without being seriously reviewed. Rik van Riel said that the problem is that HMM was a new feature that nobody really understood, so nobody knew how to review it. Morton replied that, in the end, he merged it as "an act of faith"; it seems to be working so far.

Hocko said that the lack of a Git tree for memory management makes life harder. It would be good, he said, to have a tree that contains the work destined for the next merge window. Morton said that he could do that if it would help. Hocko said that it would provide a stable base for developers to work against; at the moment, they tend to base memory-management patches on any of a number of random trees. Having a Git tree would also make it easier to add submaintainers, should the group decide to do so. He repeated that this would be a good thing to do since it would take some of the load off of Morton, who replied that reviewing memory-management patches is what he's paid to do, and it's not overwhelming him. There are, he said, roughly ten major patches to review each week, which is not too heavy a burden.

Hocko insisted, though, that there are large features in the memory-management area that are not moving forward. There is also, he said, a lack of an overall vision for the subsystem. He cited memory hotplug as a particular problem area; "people just add hacks" and nobody actually wants to do anything with that code at this point if they can avoid it. But Morton repeated that memory management is doing well, with the code being stable overall and an improving review rate.

Dan Williams put out a request for more employers to support their developers spending time to review code by others. Dave Hansen asked how the group felt about Reviewed-by tags from people working for the same company as the submitter of the patch. That is something he tends not to do, since it "feels incestuous". Matthew Wilcox said that each developer needs to come to their own conclusion on how much weight to give such reviews; their quality can vary considerably. Hocko said he tends to be skeptical of first-post patches with five reviews from people he has never heard of before.

As this session (the last one on Monday) wound down and beer was beckoning, Laura Abbott asked whether memory management could benefit from a patchwork instance; van Riel said that it could help developers find patches to review. Hocko repeated that adoption of a Git tree would help with a lot of things in general. Whether these things will happen, or whether they will simply return as topics in the 2019 LSFMM Summit, remains to be seen.

Comments (2 posted)

The trouble with get_user_pages()

By Jonathan Corbet
April 30, 2018

LSFMM
When kernel code needs to work directly with user-space pages, it often calls get_user_pages() (or one of several variants) to fault those pages into RAM and pin them there. This function is not entirely easy to use, though, and recent changes have made it harder to use safely. Jan Kara and Dan Williams led a plenary session at the 2018 Linux Storage, Filesystem, and Memory-Management Summit to discuss potential solutions, but it is not entirely clear that any were found.

Kara started by saying that he just spent half a year chasing down reports of kernel crashes; now that he has found the reason, he's not sure what to do about it. It comes down to how get_user_pages() is used. When it is called, it will translate user-space virtual addresses to physical addresses and ensure that the pages are in memory. Typically the caller will then perform some sort of I/O on those pages. There are a number of mechanisms by which this is done, but it all comes down to passing the addresses of the pages to the devices. When the I/O is complete, the kernel calls set_page_dirty() to mark the pages as dirty and releases its references to the pages.
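
The usual pattern looks roughly like the sketch below (simplified; NPAGES and user_addr are placeholders, and the exact get_user_pages() variant and its arguments vary between callers and kernel versions):

    struct page *pages[NPAGES];
    int i, pinned;

    /* Fault in and pin the user pages; 1 requests write access. */
    pinned = get_user_pages_fast(user_addr, NPAGES, 1, pages);

    /* ... program the device to perform DMA on these pages and wait ... */

    for (i = 0; i < pinned; i++) {
        set_page_dirty_lock(pages[i]);
        put_page(pages[i]);
    }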

Problems can arise when the kernel decides to perform writeback on some of the pages brought in with get_user_pages(). The writeback process will write-protect the pages so that user-space cannot modify them until writeback is complete, but it knows nothing about DMA operations started by the driver that called get_user_pages(); that I/O may still be ongoing. One failure mode comes about as the result of the filesystem not knowing that pages are changing underneath it; that can lead to crashes in the filesystem code.

Other crashes can come about if page reclaim removes buffers from the pages before the driver marks them dirty. Problems can result from modification of the data contained in pages while they are under writeback; it is essentially the stable pages problem all over again. And there are various data loss or corruption problems associated with use of fallocate() on pages that are under I/O — fallocate() may want to shuffle pages around, but an ongoing DMA operation will do the wrong thing if that happens.

Things get even worse if DAX is in use, since the pages in question exist on the storage media itself. If, for example, pages are truncated from a file before DMA completes, the result can be data and metadata corruption. Running DMA directly against blocks that the filesystem is manipulating is hazardous; the filesystem cannot see the elevated reference counts that would indicate that something else is going on with those pages.

Boaz Harrosh suggested simply preventing writeback on pages with elevated reference counts, but that would be likely to create all kinds of strange side effects. The fact that subsystems like RDMA can hold references on pages for hours at a time exacerbates this kind of problem. (The group circled for a while on the topic of whether this kind of long-term reference makes sense, without any sort of useful outcome.)

Williams said that the core of the problem is finding a way to allow the kernel to work with pages that have been pinned with get_user_pages(). He proposed a set of changes, starting with storing information about pinned pages in the inode (Al Viro was quick to ask: "which inode?") and requiring get_user_pages() users to provide a revoke() callback. Jérôme Glisse insisted, though, that any call site that could implement revoke() could also just use MMU notifiers to detect changes. Williams said that revoke() would really just wait for the I/O to complete so that the pages could be released, but Glisse pointed out that, with various types of I/O (such as a camera device streaming video images) the I/O is never really done. There would be no avoiding taking action to stop I/O in such cases.

Going further, Glisse stated that MMU notifiers are the interface that the kernel has now for dealing with memory-management events. They are called for all page-table entry changes, including write protection; they should be used, he said, rather than reinventing the interface somewhere else. Kara acknowledged that the idea sounds interesting for short-term users of get_user_pages(), at least. As the session ran out of time, Glisse said that long-term users could make it work too; the Mellanox RDMA driver "did it right", for example. Of course, he acknowledged, the fact that this interface has its own memory-management unit helps. The kernel should, he said, "be mean" to hardware that lacks such capabilities.

About the only hard conclusion from this discussion was that more discussions are needed before the developers will get a real handle on this problem.

Comments (1 posted)

The LRU lock and mmap_sem

By Jonathan Corbet
April 30, 2018

LSFMM
The kernel's memory-management subsystem has to manage a great deal of concurrency; that leads to an ongoing series of locking challenges that sometimes seem intractable. Two recurring locking issues — the LRU locks and the mmap_sem lock — were the topic of sessions held during the memory-management track of the 2018 Linux Storage, Filesystem, and Memory-Management Summit. In both cases, it quickly became clear that, while some interesting ideas are being pursued, easy solutions are not on offer.

Too-frequently used LRU locks

The kernel maintains a set of least-recently-used (LRU) lists to track both anonymous and page-cache pages; when the time comes to reclaim some memory for other uses, the pages that have been idle the longest are the first to go, since they are, with luck, the pages that are least likely to be needed in the near future. The LRU lists (which exist for each NUMA node) are dynamic, with pages being added, removed, or reordered frequently. Daniel Jordan started his session by noting that Oracle has been running into problems with contention for the LRU locks that serialize access to the lists; about 1% of query time, he said, is spent waiting on those locks.

The problem, he said, is that the LRU lock is "a big hammer" controlling access to an entire LRU list. There should be no need for such a hammer; multiple threads should be able to operate on different parts of an LRU list concurrently. Getting there, he said, requires moving to a special type of per-page lock that uses the list structure itself.

Under his proposed scheme, the first step in removing a page from an LRU list is to put a special value into the "next" pointer of the previous page on the list. A compare-and-swap (CAS) operation would be used to change this pointer, making it possible to detect contention with another thread trying to make a change at the same place at the same time. The page being removed would have its "next" pointer changed to a sentinel value as well.

At this point, any other thread traversing the LRU list will, when it hits the sentinel value, know that things are being changed; it will then spin on the pointer until it returns to a normal value. With traversals and concurrent changes blocked, the page of interest can now be removed from the LRU. The "next" pointer in the previous page, which remains on the list, can now be set to the page that followed the removed page, removing the lock and re-enabling concurrent operations.

A similar algorithm can be used for adding pages to the list. The list head itself can also be set to the sentinel value when pages are added to the front of the list.
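
As a rough userspace sketch of the removal step (not the actual patches), the idea looks like this with C11 atomics: a sentinel is swapped into the predecessor's "next" pointer, concurrent walkers spin when they see it, and storing the real successor releases the per-link "lock".

    #include <stdatomic.h>
    #include <stddef.h>

    struct lru_page {
        _Atomic(struct lru_page *) next;
        /* page state would live here */
    };

    #define LRU_LOCKED ((struct lru_page *)1)   /* sentinel value */

    /* Remove 'page', whose predecessor on the list is 'prev'.  Returns 0 on
     * success, -1 if another thread was modifying the same spot. */
    static int lru_remove(struct lru_page *prev, struct lru_page *page)
    {
        struct lru_page *expected = page;

        /* Lock the link by swapping in the sentinel; failure means contention. */
        if (!atomic_compare_exchange_strong(&prev->next, &expected, LRU_LOCKED))
            return -1;

        /* Mark the departing page too, so walkers starting from it also spin. */
        struct lru_page *successor = atomic_exchange(&page->next, LRU_LOCKED);

        /* Publish the real successor; concurrent walkers stop spinning here. */
        atomic_store(&prev->next, successor);
        return 0;
    }

    /* A traversal simply spins while it sees the sentinel. */
    static struct lru_page *lru_next(struct lru_page *page)
    {
        struct lru_page *next;

        do {
            next = atomic_load(&page->next);
        } while (next == LRU_LOCKED);
        return next;
    }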

Jordan said that contention is almost never encountered when using this algorithm, so the problems with the LRU lock essentially go away. Johannes Weiner suggested that it could be reduced further during addition operations by searching for an uncontended point rather than spinning; the exact position of new pages in the list isn't particularly important. Andrew Morton said that this algorithm could prove to be useful for a number of busy lists in the kernel.

Dave Hansen, instead, said that this idea was "cool", but that the real contention problem is the zone lock, which should be dealt with first. He noted that Aaron Lu has done some work in this area, and suggested that this "CAS trick" could perhaps work there as well. Hugh Dickins said that the algorithm is more interesting as a general approach to list manipulation than as a specific solution to LRU-lock contention, which isn't a problem for everybody. The session then wound down with a brief discussion of perhaps increasing batching in page management as another way of reducing contention.

mmap_sem

The mmap_sem semaphore is used to control access to a process's address space — and to a variety of other, related data structures. It has long been a contention point in the memory-management subsystem, but it has proved resistant to change. Laurent Dufour's session, held immediately after the LRU-lock discussion, started with a complaint that mmap_sem is the source of a great deal of contention on large systems that are running a lot of threaded applications. Can something be done about that?

The place to start, he said, is figuring out just what mmap_sem protects. That is not an easy answer to find. It covers access to many fields in the mm_struct structure. It is also used for the virtual memory area (VMA) red-black tree, the process VMA list, and various fields within the VMA structure itself. But that is just a beginning, he said; a serious audit will be needed to find the rest.

What are the options for reducing mmap_sem contention? One is speculative page-fault handling, an area Dufour has been working on for a while. It allows the handling of page faults, in many cases, without the need to grab mmap_sem at all. Breaking up mmap_sem into finer-grained locks is possibly interesting, he said. A variant of that is range locking, which would support locking a portion of the address space rather than the whole thing; range locking won't solve all of the problems, though. There may be places where SRCU could be used to reduce contention. Finally, he noted that splitting and merging of VMAs is a contention point that could perhaps be resolved by deferring the merging of VMAs.

Hansen said that there were a lot of solutions in this list, but asked for a list of the problems being solved. He noted that even read access to mmap_sem hurts, since it bounces the reader count between processor caches. He suggested picking one specific problem and working on a solution.

Michal Hocko said that the real problem is applications "pretending they can have thousands of threads and it will still work". He, too, suggested prioritizing problems and picking the one that seems most important. Speculative page faults are nice, he said, but the patch also adds a lot of complexity to the page-fault path — which is already complex. There has been a lack of use cases that would show a real benefit from range locking, so that work has been stalled for a while. Smaller steps, he said, would be a better way to go.

There was a side discussion on range locking and how to pick the proper range to lock. It was pointed out that, for many operations, locking the range covered by a specific VMA would not be enough; it would also be necessary to lock one page on either side of the VMA to prevent concurrent merging. It is not clear that range locking is worth the effort, especially since many applications consist of one large VMA containing the bulk of the address space.

One area of contention is the mprotect() system call, which can cause a lot of splitting and merging of VMAs. Applications could reduce that contention in many cases by using memory protection keys instead. It was also suggested that the kernel could just avoid merging VMAs after mprotect() calls. Memory-management developers have long wanted to minimize the number of VMAs, but perhaps that doesn't really matter.
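
For reference, the protection-keys alternative looks roughly like the minimal example below (x86-only, glibc 2.27 or later). The region is tagged with a key once; later permission changes are per-thread register updates that, unlike repeated mprotect() calls, do not split or merge VMAs.

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t len = 4096;
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            exit(1);
        }

        /* Allocate a key and tag the region with it (one-time setup). */
        int pkey = pkey_alloc(0, 0);
        if (pkey < 0 || pkey_mprotect(p, len, PROT_READ | PROT_WRITE, pkey)) {
            perror("pkey");
            exit(1);
        }

        /* Subsequent permission changes are per-thread register writes;
         * no VMAs are split or merged. */
        pkey_set(pkey, PKEY_DISABLE_WRITE);   /* read-only for this thread */
        pkey_set(pkey, 0);                    /* writable again */

        pkey_free(pkey);
        return 0;
    }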

Hocko brought the session to a close with the conclusion that there is a lot of work to do in this area, and that no "single bullet" exists to solve the problem. He suggested getting rid of the worst abuses of mmap_sem as a starting point; there are /proc interfaces that use it, for example. Once that has been done, maybe the search could begin for a sane range-locking approach. Hansen said that the lock itself is often not the problem, and that the place to start is by better documenting how mmap_sem is actually used.

Comments (4 posted)

Three sessions on memory control groups

By Jonathan Corbet
May 1, 2018

LSFMM
Memory control groups allow the system administrator to impose memory-use limits on the members of control groups. In many ways, these limits behave like the overall limit on available memory, but there are also some differences. The behavior of the memory controller also changed with the advent of the version-2 control-group API, creating problems for at least one significant user. Three sessions held in the memory-management track of the Linux Storage, Filesystem, and Memory-Management Summit explored some of these problems.

Background reclaim

Yang Shi ran a session to discuss one of those differences: the lack of background reclaim inside control groups. He started by noting that there are whole classes of applications that do not respond well to latency spikes; high-speed trading is one such area. But a latency spike is exactly what happens whenever a process is forced into direct reclaim, which is a way of making the process do some of the work to free up memory to satisfy its own allocation requests. The system as a whole uses a kernel thread (kswapd) to perform reclaim in the background, but no such mechanism exists for control groups. That means that processes running under the memory controller can be made to perform direct reclaim when the control group approaches its limit, even if the system as a whole is not running short of memory.

Shi's question was: why not have background reclaim for memory control groups as well? Michal Hocko responded that this idea has been considered in the past. A system can have a lot of control groups, though, which would lead to a lot of kernel threads running to perform this reclaim. Those threads could end up eating a lot of CPU cycles which, in turn, could enable one control group to steal time from others, thus breaking the isolation between them. Fixing that problem would require putting the kernel thread inside the control group itself so that it could be throttled by the CPU controller, but that is not currently possible.

That notwithstanding, Shi has been working on a solution based on a patch posted by Ying Han back in 2011. It allows a set of memory watermarks to be applied to a control group; when the watermarks are set, a kernel thread will be created to enforce them. The current implementation works by scanning the local control-group least-recently-used list; it does not currently support child groups. The result is working at Alibaba; the code can be found in this repository. It has not yet been cleaned up for merging into the mainline.
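
Configuring such a feature would presumably look something like the sketch below. The watermark file names are purely illustrative (the article does not give the names used by the Alibaba patch); only memory.limit_in_bytes is an existing cgroupv1 control file.

    #include <stdio.h>

    /* Hypothetical per-group watermark knobs; the real patch may use
     * different names.  The idea is that a background thread starts
     * reclaiming when usage crosses the high watermark and stops at the
     * low one, so processes never hit the limit and enter direct reclaim. */
    static int set_memcg_value(const char *group, const char *file, const char *val)
    {
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path), "/sys/fs/cgroup/memory/%s/%s", group, file);
        f = fopen(path, "w");
        if (!f)
            return -1;
        fprintf(f, "%s\n", val);
        return fclose(f);
    }

    int main(void)
    {
        set_memcg_value("trading", "memory.limit_in_bytes", "8589934592"); /* 8GB */
        /* hypothetical watermark files */
        set_memcg_value("trading", "memory.wmark_high", "7516192768");     /* 7GB */
        set_memcg_value("trading", "memory.wmark_low",  "6442450944");     /* 6GB */
        return 0;
    }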

That cleaning up could take a while, because there are several problems with the current implementation. It creates kernel threads, which have already been noted to be a problem. The lack of hierarchical reclaim will be a sticking point, since the memory controller is otherwise fully hierarchical. The per-group kswapd thread can interfere with the global one. And the whole thing only works with the version-1 control-group ("cgroupv1") interface. All of these issues would need to be addressed before the patch could go upstream.

Rik van Riel suggested that workqueues could be used instead of kernel threads. The CPU-accounting issues would remain, but there would not be a lot of idle kernel threads sitting around. Dave Hansen noted that the original patch came out of Google and asked whether Google uses it now; Hugh Dickins responded that Google isn't using it now, but might start if the code found its way upstream. Johannes Weiner asked why the cgroupv2 API is not supported, since the reclaim code is the same for both; Shi responded that there is a lot of legacy code at Alibaba that prevents moving to cgroupv2.

Hansen asked if the kernel threads are really a problem, given that they are relatively lightweight. Hocko said that the main problem is the lack of CPU-usage accounting; Dickins said that is one of the reasons why Google doesn't use this mechanism. Weiner said that the accounting issues could probably be fixed, but only in the cgroupv2 API, which would ensure that the CPU and memory controllers are managing the same set of processes.

Andrew Morton questioned the need for a kernel thread; instead, the kernel could just fork a thread in the context of a process running in the group. This idea drew some interest, resulting in some parallel conversations on how it might be made to work. There are, evidently, patches inside Google for "threshold events" that could be used to trigger the launching of this reclaim thread. But Weiner said that the cgroupv1 memory controller had some of this functionality, and the result was a lot of complexity and run-time cost. It would be more straightforward, he said, to just find a way to annotate a kernel thread.

As time ran out in the session, Shi moved on to a related problem: direct compaction. Beyond reclaim, processes can also be drafted to move pages around in memory for defragmentation purposes. It is expensive, Shi said, and there is no way to control when it is triggered. He suggested adding a per-process flag that would cause the kernel to skip direct compaction, even when there does not appear to be any other way to satisfy an allocation request. Instead, an ENOMEM error would be returned and the kcompactd kernel thread would be kicked.

Weiner said that disabling direct compaction in this way is an invitation for visits from the out-of-memory killer. Hocko, instead, worried that returning ENOMEM in random places would tickle bugs in code that is not really prepared for allocation failures. In the end, the only real agreement was to continue talking about the problem in the future.

Swap accounting

Shakeel Butt ran two sessions the following day to cover issues that have come up with memory control groups at Google. The first of those is a change of behavior depending on which version of the control-group API is in use. In particular, there is a difference in how the limits on memory and swap use are set:

  • In cgroupv1, there is a single limit (memory.memsw.limit_in_bytes) that applies to the sum of RAM and swap usage by the group. Swapping a page in or out does not change a group's accounted usage under this limit.
  • In cgroupv2, there are two limits (memory.max and memory.swap.max) that are accounted independently. Swapping a page out will decrease the measured memory usage and increase the swap usage.

This change was made because swap usage is seen as a fundamentally different resource requirement; in particular, swapping involves block I/O operations.

This change has created trouble for Google, though. A common situation there is to have multiple instances of the same job running in different data centers. Each center is run independently and is trying to maximize its productivity; as a result, one data center might run the job with swap enabled while another runs it without. Under cgroupv1, that job will have access to the same amount of memory in both centers, regardless of the availability of swap. Under cgroupv2, instead, only the memory limit applies and jobs are much more likely to end up facing the out-of-memory killer.

The advantage of the cgroupv1 interface, Butt said, is that the people submitting jobs don't need to know anything about what resources will be available when the jobs run. They will get consistent behavior whether swap is available or not. That is no longer true with cgroupv2; this problem is keeping Google from moving off of cgroupv1.

The memory-management developers were seemingly unconvinced that there is a real problem here, though. Weiner argued that memory and swap are not the same thing, so it does not make much sense to conflate them. Dave Hansen suggested just giving every group some swap space for free. Hansen and Weiner both pointed out that the separate controls for memory and swap give the administrator some control over the quality of service received by each group.

By the end of the session, it seemed unlikely that much is going to change. Dickins said that Google would probably keep the old behavior internally regardless of what happens with the mainline kernel. It is "peculiar from a cgroup point of view", he said, but the cgroupv1 behavior proves to be helpful in the real world.

OOM or ENOMEM?

Butt's other topic was behavior when memory runs out. If the system as a whole goes into an out-of-memory (OOM) state, the OOM killer will start killing processes in response to page faults or system calls that try to allocate memory. If a memory control group hits its limits, OOM kills will still happen on page faults, but system calls will return an ENOMEM error instead. This behavior is rather inconsistent, he said.

Hocko admitted that this difference in behavior was not an explicit design decision; instead, it is a workaround to prevent lockups. The OOM killer is able to delegate the task of killing processes to user space; this is a useful feature, but it can lead to deadlocks in the control-group setting, where it is highly likely that the process trying to allocate memory holds locks that will block the killing of the others. To avoid this problem, control-group OOM killing is only done in contexts that are known to be lock-free.

Since that decision was made and implemented, he said, the OOM reaper has been added to the kernel. The reaper is able to deprive an OOM-kill victim of most of its memory, even if the process itself is unable to exit because it is waiting on locks. So perhaps the kernel could move back toward consistent behavior in this case.

Alternatively, the kernel could defer the summoning of the OOM killer until the allocating process returns to user space. One potential problem there is "runaway allocations" — kernel code that loops allocating more and more memory without checking for fatal signals between allocations. Code like that simply needs to be fixed, of course. Meanwhile, things could be improved by letting the allocator dip into reserve memory when an OOM killer episode is on the horizon. Doing so risks breaking the isolation between groups, Weiner said, but it would help to avoid deadlocks.

The conclusion at the end of the session was that the current behavior is clearly a bug in need of fixing; patches can be expected soon. There may also be patches instrumenting the kernel (for debug builds) to detect places where a series of allocations is performed without checking for signals.

Comments (3 posted)

The slab and protected-memory allocators

By Jonathan Corbet
May 1, 2018

LSFMM
One of the core jobs of the memory-management subsystem is to make memory available to other parts of the kernel when the need arises. The memory-management track of the 2018 Linux Storage, Filesystem, and Memory-Management Summit hosted a pair of sessions on new or improved allocation functions for the kernel, covering the slab allocators and protectable memory.

Slab allocators

The kernel's slab allocator is charged with allocating (usually) small chunks of memory for the rest of the kernel; it sits behind interfaces like kmalloc() and kmem_cache_alloc(). Vlastimil Babka led a session to discuss a couple of issues that have come up in the slab allocators. The first of those has to do with reclaimable slabs, which are used to allocate kernel objects that can be freed on request to defragment memory.

The kernel's dentry cache (which caches the results of filesystem lookups) is allocated from a reclaimable slab. But when a specific dentry refers to a particularly long name, that name won't fit into the dentry structure itself and must be allocated separately. That allocation, done with kmalloc(), is not directly reclaimable. In theory that is not a huge problem, since a call to the dentry shrinker will reclaim both pieces of memory. But the kernel's accounting of how much memory is truly reclaimable is thrown off by this allocation pattern; the kernel thinks there is less reclaimable memory than there really is and goes needlessly into the out-of-memory state.

A solution can be found in this patch set from Roman Gushchin. It creates a new counter (nr_indirectly_reclaimable) to track the memory used by objects that can be freed by shrinking a different object. Babka is not entirely happy with the patch set, though. The name of the counter forces users to be concerned with "indirectly" reclaimable memory, which they shouldn't have to do. It is an ad hoc solution, he said, that should not become a part of the kernel ABI.

A better solution, he said, would be to make a separate set of reclaimable slabs for those kmalloc() calls. That would keep the reclaimable objects together, which would be better from a fragmentation point of view. Memory would be allocated from these slabs by providing the GFP_RECLAIMABLE flag. Babka asked the group whether he should pursue this idea, and got a positive response.
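
If that proposal were adopted, the dentry code's external-name allocation might look something like the sketch below. GFP_RECLAIMABLE is the flag name used in the session; at the time it was only a proposal, so treat the flag (and the helper shown) as illustrative rather than a mainline API.

    #include <linux/slab.h>
    #include <linux/string.h>

    /* Sketch only: allocate a long dentry name from a reclaimable kmalloc
     * cache, as proposed in the session. */
    static char *alloc_external_name(const char *name, size_t len)
    {
        char *ext = kmalloc(len + 1, GFP_KERNEL | GFP_RECLAIMABLE);

        if (!ext)
            return NULL;
        memcpy(ext, name, len);
        ext[len] = '\0';
        /* Because the object is freed when the dentry shrinker runs, the
         * allocator can count it as reclaimable and keep it grouped with
         * other reclaimable objects, limiting fragmentation. */
        return ext;
    }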

That leaves open the problem of this new counter, though, which has been merged for 4.17. Michal Hocko suggested simply reverting the patch; this accounting has been broken for years, and can stay that way a little longer. But others questioned whether it was really an ABI issue at all; Johannes Weiner said that counters have been removed before without ill effect.

Babka's other topic was the provision of slab caches for objects that are larger than one page, but whose size is not a power of two. Such objects are not handled efficiently now; a request to allocate a 640KB object, for example, will return a 1MB region, which is wasteful. The memory waste can be addressed by using alloc_pages_exact(), but that adds complexity and can cause memory fragmentation.
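
For comparison, a minimal kernel-code sketch of the existing alloc_pages_exact() approach is shown below: an ordinary 640KB allocation is rounded up to a 1MB (order-8) region, while alloc_pages_exact() frees the unused tail back to the page allocator at the cost of extra fragmentation.

    #include <linux/gfp.h>
    #include <linux/errno.h>

    #define BUF_SIZE (640 * 1024)

    static void *buf;

    static int buf_init(void)
    {
        /* Allocates an order-8 (1MB) block, then immediately returns the
         * unused 384KB tail to the page allocator. */
        buf = alloc_pages_exact(BUF_SIZE, GFP_KERNEL);
        return buf ? 0 : -ENOMEM;
    }

    static void buf_exit(void)
    {
        free_pages_exact(buf, BUF_SIZE);
    }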

Instead, he suggested, it would be useful if the slab allocators could use larger (2MB) blocks of memory for these objects. That would reduce the amount of internal fragmentation considerably. It was generally agreed that this could be done, but there would need to be some changes to some of the heuristics that are used. Generally, code allocating these objects has a fallback path should the allocation fail, so the allocator itself should not fall back to smaller regions should the 2MB allocation fail. But the GFP_NORETRY flag can reduce the chances of that 2MB allocation succeeding in the first place, so that isn't the solution either.

As the session came to an end, Christoph Lameter pointed out that there is a min_order parameter that can be used to force the use of larger slabs now. It significantly increases performance, but it also applies to all slabs, which is probably not wanted on most systems. The solution would be to turn it into a per-slab parameter, he said.

Protectable memory

Igor Stoppa's protectable memory patch set was examined on LWN in March. He ran an LSFMM session to make the case for this functionality and to get some feedback from the developers. Protectable memory can be made read-only once it has been initialized, making it harder for an attacker to change it as part of a system compromise. It is meant to solve problems like those found on Android systems, where users install questionable apps, some of which may try to exploit a kernel vulnerability to change important data in the kernel.

The usual sequence for this sort of attack, he said, is to start by taking over an existing app, perhaps via a phishing attack. Then a kernel vulnerability is used to gain write access to some kernel data. The attacker must locate that data, which means defeating kernel address-space layout randomization, but there are usually leaks that can be used for that purpose. Once the attacker is able to make changes, the first order of business is to disable SELinux, after which it becomes possible to escalate to unconstrained root access in user space.

It is hard to close off all of the vulnerabilities that an attacker might try to use, Stoppa said. He can't fix user space or the various out-of-tree drivers that ship on such devices. But he might be able to prevent the disabling of SELinux. Most attacks on SELinux try to make changes to (or disable entirely) the policy database; making that data read-only should raise the bar considerably.

The existing mechanisms for creating read-only data in the kernel are not up to this task, though. Data can be marked read-only at the end of kernel initialization, but that is too soon for SELinux, since the policy must come from user space. It would be possible to use vmalloc() to allocate this database and change the page protections, but this approach would create a lot of fragmentation and TLB contention. So he created a new pmalloc() interface instead.
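
The intended usage pattern looks roughly like the sketch below: allocate from a writable pool, initialize the data, then seal the whole pool. The type, function names, and signatures are approximations of the posted (unmerged) patch set, not a mainline API.

    /* Sketch of the proposed pmalloc() usage; names are approximate. */
    static struct pmalloc_pool *policy_pool;
    static void *policy;

    static int load_policy(const void *blob, size_t size)
    {
        policy_pool = pmalloc_create_pool();        /* pool starts out writable */
        if (!policy_pool)
            return -ENOMEM;

        policy = pmalloc(policy_pool, size, GFP_KERNEL);
        if (!policy)
            return -ENOMEM;

        memcpy(policy, blob, size);                 /* initialize from user-supplied data */
        pmalloc_protect_pool(policy_pool);          /* mappings become read-only for good */
        return 0;
    }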

Dave Hansen asked for performance numbers relative to using vmalloc(). Stoppa does not have those numbers now; Hansen requested that he run some tests and provide them. Since this is a performance-oriented patch, the actual performance gain needs to be demonstrated.

A recent addition to pmalloc() is the "rare write" mechanism for cases when the data must be made writable again for a short period. That involves creating a new pool type for modifiable data. When modification happens, the new data is mapped into a different location, hopefully making it harder for an attacker to find.

Hugh Dickins asked about changing protected memory via the kernel's direct mapping. This mapping remains writable and, since it uses huge pages, it is hard to change the protections for individual (small) pages. Stoppa agreed that the direct mapping is a potential problem; it might be fixable on the x86 architecture, but not on ARM. Dickins responded that, in that case, one might as well just use kmalloc(). But Hansen disagreed, saying that it can be hard to find a specific object's location in the direct mapping, so there is some benefit to using pmalloc(). But he was unsure about how big that benefit is, and would like to hear what the security developers think.

This patch set has been through three rewrites so far. One problem is that these patches add the mechanism but do not add any users of it, which makes merging harder. The problem here, Stoppa said, is that it is hard to find simple use cases. Getting the more complex users (such as SELinux) is hard without the API in the kernel, but getting the API merged is difficult without the users.

Comments (none posted)

Improving support for large, contiguous allocations

By Jonathan Corbet
May 1, 2018

LSFMM
Allocating chunks of memory that are both large and physically contiguous has long been a difficult thing to do in the kernel. But there are times where there is no alternative. Two sessions in the memory-management track of the 2018 Linux Storage, Filesystem, and Memory-Management Summit explored ways of making those allocations more reliable. It turns out that some use cases have a rather larger value of "large" than others.

The contiguous memory allocator

Allocating large, physically contiguous blocks of memory becomes increasingly difficult over the life of the system as memory gets more fragmented. Most kernel code goes out of its way to avoid such allocations, but there are places where they are necessary; for such situations, the contiguous memory allocator (CMA) can be used.

In her session on CMA, Fedora kernel maintainer Laura Abbott said that it relies on the kernel's "page block" mechanism, and it needs both reclaim and compaction to work properly. That is because CMA counts on being able to move other allocations out of the way to create large blocks when the need arises. Page blocks are a relatively small management unit in most kernels, being the same size as the smallest huge page by default. Some trouble comes up on ARM systems, though, when 64KB pages are in use; at that granularity a huge page (and thus a page block) spans 8,192 entries of 64KB each, or 512MB. CMA requires regions to be page-block aligned, but creating a 512MB block for CMA on a memory-constrained system like a Raspberry Pi does not work well.

The current workarounds are to either do without CMA, or to use 4KB pages, neither of which is entirely appealing. Using smaller pages, in particular, would require Fedora to ship two kernels using different page sizes, and distributors will go far out of their way to avoid supporting multiple kernels if at all possible. Matthew Wilcox said that it might be interesting to add the ability for the kernel to automatically choose the page size when it boots, but that would not be a straightforward thing to implement.

One possible solution is to tie CMA to the ZONE_MOVABLE zone, which also would solve a number of accounting problems. CMA would still be bound by the page-block size, though. There was some talk of maybe reducing the page-block size, which would address this problem, but it might come at the cost of support for huge pages. Nonetheless, changing the size of page blocks is easily supported in the kernel now, so it might be the cleanest path to a short-term solution. At the end of the session, though, Michal Hocko noted that the page-block size is an arbitrary software construct; if page-block alignment requirements are causing trouble, "we've done it to ourselves", he said.

Going larger

The following session, led by Mike Kravetz, made it clear that "large" is a relative term. While CMA tends to be used for allocations measured in megabytes, he is working with a Mellanox RDMA controller that performs best when given a 2GB physically contiguous buffer to work with. While CMA allocations are performed within the kernel, this buffer is allocated in user space and registered with the driver. Needless to say, there can be some challenges involved in making it possible for that allocation to succeed.

The hugetlbfs mechanism supports "gigantic page" allocations; it can provide 1GB pages on the x86 architecture. Given the tricks involved in making that work, Kravetz said, the code that does this in hugetlbfs probably shouldn't be there. But it is a useful technique, and other use cases, such as Intel's patch set that goes under the concise name of "resource director technology cache pseudo locking", are starting to pop up as well.

Kravetz has been looking into alternatives. One of those was to add a MAP_CONTIG option to the mmap() system call that would cause the new mapping to be populated with physically contiguous pages. That turns out to give user space too much control over how allocations are made, though.

So he is looking instead at adding a new allocation function called find_alloc_contig_pages(). This function, meant to be called from within the driver, replicates much of the behavior found in the hugetlbfs allocator. It requires that the system support page migration so that the huge pages can be cleared if needed. The memory it allocates is in the movable zone, which may well be a problem for buffers meant to be used with peripheral devices.
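
A driver using the proposed function might look roughly like the sketch below; find_alloc_contig_pages() is not in the mainline kernel, and the argument list shown is a guess based on the description in the session rather than the posted patches.

    #include <linux/gfp.h>
    #include <linux/mm.h>
    #include <linux/errno.h>

    /* Hypothetical use of the proposed allocator: ask for a 2GB physically
     * contiguous region (order 19 with 4KB pages) for an RDMA device. */
    #define RDMA_BUF_ORDER 19   /* 2^19 pages * 4KB = 2GB */

    static struct page *rdma_buf;

    static int rdma_buf_alloc(int nid)
    {
        rdma_buf = find_alloc_contig_pages(RDMA_BUF_ORDER, GFP_KERNEL, nid, NULL);
        if (!rdma_buf)
            return -ENOMEM;
        /* The pages come from the movable zone, so they would need to be
         * pinned before the device is allowed to DMA into them. */
        return 0;
    }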

The actual use case for this functionality lies in the Oracle database, which needs to register global areas for RDMA transfers. For hardware performance, a 2GB physically contiguous area is preferred. The CPU, though, needs to see this area as 2MB huge pages, which perform better than 1GB gigantic pages on current hardware. This memory could be allocated with hugetlbfs, Kravetz said, or the kernel could just allocate contiguous areas at boot.

He was seemingly hoping for some guidance from the developers on which of these approaches might be the least ugly. The session ran out of time, though, without a definitive conclusion on the best way to obtain this kind of memory allocation.

Comments (3 posted)

Toward better performance on large-memory systems

By Jonathan Corbet
May 2, 2018

LSFMM
Christoph Lameter works in a different computing environment than most of us; he supports high-volume trading applications that need every bit of performance that the fastest hardware can give them. Even then, it seems that isn't fast enough. In a memory-management-track session at the 2018 Linux Storage, Filesystem, and Memory-Management Summit, Lameter described some of the problems he has encountered and approaches he is considering to address them.

He is working with a large system that has four 10Gb/s network interfaces on it. Those interfaces run at full speed during bursts of activity, pouring data into the page cache, which sits in 1TB of main memory. That data is then archived onto a 100TB storage array. This system is vital for his company's operations; it must be possible to answer questions from regulators on exactly when any given packet arrived. It is important, he said, that the system works.

Unfortunately, he is having problems writing data to the archive; the maximum data rates from the page cache are 4-5GB/s. The only way the system can keep up is to remove some of the network interfaces. A much higher data rate (around 10GB/s) could be obtained by using direct I/O and avoiding the page cache altogether, but he would rather not do that. There are also a number of analysis processes running on this system, and they benefit from having the data in the page cache. So he's not sure what to do.

The problem is going to get worse; the company is upgrading to 100Gb/s interfaces and wants to achieve rates of 40GB/s to the disk. Lameter thinks that these performance issues are specific to Intel hardware; everything works as desired on POWER8 processors. The root of the problem is the 4KB system page size, which is quite small on such a system. Unfortunately, filesystems and block devices in Linux do not understand huge pages.

He is considering trying to increase the base page size above 4KB. This is not a new idea; developers have been trying to make this work for years. Hugh Dickins noted that William Lee Irwin's page clustering patches worked fifteen years ago, but they proved to be too complex to be merged. A larger page size can also break some applications that expect to be able to map pages on a 4KB granularity.

Lameter has done some work on the "order N page cache", which would be able to store pages of any order. But he never got around to implementing mmap(), he said, and the patch had too many fragmentation problems to be merged. An alternative is transparent huge pages in the page cache. The problem with that is that no filesystem (other than tmpfs) supports it currently. He has looked at (ab)using the DAX mechanism to get huge-page mappings in ordinary memory; one could also use DAX for real on nonvolatile memory.

Returning to the 4KB page-size issue, Lameter pointed out that, at that size, about 2% of the system's memory is taken up by page structures. A system with 4TB of RAM must manage one billion page structures. There are systems supporting 20TB of nonvolatile RAM coming soon, he said; the overhead is becoming unsupportable. But Dickins said that he wasn't worried about a 2% space tax.

Almost any solution to Lameter's performance problems is going to require more reliable allocation of large, physically contiguous memory areas. He has been playing with an XArray cache for memory chunks that would allow them to be moved as needed. Simple slab allocator support has been implemented, but a lot of work is needed still; allocators need to provide callbacks to allow memory chunks to be relocated. Alternatives include reserving memory at boot for large allocations, the MAP_CONTIG option to mmap(), or Java-style garbage collection. "But we are all horrified" by that last idea, he said.

Rik van Riel suggested working with the filesystem developers to support larger block sizes. But some filesystems (XFS, for example) already have that support; the real problem is in the page cache. Dickins suggested creating a mechanism like transparent huge pages that would opportunistically allocate larger chunks for the page cache; those chunks would be smaller than typical huge pages, though.

From that point on, the conversation went around in circles about whether these performance issues are truly a hardware problem or not. There was no useful outcome from that discussion, though, and it petered out as beer time approached.

Comments (3 posted)

File-level integrity

By Jake Edge
April 27, 2018

LSFMM

At the 2018 Linux Storage, Filesystem, and Memory Management Summit, Ted Ts'o introduced an integrity feature akin to dm-verity that targets Android, at least to start with. It is meant to protect the integrity of files on the system so that any tampering would be detectable. The initial use case would be for a certain special type of Android file, but other systems may find uses for it as well.

Android has a system partition that is read-only and protected with dm-verity. It must be completely rewritten in order to update it; after that, a reboot is required to start using the new data. Updating a filesystem protected with dm-verity is a heavyweight operation, Ts'o said. But Android has some system-level programs that need to be updated with some frequency, so there is the idea of privileged Android packages (APKs) that would not live in the system partition. These are somewhat like setuid-root binaries, Ts'o said, but Google wants them to be updated like any other app—in the background, possibly unnoticed by the user.

Normally, APKs have a signature that is checked once at download time and then never checked again. For the privileged APKs, Google wants to do the signature check on every use of the APK. That sounds like a job for the integrity measurement architecture (IMA), which targets file-level integrity and is already in the kernel, but its performance is not really up to the Android use case, Ts'o said. APKs can be large, with multiple translations and other pieces that are never used. Doing a checksum of the entire package before executing it will slow down the user experience and use more power; fs-verity will only check the pieces of the APK as they are actually needed.

The way it does that is by creating a file-level Merkle tree that has a cryptographic hash for each page-sized block of the file. The root of the tree will be signed; verifying hashes as the tree is traversed is then enough to ensure that those parts of the file have not been changed.
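
To make the structure concrete, here is a self-contained sketch of building such a per-file Merkle tree over 4KB blocks. fs-verity uses SHA-256; the 64-bit FNV-1a function below is only a stand-in to keep the example short and runnable, and none of this is the actual fs-verity code.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define BLOCK_SIZE       4096
    #define HASHES_PER_BLOCK (BLOCK_SIZE / sizeof(uint64_t))   /* 512 */

    /* Stand-in for a cryptographic hash. */
    static uint64_t hash_buf(const void *data, size_t len)
    {
        const unsigned char *p = data;
        uint64_t h = 0xcbf29ce484222325ULL;

        for (size_t i = 0; i < len; i++) {
            h ^= p[i];
            h *= 0x100000001b3ULL;
        }
        return h;
    }

    /* Hash each 4KB block, then repeatedly hash blocks of those hashes
     * until a single root remains.  Verifying one data block needs only
     * the hashes on its path to the (signed) root. */
    static uint64_t merkle_root(const unsigned char *file, size_t size)
    {
        size_t n = (size + BLOCK_SIZE - 1) / BLOCK_SIZE;
        uint64_t *level = malloc((n ? n : 1) * sizeof(*level));

        for (size_t i = 0; i < n; i++) {
            size_t len = (i == n - 1) ? size - i * BLOCK_SIZE : BLOCK_SIZE;
            level[i] = hash_buf(file + i * BLOCK_SIZE, len);
        }
        while (n > 1) {
            size_t parents = (n + HASHES_PER_BLOCK - 1) / HASHES_PER_BLOCK;

            for (size_t i = 0; i < parents; i++) {
                size_t kids = n - i * HASHES_PER_BLOCK;

                if (kids > HASHES_PER_BLOCK)
                    kids = HASHES_PER_BLOCK;
                level[i] = hash_buf(&level[i * HASHES_PER_BLOCK],
                                    kids * sizeof(uint64_t));
            }
            n = parents;
        }
        uint64_t root = n ? level[0] : 0;
        free(level);
        return root;
    }

    int main(void)
    {
        unsigned char data[3 * BLOCK_SIZE + 100];

        memset(data, 0xab, sizeof(data));
        printf("root: %016llx\n", (unsigned long long)merkle_root(data, sizeof(data)));
        return 0;
    }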

These APKs will be marked as immutable in the filesystem; they will need to be replaced whenever they are updated. The Merkle tree will be placed directly after the normal file data and the tree will be followed by a header that will store the size (i_size) of the original APK file. That will be used as the size of the file when it is accessed by the rest of the system.

Reporting a smaller i_size is perhaps the most controversial part of fs-verity, Ts'o said. For an immutable file, though, he doesn't think it will cause problems elsewhere. He considered using dm-verity with loopback mounts, but that would require all APK-handling code throughout the system to be updated, while fs-verity with the i_size switch allows the verification to be transparent to the rest of the system.

Bruce Fields asked about performance versus IMA, but Ts'o has not measured it; he believes it will be a big win for low-powered ARM devices, though. The key used by fs-verity would either be baked into the kernel directly or there would be a key-signing key in the kernel that would be used to verify the key used to sign the Merkle tree. It is the same basic model as for signed kernel modules, he said.

Jan Kara asked about accessing data beyond i_size, noting that calling mpage_getpages() will not work. Ts'o acknowledged that and said that fs-verity has its own scheme for reading pages past i_size and populating the page cache. Right now, there are "some hacks" that will need to be cleaned up before the code can go upstream.

Chris Mason wondered if that mechanism could be generalized to provide file streams (or forks) for Linux. Ts'o said that would mean that other filesystems beyond just ext4 would need to implement it. Mason argued that this feature is already adding stream support, just for a single, specialized type. But Dave Chinner noted that there is already some precedent for filesystems storing data beyond the reported i_size: XFS directories do so. Ts'o also pointed out that he is able to do a bunch of simplification in the code because these files are immutable.

The initial implementation of fs-verity is "going to be massively cheating", Ts'o said; the code will be found in the Android kernel repositories, but that is not the code that will be proposed for the mainline. He has been talking with Mimi Zohar about integrating fs-verity with IMA and he plans to discuss its design at the Linux Security Summit.

Comments (8 posted)

A kernel integrity subsystem update

By Jake Edge
May 2, 2018

LSFMM

At the 2018 Linux Storage, Filesystem, and Memory-Management Summit, Mimi Zohar gave a presentation in the filesystem track on the Linux integrity subsystem. There is a lot of talk that the integrity subsystem (usually referred to as "IMA", which is the integrity measurement architecture, though there is more to the subsystem) is complex and not documented well, she said. So she wanted to give an overview of the subsystem and then to discuss some filesystem-related concerns.

The goals of the integrity subsystem are to detect files that have been altered, either accidentally or maliciously, by comparing a measurement of the current file contents with that of a stored "good" value; it can then enforce various file integrity policies (e.g. no access or no execution). IMA is three separate pieces: measurement, which calculates a hash of file contents; appraisal, which verifies file signatures made using the measured hashes; and audit, which records hashes and other information in the audit logs. There is also the extended verification module (EVM), which targets the measurement and protection of the file metadata, though that is not included in what she would be presenting.

It is important to note that IMA does not protect against attacks on objects in memory; it can only be used to thwart attacks that change files. The policies governing IMA behavior for a given system all come from a single file. There are two built-in policies: one uses the Trusted Platform Module (TPM) to sign a list of file hashes that can be used for attestation to a third party; the other can verify the entire trusted computing base (TCB) of the system. The IMA policy file can itself be signed and verified, of course.

One reason why IMA is seen as complicated and difficult to work with is because there are so many pieces to it. One area that needs work is software distribution with signature verification. RPM has provisions for signatures on package files, but there have been three separate attempts to add signatures to .deb files, without success. Key management is another area; there needs to be a separation between the keys that are used before starting the operating system and those that are trusted after that point.

IMA-audit was added in the 3.10 kernel; it can be used to augment existing security information and event management (SIEM) tools, as FireEye described in 2016. There is a problem, though: it is important to be able to identify the namespace or container that is generating the log entry. There is a proposal for a kind of container ID to be used by the audit subsystem, but nothing has been merged as yet.

Most of the needed hooks to measure files and verify signatures are available. One missing piece is the ability to verify the signature on the root filesystem, which requires a cpio that can handle security extended attributes (xattrs) on files in the initramfs. Some patches to add that functionality to cpio have been posted, but she wonders if there is a maintainer of the tool that will pick up and maintain that work.

All code and data that gets loaded into the kernel needs to be able to be measured and have signatures verified. That includes kernel modules, policy files, and so on. New system calls that take file descriptor parameters allow IMA to access the needed security xattrs. But she is concerned that BPF programs may avoid measurement and verification since file descriptors are not part of the API for loading them.

There are still some kinds of attacks that are not being thwarted; the most significant is that file names can be changed in an offline attack. Protecting against that would require hashing and verifying the directory structure. Symbolic and hard links would both provide ways to have different names for the same file, however, as was pointed out by several in the room. It is the reason that SELinux protects objects and not files by name, Ted Ts'o noted.

Namespace support for IMA is another area that needs attention, Zohar said. IMA-measurement needs per-namespace policies and IMA-appraisal needs per-namespace keyrings. She concluded her talk by asking the assembled developers to help ensure that new features did not add measurement or appraisal gaps. There are some difficult problems that need solving for IMA and she would like some help from the filesystem developers in doing so.

Comments (none posted)

PostgreSQL visits LSFMM

By Jake Edge
May 1, 2018

LSFMM

The recent fsync() woes experienced by PostgreSQL led to a session on the first day (April 23) of the 2018 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM). Those problems also led to a second-day session with PostgreSQL developer Andres Freund who gave an overview of how PostgreSQL does I/O and where that ran aground on some assumptions that had been made. The session led to a fair amount of discussion with the filesystem-track developers; real solutions seem to be in the offing.

PostgreSQL is process-based; there are no threads used, Freund said. It does write-ahead logging (WAL) for durability and replication. That means it logs data before it is marked dirty and the log is flushed before the dirty data is written. Checkpointing is done in the background with writes that are throttled as needed. In general, all data I/O is buffered, though the WAL can use direct I/O.

[Andres Freund]

There is a per-process file descriptor cache with a size limited by the kernel configuration and ulimit, so file descriptors are closed if there are not enough available. On Linux, the dirty data is forced to storage by an explicit sync_file_range() with the SYNC_FILE_RANGE_WRITE flag. Writes come from several sources: the checkpointer writes sorted pages, the background writer does largely random writes, and the backends do random writes. The latter two are pre-cleaning or cleaning various pages, Freund said.
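
A minimal standalone illustration of that call (not PostgreSQL code) is shown below. Note that SYNC_FILE_RANGE_WRITE only starts writeback for the given range; it does not wait for it to complete or report writeback errors the way fsync() does.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        char buf[8192];
        int fd = open("datafile", O_CREAT | O_WRONLY, 0644);

        if (fd < 0) {
            perror("open");
            exit(1);
        }
        memset(buf, 'x', sizeof(buf));

        /* Buffered write: the data lands in the page cache first. */
        if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
            perror("write");
            exit(1);
        }

        /* Kick off writeback for that range now without blocking on it;
         * this is how dirty data can be throttled.  Durability still
         * requires a later fsync(). */
        if (sync_file_range(fd, 0, sizeof(buf), SYNC_FILE_RANGE_WRITE) < 0)
            perror("sync_file_range");

        if (fsync(fd) < 0)
            perror("fsync");
        close(fd);
        return 0;
    }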

After that brief overview of PostgreSQL I/O, he moved into the issues the project has run into with fsync(). To start with, the guarantees made by Linux (or POSIX) with respect to fsync() behavior are not well documented. One wrong assumption that was made was that retrying an fsync() will fail if the underlying problem has not been fixed. Other operating systems (FreeBSD and Solaris, at least) do have that behavior. Handling that difference is fairly straightforward, he said.
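
One defensive pattern that accounts for the Linux behavior is sketched below: treat the first fsync() failure as meaning the data may already be gone, rather than retrying and trusting a later success. This is an illustration of the principle, not PostgreSQL's actual code.

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* On Linux, a failed fsync() may clear the error state, so a retry can
     * return success even though dirty data was dropped.  Treat the first
     * failure as fatal instead of retrying in a loop. */
    static void checkpoint_fsync(int fd, const char *name)
    {
        if (fsync(fd) < 0) {
            fprintf(stderr, "fsync of %s failed: %s; aborting\n",
                    name, strerror(errno));
            abort();    /* recover from the write-ahead log on restart */
        }
    }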

A bigger problem is that it was assumed that fsync() would return an error if there was a writeback failure, which is not necessarily true for Linux. That was never reliable, but it got a bit worse for PostgreSQL after the introduction of errseq_t, which is what led to the recent fallout. Matthew Wilcox has a patch that makes things better, but still provides no guarantee. In order for all of that to work, however, PostgreSQL would need to have at least one file descriptor that stays open from the earliest write, which is not possible at the moment. It is not just PostgreSQL that is affected, Freund said, backup tools, rsync, and others are impacted as well.

Amir Goldstein asked if there were tests to reproduce the problems that PostgreSQL is seeing. Freund said the project has some, but that they need to improve. A crash framework that uses device-mapper failure injection is under development, he said. Ted Ts'o said that xfstests has ways to do that kind of testing as well, so PostgreSQL should look into that for ideas and code.

Freund said that some have suggested that using direct I/O (DIO) would be a solution for the database system. There are architectural issues that make DIO perform poorly for PostgreSQL, but the project is working on them. In addition, DIO is only going to be useful for well-tuned databases—many installed PostgreSQL databases are not.

One of the possible solutions that PostgreSQL has investigated is to pass file descriptors to the checkpointer, which is what will be calling fsync(). One of the problems with that is to figure out which descriptor for the file is the oldest. Wilcox asked whether the descriptors that need to be closed could be synced before they are closed. That would be too slow, Freund said, since there are potentially hundreds or thousands of file descriptors that would be affected.

David Howells asked if a new option to fadvise64() that returns the error count would be helpful. Freund said that would be one of the best solutions to the problems PostgreSQL is having. A per-filesystem error count would be sufficient; the database would then figure out what it needed to do from that.

Jan Kara said that, for the near term, the plan should be to get Wilcox's patch merged and to work up a patch to keep inodes with errors in memory, as had been discussed the day before. If those inodes are not evicted, the errors can be reliably reported. Since then, the patch from Wilcox has been merged, with the stable kernel team being copied, so it should appear in stable kernels too before long.

There was talk of some way to monitor the kernel log for I/O errors (or to get that kind of information reported via netlink sockets, as Google does). That would work, Freund said, but it is overkill. In the end, PostgreSQL does not really care what the error is, just that it occurred. In addition, a fix that doesn't require rsync, tar, and others to change in order to receive errors that way is much preferred.

In closing, Freund asked for some documentation that would tell application developers what needs to be done in order to durably write their data to disk. Dave Chinner claimed that was "asking too much", to a fair amount of laughter. On the other hand, though, no one really stepped up to say they planned to write said documentation either. Freund did post a summary of what he learned at LSFMM to the pgsql-hackers mailing list.

Comments (none posted)

Page editor: Jonathan Corbet
Next page: Brief items>>


Copyright © 2018, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds