
Kernel development

Brief items

Kernel release status

The current development kernel is 4.11-rc3, released on March 19. "As is our usual pattern after the merge window, rc3 is larger than rc2, but this is hopefully the point where things start to shrink and calm down."

Stable updates: 4.10.4, 4.9.16, and 4.4.55 were released on March 18, followed by 4.10.5, 4.9.17, and 4.4.56 on March 22.


A new bcachefs release

Kent Overstreet has announced a new major release of his bcachefs filesystem. Changes in this release include whole-filesystem encryption, backup superblocks, better multiple-device support, a user-space filesystem checker, and more. "We can also now migrate filesystems to bcachefs in place! The bcache migrate command takes an existing filesystem, fallocates a big file in it, creates a new filesystem (in userspace) on the block device but using only the space reserved by that file it fallocated - and then walks the contents of the original filesystem creating pointers to all your existing data." There is an on-disk format change, but there's a chance it's the last one.


Gregg: perf sched for Linux CPU scheduler analysis

Brendan Gregg shows how to do scheduler profiling with the perf sched command. "perf sched timehist was added in Linux 4.10, and shows the scheduler latency by event, including the time the task was waiting to be woken up (wait time) and the scheduler latency after wakeup to running (sch delay). It's the scheduler latency that we're more interested in tuning."


Kernel development news

The 2017 Linux Storage, Filesystem, and Memory-Management Summit

By Jonathan Corbet
March 21, 2017

LSFMM 2017
The 2017 Linux Storage, Filesystem, and Memory-Management Summit was held March 20 and 21 in Cambridge, MA. Around 100 kernel developers gathered to discuss many topics of interest in what has traditionally been one of the most intensely technical events in this community. LWN was fortunate enough to be there and able to write the following reports.

As usual, the summit was held in three tracks, with those tracks joining back together when issues of interest to more than one group were to be discussed.

Plenary topics

The sessions that were attended by the entire group were:

Memory-management sessions

The memory-management developers discussed a number of topics in a smaller setting; these include:

  • HMM and CDM: continuing the discussion of heterogeneous memory management and throwing in the complication of coherent device memory nodes.

  • Slab reclaim: preventing slab-allocated objects from pinning down memory that the kernel would like to put to other uses.

  • Proactive compaction: making sure that higher-order pages are available when the kernel needs them.

  • The next steps for swap. Now that swapping is becoming interesting again, how do we make it perform better?

  • Fast memory allocation for networking: how the memory-management subsystem can help the network stack scale to mind-bogglingly large packet rates.

  • Cpusets and memory policies and the confusing things that can happen when the two are mixed.

  • Supporting shared TLB contexts: what's the best way to support a SPARC processor feature that can improve the performance of some applications?

  • Next steps for userfaultfd(): now that we can handle page faults in user space, what other capabilities would be nice to have?

  • Memory-management patch review: why are MM patches not getting enough review, and what can be done about this problem?

The filesystem track

Topics discussed in the filesystem track include:

Storage and filesystem combined sessions

The storage and filesystem tracks combined for a handful of sessions of interest to both groups:

Filesystem and memory-management session

There was exactly one session that crossed the filesystem and memory-management tracks:

The storage track

LWN's staff at the event, consisting of two people, found it strangely difficult to cover three simultaneous sessions. As a result, there is only one report from the storage track; we hope the storage group will accept our apologies.


[Many thanks to the Linux Foundation for sponsoring LWN's travel to this event.]


ZONE_DEVICE and the future of struct page

By Jonathan Corbet
March 21, 2017

LSFMM 2017
The opening session of the 2017 Linux Storage, Filesystem, and Memory-Management Summit covered a familiar topic: how to represent (possibly massive) persistent-memory arrays to various subsystems in the kernel. This session, led by Dan Williams, focused in particular on the ZONE_DEVICE abstraction and whether the kernel should use page structures to represent persistent memory or not.

ZONE_DEVICE is tied into the memory allocator's zone system (which segregates memory based on attributes like NUMA node or DMA reachability), but in a special way. It was created to satisfy the need to perform DMA operations on persistent memory; these operations require page structures to set up the mappings. ZONE_DEVICE is, he said, essentially the top half of the memory hotplug mechanism; it performs the memory setup, but does not actually put the pages online for general use. So memory located in ZONE_DEVICE cannot be allocated in the normal ways, pages cannot be migrated into that space, etc. But it is possible to get a page structure for memory in that zone.

Over the past few years, as the development community looked at the implications of large persistent-memory arrays, developers were concerned about the cost of using page structures — 64 bytes for every 4KB memory page. That usage seemed wasteful, so some significant effort went into trying to avoid using page structures altogether; instead, it was thought that the management of persistent memory could be done entirely with page-frame numbers (PFNs). The pfn_t type, along with a bunch of supporting structure, was added toward that goal, and developers tried to convert the entire DMA API to use PFNs. But then they ran into the SPARC64 architecture, which cannot create DMA mappings without using page structures. The pfn_t effort, Williams said, died there.

Now, he said, perhaps the time has come to stop trying to avoid struct page. If, instead, we let drivers assume that page structures will be available, we'll pay the memory-use cost in systems with terabytes of persistent memory, but we'll avoid dealing with a lot of custom driver code with inconsistent behavior. That would solve the DMA problem, but that's probably the easiest of the problems in this area; struct page tends to pop up in a lot of places.

Matthew Wilcox observed that, in truth, few drivers really care about struct page itself; it really just serves as a convenient handle for referring to physical memory. He suggested that it might make sense to go back and take a hard look at why SPARC is stuck with using page structures; Williams said it had to do with the management of cache aliasing state. James Bottomley suggested that there may be other ways to solve this problem, such as using a separate array to hold aliasing information. It would just be a matter of persuading SPARC maintainer Dave Miller.

If that persuasion could be accomplished, then pfn_t could be used nearly everywhere and there would be less need to worry about the availability of page structures. A remaining problem might be drivers that need to reach directly into DMA buffers but, Wilcox said, they should just use ioremap() to get a usable address to work with.

One of the big motivations for avoiding struct page with persistent-memory arrays is that these structures can end up filling a large portion of the system's ordinary memory. The way to avoid that, of course, is to allocate the structures in the persistent memory itself; Wilcox said that, whenever new memory is added to the system, its associated page structures should always be located in that new memory. The problem is that struct page can be a heavily used structure, so there is value in having the ability to control its placement.

One possible solution to the memory-use problem is to allow page structures to refer to larger pages — 2MB huge pages, for example. The problem here is that making the size variable would add overhead to some of the hottest code paths in the kernel. There would be CPU-time savings in some areas, since the number of page structures to be managed would be reduced considerably, but there are doubts that the savings would make up for the higher costs in places like the page allocator.

Another option, Williams said, is to allocate page structures dynamically when they are needed. A persistent-memory array can be terabytes in size, but page structures may only be needed for a small portion of it. If allocation of page structures can be made cheap, it would make sense to only bring them into existence when the need arises.

The conversation wound down in a wandering manner. Bottomley suggested using radix trees to track ranges of memory instead. Kirill Shutemov pointed out that different kinds of information are needed for different page sizes; in the case of transparent huge pages, it may be necessary to refer to a 4KB page as both a single page and a component of a huge page. Rik van Riel said that page structures are only really an issue for dynamic RAM; they can be dispensed with for persistent memory, since filesystems can be counted on to free memory when it's no longer in use. Bottomley replied that this approach is possible, but nobody has been willing to implement it so far, leading Williams to observe that the group would be talking about the same problem again next year.


Unaddressable device memory

By Jake Edge
March 22, 2017

LSFMM 2017

In a morning plenary session on the first day of the 2017 Linux Storage, Filesystem, and Memory-Management Summit, Jérôme Glisse led a discussion on memory that cannot be addressed by the CPU because it lives in devices like GPUs or FPGAs. There is often a substantial pile of memory on these devices and it can be accessed much more quickly by the devices than the system RAM can be. Making it easier for user-space programmers to use that memory transparently is the goal of the heterogeneous memory management (HMM) patches that Glisse has been working on.

GPUs and FPGAs can have sizable chunks of memory (8-64GB currently, he said) onboard; currently that memory is accessed by using mmap() on the device file. That was fine for graphics (e.g. OpenGL) and for the first version of the OpenCL heterogeneous processing framework, but it does result in a split address space. The devices can access both the system memory and their own local memory, but the CPU cannot really use the device memory because it is not cache-coherent and does not support atomic operations.
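Schematically, the user-space side of that access looks like an ordinary mmap() call on the device node; in real drivers the mapping offset typically comes from a driver-specific ioctl(), which this sketch (with an invented device path) glosses over:

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Map a chunk of device-local memory into the process's address space.
     * "/dev/mydev" and the zero offset are illustrative only. */
    static void *map_device_memory(size_t len)
    {
        int fd = open("/dev/mydev", O_RDWR);
        void *mem;

        if (fd < 0)
            return NULL;
        mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        return mem == MAP_FAILED ? NULL : mem;
    }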


The programs written to run on these devices use malloc() for their data structures; they need to pin that system memory so that the GPU or FPGA can access them. Programs can, instead, replicate their data structures into the device's memory, but it is cumbersome and bug prone to do so. What is needed is a shared address space where system memory can be migrated to the device memory transparently, he said.

This shared address space is becoming an industry standard; Windows can do it and C++ standards require it in order to use device memory. OpenCL version 2.0 and beyond need it as does the CUDA parallel programming framework. Handling the memory migration transparently allows programmers to write code that does not have to be aware of whether it is running on the GPU or not.

Today, GPU memory bandwidth is 1TB per second, while PCIe is 32GB per second and system memory can be accessed at 80GB per second (with four-channel DDR memory). The GPU can access its data much more quickly, so the CPU becomes the bottleneck for memory access.

There is the possibility of a hardware solution, but no common hardware today can present the device memory as regular memory that is cache-coherent and supports atomic operations from both the device and the CPU. So a software solution is needed, he said.

HMM is using the ZONE_DEVICE allocation zone type but in a different way than was presented in the previous plenary. The device memory is tagged as ZONE_DEVICE, but system memory is allowed to migrate there. From the CPU perspective it is like swapping the memory to disk; if it needs to access the page, the CPU will get a page fault and has to migrate the memory back from the device.

Glisse has two possible solutions in mind to provide the shared address space. The first would protect system memory pages so that they cannot be read or written. The pages would then be duplicated on the device and all reads and writes from the CPU would be intercepted.

Someone asked about the size of the migrations, noting that doing 4KB at a time would not work well. Glisse said that in a typical situation it would be at least a few megabytes and that common use cases would migrate 10GB to the device. The GPU would crunch on that data and then migrate back the results. In between, there would normally be no access to that memory from the CPU. It is definitely dependent on the workload but avoiding migration of 4KB pages one at a time is important.

There is a potential for pages ping-ponging between CPU memory and device memory, which also needs to be avoided. He is aware of that problem and believes that in the long run drivers will track that kind of access and not migrate pages to the device if the CPU accesses them. Mel Gorman noted that while there is the potential for memory to ping-pong between the two, it is an application bug if it happens.

Dave Hansen pointed out that malloc() will allocate memory for things that should be migrated together with other data that should not be. Glisse acknowledged that, but there will be some pages that simply will not migrate to the device, which is not a big problem since they still can be accessed by the device from the system memory. There are also some tools available to help do the right kind of allocations to avoid the problem.

One problem with mirroring data on the device is that the memory is duplicated. A different idea would be to migrate the memory completely to ZONE_DEVICE pages on the device that are not accessible to the CPU at all. Any read or write to the memory from the CPU will trigger a migration back. That requires catching any kind of read or write, including from system calls or direct I/O.

Al Viro pointed out that the system call or direct I/O level is not the right level to intercept the operations that require a migration back to the CPU. Rather the interception should be done at calls to get_user() and put_user(). If you catch all accesses from user space, that will ensure that all of the reads and writes to the memory are handled, he said.

Dan Williams was concerned that Glisse is using ZONE_DEVICE differently than it is being used for persistent-memory arrays. He wondered how the two uses could be differentiated. His patches are able to distinguish between the two uses, Glisse said.

Glisse then moved into how to protect a page from reads and writes. He suggested that it could be done by making what the kernel shared memory (KSM) feature is doing "more generic". For pages that need to be protected, the page->mapping entry would be replaced with a pointer to a protection structure and all dereferences of page->mapping would be wrapped with a helper function.

He has patches to make that change that were generated using Coccinelle, so they are simply mechanical changes. It is good to reuse the existing KSM mechanism, he said, but wondered if there would be impacts elsewhere from the changes. Someone from the audience said that the wrapper would be useful for other reasons as well, however. Glisse said he would push those patches soon to make it available for those other uses.
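As a rough illustration of that idea (the names and the tag bit below are invented; the real patches reuse KSM's machinery), such a helper might look like:

    #include <linux/mm_types.h>

    /* Hypothetical illustration of the wrapper being discussed. */
    #define PAGE_MAPPING_PROTECTED 0x4UL    /* invented tag bit */

    static inline struct address_space *page_mapping_wrapped(struct page *page)
    {
        unsigned long m = (unsigned long)READ_ONCE(page->mapping);

        /* A protected page stores a pointer to a protection structure in
         * page->mapping; callers must not treat it as an address_space. */
        if (m & PAGE_MAPPING_PROTECTED)
            return NULL;
        return (struct address_space *)m;
    }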

There is a need to do writeback from the device memory pages and Glisse suggested using the ISA block queue bounce buffer code "from the 90s" that is still in the kernel. But James Bottomley pointed out that bounce buffers are used for more than just ISA devices; unaligned struct bio writes also use them. The upshot is that code that is being used now will definitely work for what Glisse is trying to do. Glisse would also like to give the system administrator some sort of knob to control writeback from device memory, so that those who want to squeeze out every last bit of performance (at the risk of filesystem integrity) can do so.

An audience member suggested that kernel developers were coming to a "shared agreement" about how the kernel should refer to memory that is not directly addressable. That kind of memory would get page structures, but it would not be directly accessible from the kernel. There seemed to be general assent to that idea.

There are a number of corner cases that Williams is concerned about and he is still not convinced that reusing ZONE_DEVICE for HMM makes sense. Glisse said he could add a new zone type if needed. Gorman is not particularly concerned about the corner cases, for now; Glisse agreed, saying that if those cases become problems, they can be looked at then.

There are no upstream users of the HMM feature, but Glisse expects that the Nouveau driver will use it, as will AMD GPU drivers and various FPGA drivers down the road. Users will need to be aware that the system memory will need to be larger than all of the device memory available (or used); otherwise the system could livelock when migrating results back from the device memory, Gorman said. Glisse agreed that it could be a problem, but the patches will be useful to see how big of a problem it is in practice.

Gorman thinks the focus should be on getting HMM into the kernel; interactions with filesystems (i.e. writeback) can be handled down the road. Glisse agreed, saying that he is working on getting HMM into the kernel as it is today, but wanted to discuss some of the other issues to try to make sure that the ideas he has for dealing with filesystem interaction are not way off base.


HMM and CDM

By Jonathan Corbet
March 22, 2017

LSFMM 2017
The first topic in the memory-management track of the 2017 Linux Storage, Filesystem, and Memory-Management Summit continued a discussion begun during the preceding plenary session, which had introduced the heterogeneous memory management (HMM) issue and updated the group on the status of those patches. In this session, Jérôme Glisse focused the discussion on what needs to be done to move this work forward. Balbir Singh then followed up with a different approach to the HMM problem where more hardware support is available.

Pushing HMM forward

The HMM discussion started with a question from the group: what features does a device like a GPU need to be able to support HMM? The answer is that it needs some sort of a page-table structure that can be used to set the access permissions on each page of memory. That enables, for example, execute permissions to be set on either the CPU or the GPU (but not on both), depending on which kind of code is found in the relevant pages. HMM also needs to be able to prevent simultaneous writes by both processors, so the GPU needs to be able to handle faults.

Dave Hansen asked whether more was needed than an I/O memory-management unit (IOMMU) could provide. Glisse responded that the IOMMU is there primarily to protect the system from I/O devices, which is a different use case. Mel Gorman added that HMM needs to be able to trap write faults on a specific page and provide different protections on each side — things an IOMMU cannot do.

There is work underway to use the KSM mechanism to do write protection; those patches will be posted soon. KSM allows the same page to be mapped into multiple address spaces, a feature which will be useful, especially on systems with multiple GPUs, all of which need access to the same data.

Andrea Arcangeli started a discussion on the handling of write faults. Normally such faults on shared pages lead to copy-on-write (COW) operations, but the situation here is different; the response in the HMM setting is to ensure that the writing side has exclusive access to the memory in question. Gorman raised some worries about the semantics of writes after fork() calls by processes using HMM. fork() works by marking writable pages for COW, but it's not clear what should happen if the pages are COW-mapped into both the parent and child; a write fault could end up giving ownership of the page to one process or the other in a timing-dependent (i.e. racy) manner.

To avoid this eventuality, Gorman suggested that all memory used with HMM should be marked as MADV_DONTFORK using the madvise() system call; that would cause that memory to not be made available to the child of a fork() at all. Indeed, he said that it should be mandatory. He relented a bit, though, after it was explained that all HMM memory is pulled into the parent process at fork() time, with none left in the GPU. He was willing to accept the situation as long as it is clear that the HMM memory is associated with the parent and is not visible to the child.
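For reference, marking memory that way is a single madvise() call from user space; a minimal sketch of the allocation side, assuming an ordinary anonymous mapping, might be:

    #include <stddef.h>
    #include <sys/mman.h>

    /* Allocate a buffer destined for device (HMM) use and keep it from
     * being mapped into any child created by a later fork(). */
    static void *alloc_buffer_no_fork(size_t len)
    {
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED)
            return NULL;
        if (madvise(buf, len, MADV_DONTFORK) != 0) {
            munmap(buf, len);
            return NULL;
        }
        return buf;
    }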

With that resolved, Gorman asked if there were any remaining obstacles to merging. Hansen mentioned that HMM will not work with systems that already have the maximum amount of memory installed; there simply is no physical address space for the GPU memory. Gorman replied that this problem will indeed come up in practice and users will be burned by it, but that it is a limitation of the hardware and is not a reason to block the merging of the HMM patches.

Dan Williams expressed a concern that the HMM patches place GPU memory into the ZONE_DEVICE zone, which is also used for persistent memory. The two uses are distinct and can get along, but the code around ZONE_DEVICE becomes that much easier to break if a developer making a change doesn't understand all of the users. Gorman suggested that Williams should do a detailed review of the HMM code from a ZONE_DEVICE perspective; the long-term maintainability of this code is a fundamental issue, he said, and needs to be considered carefully. Johannes Weiner suggested, in jest, that ZONE_HIGHMEM could be used instead, to which Gorman told him to "go home".

The final concern is the lack of any drivers for the HMM code; if it is merged in its current form, it will be dead code with no users. There is, it seems, some hope that a Nouveau-based driver for NVIDIA GPUs will be available by the time the 4.12 merge window opens. Gorman suggested to Andrew Morton that the HMM code could be kept in the -mm tree until at least one driver becomes available, but Morton asked whether it would really be a problem for the code to go upstream as-is. What he would most like, instead, is a solid explanation of what the code is for so he can justify it to Linus Torvalds when the time comes.

The end result is that HMM has a few hurdles to get over still, but its path into the mainline is beginning to look a little more clear.

Coherent device memory nodes

Singh then stepped forward to describe the IBM view of HMM; for IBM, the problem has been mostly solved in hardware. On suitably equipped systems, the device memory shows up as if it were on its own NUMA node that happens to lack a CPU. That memory is entirely cache-coherent with the rest of the system, though. There is a patch series under development to support these "coherent device memory nodes" (or CDMs) on Linux.

There are still a number of questions about how such hardware should work with Linux. The desire is to provide selective memory allocation: user-space applications could choose whether memory should be allocated in normal or CDM memory. Reclaim needs to be handled carefully, though, since the kernel may not have a full view of how the memory on the CDM is being used. For obvious reasons, normal NUMA balancing needs to be disabled, or pages will be moved into and out of the CDM incorrectly. When migrations are desired, they should be accelerated using DMA engines.

The plan for CDM memory is to allocate it on the CPU, but then to run software using that memory on the CDM's processor. The device is able to access its own memory and normal system memory transparently via pointers. The hope is to migrate memory to the most appropriate node based on the observed usage patterns. Hansen noted that the NUMA balancing code in the kernel works fairly well, but most people still turn it off; will there really be a call for it in this setting? Singh responded that it can make a big difference; hints from the application can also help.

Thus far, the patches include support for isolating the CDMs using the cpuset mechanism. But the system doesn't have enough information to do memory balancing properly yet. The zone lists have been split to separate out CDM memory; that serves to hide it from most of the system and avoid confusion with regular memory. At the end of the session, transparent hugepage migration was raised as another missing piece, but that topic was deferred for later discussion.


Slab reclaim

By Jonathan Corbet
March 22, 2017

LSFMM 2017
"Reclaim" is the process of finding memory in the system that is not in immediate use and can be recovered for other uses. Michal Hocko started this 2017 Linux Storage, Filesystem, and Memory-Management Summit session by noting that the reclaiming of objects obtained from the slab allocator is far from perfect in current kernels. Along with Christoph Lameter, he explored options for improving that situation.

Slab reclaim is handled with shrinker callbacks; when the system needs more memory, the shrinkers are called and asked to free some objects. The hope is that the shrinkers will manage to do something useful, but that is not what really happens. The biggest problem is that there is no connection between the pages the kernel would like to free and the objects that have been allocated from those pages. All objects in a page must be freed to make the page itself freeable, but there is no way to focus shrinker activity on the objects in a specific page.
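For readers who have not looked at this interface, a shrinker is registered as a pair of callbacks; the following minimal sketch (with an invented cache behind it) shows its general shape:

    #include <linux/shrinker.h>

    /* my_cache_nr_objects() and my_cache_free_some() are stand-ins for a
     * real cache's own accounting and freeing logic. */
    static unsigned long my_cache_count(struct shrinker *s,
                                        struct shrink_control *sc)
    {
        return my_cache_nr_objects();   /* objects that could be freed */
    }

    static unsigned long my_cache_scan(struct shrinker *s,
                                       struct shrink_control *sc)
    {
        /* Free up to sc->nr_to_scan objects and report how many went away;
         * note there is no way to target the objects in a specific page. */
        return my_cache_free_some(sc->nr_to_scan);
    }

    static struct shrinker my_cache_shrinker = {
        .count_objects = my_cache_count,
        .scan_objects  = my_cache_scan,
        .seeks         = DEFAULT_SEEKS,
    };

    /* register_shrinker(&my_cache_shrinker); at initialization time */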

Some years ago, Hocko said, Dave Chinner had come up with an interesting idea: rely on the slab allocators more for reclaim. If the allocators kept objects in least-recently-used (LRU) lists, they could perhaps reclaim those objects in a more useful way. But nobody cared enough to implement that suggestion, so it remains just an idea.


Lameter then talked about a different approach that he has been working on for some time. It involves adding a couple of callbacks to the slab allocator which could be used to ask a subsystem to relocate objects that are in the way of freeing a page. The first of those would be something like:

    void *isolate_object(struct kmem_cache *cache, void **objs,
    			 int nr, int node);

This function should prepare to relocate the objects found in **objs; it should, among other things, ensure that all of the objects are stable and will not be changed until the operation is complete. Once that is done, the second callback will be invoked:

    void migrate_objects(struct kmem_cache *cache, void **objs,
    			 int nr, int node, void *private);

This callback should try to move the given objs to a new location; it can also simply free them if that is the better course. Once it's done, if all the objects in a given slab page have been moved, the page itself can be freed.
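To make the shape of the interface concrete, a cache using it might supply callbacks along these lines (a hypothetical sketch; the object-handling helpers are invented):

    /* my_obj_get(), my_obj_move(), and my_obj_put() are invented helpers
     * standing in for a real subsystem's object management. */
    static void *my_isolate(struct kmem_cache *cache, void **objs,
                            int nr, int node)
    {
        int i;

        /* Pin the objects so they stay stable until migration completes. */
        for (i = 0; i < nr; i++)
            my_obj_get(objs[i]);
        return NULL;    /* private data handed to the migrate callback */
    }

    static void my_migrate(struct kmem_cache *cache, void **objs,
                           int nr, int node, void *private)
    {
        int i;

        /* Either relocate each object (fixing up pointers to it) or simply
         * free it; either way the old slab page can then be released. */
        for (i = 0; i < nr; i++) {
            my_obj_move(objs[i], node);
            my_obj_put(objs[i]);
        }
    }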

The first implementation of this mechanism was done in 2007. Perhaps, Lameter suggested, the time has come to merge it and start making use of it. As memory sizes get larger, he said, the need for better slab reclaim will only get more urgent.

Andrea Arcangeli suggested a different approach: simply allocate slab objects from virtually mapped pages. Then, if the page needs to be relocated, it is simply a matter of changing the mapping in the page table. This would enable easy movement of slab-allocated objects between nodes while completely avoiding the need to track pointers to the objects themselves. That avoids what was described as the main downside of Lameter's scheme: the need to add mobility to each type of slab-allocated object.

The problem with this approach, as Rik van Riel pointed out, is that it is not useful if the objective is to move slab objects to defragment pages. That might be the most important use case, he said; he has seen many systems out there with a lot of memory tied up in slabs that are 95% empty. Arcangeli responded that there are three uses for this sort of mechanism: memory hotplug, compaction, and out-of-memory avoidance, in that order. His virtually-mapped idea addresses the most important of the three, he said, and can even work with objects allocated with kmalloc(), which are otherwise problematic.

The session came to an end at this point without having reached any real decisions. This conversation will need to continue on the mailing lists, presumably in the presence of specific patches to discuss.


Proactive compaction

By Jonathan Corbet
March 21, 2017

LSFMM 2017
One of the goals of memory compaction, Vlastimil Babka said at the beginning of his memory-management track session at the 2017 Linux Storage, Filesystem, and Memory-Management Summit, is to make higher-order pages available. These pages — suitably aligned groups of physically contiguous pages — have many uses, including supporting the transparent huge pages feature. Compaction is mostly done on demand in the kernel now; he would like to do more of it ahead of time instead.

The scheme he has in mind involves doing compaction in the background, outside of the context of any specific process. The kswapd thread would be woken from the memory-allocation slow path, and would be expected to reclaim a certain number of single pages. It would then wake the separate kcompactd thread with a desired size for the higher-order pages. This thread would do compaction until a page of the desired order is available, or until the entire memory zone has been scanned. That may not be enough, though, since, at the end, it will have created only a single higher-order page.

He asked the crowd for ideas on how to make this scheme better. Michal Hocko suggested adding a configuration option; the administrator could set a watermark and a time period. The compaction thread would then check each period and try to ensure that the desired number of pages are available. But Babka objected that this behavior doesn't really seem like something that administrators can be expected to configure properly. They are focused on parameters like transparent huge page allocation rates or network throughput and will be hard put to translate that to desired numbers of free pages. It would be better, he said, to have the system tune itself if possible.

What would be the inputs to an auto-tuning solution? The first would be recent demand for pages of each order. Even better would be future demand, of course, but, in its absence, the best that can be done is to assume that future behavior will not differ too much from the recent past. It might also be desirable to track the importance of each request; transparent huge pages are an opportunistic optimization, while higher-order pages for the SLUB allocator can be hard to do without. The other useful input would be the success rate of recent compaction attempts; if compaction isn't working, there is no point in continuing to try it. Mel Gorman suggested also tracking the number of compaction requests that come in while the compaction itself is running.

Andrea Arcangeli pointed out that it will be necessary to protect large pages created by compaction from normal allocation requests. Otherwise, the kernel might work to put together a higher-order page, only to have it immediately broken up again in response to a single-page allocation. When compaction is done directly from an allocation request this problem does not arise, since the resulting large page would be used right away. The proactive approach is promising, he said, but the protection problem needs to be addressed for it to work.

The proactive compaction feature is a work in progress, Babka said; an RFC patch was sent out recently. It tries to track the number of allocations that would have succeeded with more kcompactd activity. Essentially, those are situations where there are enough free pages in the system, but they are too fragmented to use. The patches are not currently tracking the importance of allocation requests; perhaps the GFP flags could be used for that purpose. There is also no long-term averaging of demand. For now, it simply runs until there are enough high-order pages available.

One remaining problem is evaluating the value of this work. The existing artificial benchmarks, he said, are reaching their limits in this area.

Concerns were raised that background compaction might increase a system's power usage. Hocko said that this kind of worry was why he had suggested a configuration knob for this feature. Babka replied that power consumption should not be a big problem; compaction responds to actual demand on the system, so it should not be active when the system is otherwise mostly idle.

As the session came to a close, Arcangeli suggested that perhaps subsystems with large-page needs could register with the compaction code and indicate how many pages they would like to have available. Babka said that he would like to go as far as he can without the addition of any sort of tuning knobs, though. Johannes Weiner said there would be value in an on/off switch, since any sort of proactive work risks wasting resources in some environments. Any more tuning than that should be avoided, though, he said. It was generally agreed that this feature looked valuable, but that it should start as simple as possible with the idea that more complexity could be added later if needed.


The next steps for swap

By Jonathan Corbet
March 22, 2017

LSFMM 2017
Swapping has long been an unloved corner of the kernel's memory-management subsystem. As a general rule, the thinking went, if a system starts swapping the performance battle has already been lost, so there is little reason to try to optimize the performance of swapping itself. The growth of fast solid-state storage devices is changing that calculation, though, making swapping interesting again. At the 2017 Linux Storage, Filesystem, and Memory-Management Summit, Tim Chen led a session in the memory-management track that looked at the ways that swapping performance can be improved.

Chen has been working on swapping performance for a while; the first set of swap scalability patches has already been merged. His next priority is improving swap readahead performance. This mechanism, which tries to read pages from swap ahead of an anticipated need for them, currently reads pages back in the order in which they were swapped out. This, he noted, is not necessarily the best order and, with mixed access patterns, performance can be poor.

The recently submitted VMA-based swap readahead patches try to improve readahead performance by watching the swap-in behavior of each virtual memory area (VMA). If it appears that memory is being accessed in a serial fashion, the readahead window is increased in the hope of bringing in more pages before they are needed. For random patterns, instead, readahead has little value, so the window is reduced.
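The effect can be pictured with a small, purely illustrative heuristic; this is not the submitted code, just a sketch of the window adjustment it describes:

    #include <linux/kernel.h>

    /* Illustrative only: widen the per-VMA readahead window on sequential
     * swap-in faults, shrink it on random ones. */
    #define SWAP_RA_MIN_WIN  1U
    #define SWAP_RA_MAX_WIN  64U

    struct vma_swap_ra {
        unsigned long last_offset;   /* swap offset of the previous fault */
        unsigned int  window;        /* pages to read ahead next time */
    };

    static void vma_swap_ra_update(struct vma_swap_ra *ra, unsigned long offset)
    {
        if (offset == ra->last_offset + 1)
            ra->window = min_t(unsigned int, ra->window * 2, SWAP_RA_MAX_WIN);
        else
            ra->window = max_t(unsigned int, ra->window / 2, SWAP_RA_MIN_WIN);
        ra->last_offset = offset;
    }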

Rik van Riel noted that the current readahead algorithm was designed for rotational media and asked how well the VMA-based mechanism works on such devices. Chen, with visible embarrassment, said that this hasn't been tried. Van Riel added that, with rotational devices, a group of adjacent blocks can be read as quickly as a single block can, so it makes sense to speculatively read extra data. The same is not true for solid-state storage. So, he suggested, it might make sense for the readahead code to see which type of device is hosting the swap space and change its behavior accordingly.

Matthew Wilcox, instead, said that the real problem might be at swap-out time. Pages are swapped based on their position on the least-recently-used (LRU) lists, which may not reflect the order in which they will be needed again. He said that, perhaps, writes to swap could be buffered; swapped pages would go into a "victim cache" and sorted before being written to storage. The value of this approach wasn't clear to everybody in the room, though, given that access patterns can change over time.

The next subject was the swapping of transparent huge pages. Currently, the first step is to split those pages into their component single pages, then to write those to swap individually — not the most efficient way to go about things. Chen and company would like to improve this behavior in a few steps, the first of which is to delay the splitting of the page until space has been allocated in the swap area. That should result in the allocation of a single cluster of pages for the entire huge page, at which point the whole thing can be written in a single operation. Patches implementing this change have been submitted; they result in a 14% swap-out performance increase.

The next step is to delay the splitting of huge pages further, until the swap-out operation is finished. Those patches are in development; benchmarking shows that they result in a 37% improvement in swap-out performance.

Finally, it would be nice to be able to swap huge pages directly back in. This idea needs more thought, he said. It is not always a performance win; if the application only needs a couple of small pages of data, there is no point in bringing in the whole huge page. One possible heuristic could be to only swap in huge pages for memory regions marked with MADV_HUGEPAGE or which have a large readahead window.

There was a bit of discussion on how to justify the inclusion of these patches once they are ready. The best motivator is good benchmark results. It was suggested that Linus Torvalds is less likely to block the patches if they do not slow down kernel builds. Michal Hocko said that the patches were interesting, but that they were optimizing a rare event; the current code assumes that we don't ever want to swap. But Johannes Weiner said that the swap-out changes, at least, make a lot of sense; batching operations by keeping huge pages together will speed things up.

The next topic was the use of the DAX direct-access mechanism with swapped data. If swapping is done to a persistent memory array, the data can still be accessed directly without the need to read it back into RAM. There is "an almost-working prototype" that does this, Chen said. The hard part is deciding when it makes sense to bring pages back into RAM; memory that will be frequently accessed, especially if the accesses are writes, is better read back in.

Wilcox said that the decision really depends on the performance difference between dynamic RAM and persistent memory on the system in question; in some cases, the right answer might be "never". Sometimes, for example, the "persistent-memory array" is actually dynamic RAM hosted in a hypervisor. There was some talk of using the system's performance-monitoring unit (PMU) to track page accesses, but that idea didn't get far. Developers prefer that the kernel not take over the PMU, the runtime cost is high, and the results are not always all that useful.

After some discussion, the conclusion reached was that the kernel should just bring a random set of pages back into RAM occasionally. With luck, the frequently used pages will stay there, while the rest will age back out to swap.

Finally, there was a brief discussion of further optimizing the swap-device locking, which still sees significant contention even after the recent scalability improvements. So there is some interest in using lock elision toward this end.


Fast memory allocation for networking

By Jonathan Corbet
March 22, 2017

LSFMM 2017
At the 2016 Linux Storage, Filesystem, and Memory-Management Summit, Jesper Dangaard Brouer proposed the creation of a new set of memory-allocation APIs to meet the performance needs of the networking stack. In 2017, he returned to the LSFMM memory-management track to update the community on the work that has been done in that area — and what still needs to be accomplished.

Networking, he said, deals with "mind-boggling speeds"; a 10Gb Ethernet link can carry nearly 15 million packets per second. On current hardware, that gives the operating system only about 200 processor cycles to deal with each packet. The problem gets worse as link speeds increase, of course.
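Those figures follow from simple arithmetic, assuming minimum-sized 64-byte frames (plus the 20 bytes of preamble and inter-frame gap each frame occupies on the wire) and a processor clocked around 3GHz — the clock-rate assumption is ours:

    10Gb/s ÷ ((64 + 20) bytes × 8 bits/byte)  ≈ 14.9 million packets/second
    3GHz ÷ 14.9 million packets/second        ≈ 200 cycles/packet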

The main trick used in solving this problem is batching operations whenever possible. That is not a magic solution, he said; batching ten operations does not yield a 10x performance improvement. But it does help a lot and needs to be done. Furthermore, various kernel-bypass networking solutions show that processing packets at these rates is possible; they work using batching and special memory allocators. They also use techniques like polling, which wastes CPU time; he thinks that the kernel can do better than that.

One step that has been taken in that direction is the merging of the express data path (XDP) mechanism around the 4.9 development cycle. With XDP, it is possible to achieve full wire speeds in the kernel, but only if the memory-management layer is avoided. That means holding onto buffers, but also keeping them continually mapped for DMA operations. When that is done, a simple "drop every packet" application using XDP can handle 17 million packets per second, while an application that retransmits each packet through the same interface it arrived on can handle 10 million packets per second. These benchmarks may seem artificial, but they solve real-world problems: blocking denial-of-service attacks and load balancing. Facebook is currently using XDP for these tasks, he said.
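Such a drop-everything XDP program is genuinely tiny; a minimal version (not code from the talk) looks roughly like this, compiled to BPF with clang and attached to an interface:

    #include <linux/bpf.h>

    /* Section names vary with the loader being used; "xdp" is common. */
    __attribute__((section("xdp"), used))
    int xdp_drop_all(struct xdp_md *ctx)
    {
        return XDP_DROP;    /* discard every packet at the driver level */
    }

    __attribute__((section("license"), used))
    char _license[] = "GPL";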

What has not been done with XDP so far is real packet forwarding, because that requires interactions with memory management. The page allocator is simply too slow, so current drivers work by recycling the pages they have allocated. Every high-performance driver has implemented some variant of this technique, he said. It would be good to move some of this functionality into common code.

The general statement of the problem is that drivers want to get DMA-mapped pages and keep them around for multiple uses. The memory-management layer can help by providing faster per-CPU page caching (some work toward that goal was merged recently), but it still can't compete with simply recycling pages in the drivers. So he has another idea: create a per-device allocator for DMA-mapped pages with a limited cache. By keeping pages mapped for the device, this allocator could go a long way toward reducing memory-management costs.

Matthew Wilcox asked if the existing DMA pool API could be used for this purpose. The problem, Brouer said, is that DMA pools are oriented toward coherent DMA operations (where long-lived buffers are accessed by both the CPU and the device), while networking uses streaming DMA operations (short-lived buffers that can only be accessed by one side or the other at any given time).

What he really wants, Brouer continued, is to be able to provide a destructor callback that is invoked when a page's reference count drops to zero. That callback would be allowed to "steal" the page, keeping it available for use in the same driver. This callback mechanism actually exists now, but only for higher-order pages; bringing it to single pages would require finding room in the crowded page structure, which is not an easy task. Pages with destructors might also need a page flag to identify them, which is another problem; those flags are in short supply. There was some discussion of tricks that could be employed (such as placing a sentinel value in the mapping field) to shoehorn the needed information into struct page; it seems likely that some kind of solution could be found.
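The kind of per-device pool being proposed might look conceptually like the sketch below; every name in it is invented, since no such API exists in the mainline kernel:

    #include <linux/dma-mapping.h>
    #include <linux/gfp.h>
    #include <linux/list.h>

    /* Entirely hypothetical per-device cache of DMA-mapped pages. */
    struct dev_page_pool {
        struct device    *dev;
        unsigned int      count, max_cached;
        struct list_head  cache;    /* pages kept DMA-mapped for reuse */
    };

    /* Imagined destructor, run when a pooled page's refcount drops to zero:
     * recycle the still-mapped page instead of returning it to the allocator. */
    static void dev_page_pool_put(struct dev_page_pool *pool,
                                  struct page *page, dma_addr_t dma)
    {
        if (pool->count < pool->max_cached) {
            list_add(&page->lru, &pool->cache);    /* stays DMA-mapped */
            pool->count++;
        } else {
            dma_unmap_page(pool->dev, dma, PAGE_SIZE, DMA_FROM_DEVICE);
            __free_page(page);
        }
    }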

Brouer concluded with some benchmarks showing that the situation got better in the 4.11 kernel, thanks to the page-caching improvements done by Mel Gorman. But there is still a lot of overhead, much of which turns out to be in the maintenance of the zone statistics. These statistics are not needed for the operation of the memory-management subsystem itself, but it seems that quite a few users do make use of them to tune their systems. Gorman said that, when performance regresses, users typically report the problem within a release cycle or two, suggesting that they are indeed looking at the numbers.

So the statistics need to remain, but it may be possible to disable their collection on production systems. The statistics code could probably be shorted out with a static branch in settings where the statistics are not wanted. It was also deemed worthwhile to run the benchmarks with NUMA disabled to see if any benefit is to be found there.
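Shorting the statistics out with a static branch would look something like the following sketch; static keys are an existing kernel facility, but the statistics hook shown here is invented for illustration:

    #include <linux/jump_label.h>

    /* Statistics collection enabled by default; a sysctl or boot parameter
     * could flip the key off on production systems. */
    static DEFINE_STATIC_KEY_TRUE(vmstat_collect);

    static inline void count_vm_zone_event(struct zone *zone, int item)  /* invented */
    {
        /* With the key disabled, this reduces to a patched-out jump in the
         * allocator hot path rather than a load-increment-store. */
        if (static_branch_likely(&vmstat_collect))
            __count_vm_zone_event(zone, item);                           /* invented */
    }

    /* To disable at run time: static_branch_disable(&vmstat_collect); */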

At the end, Brouer asked whether there would be objections to a DMA-page pool mechanism. There were no immediate objections, but the developers in the room made it clear that they would want to see the patches before coming to any definite conclusions.


Cpusets and memory policies

By Jonathan Corbet
March 22, 2017

LSFMM 2017
"Cpusets" are a Linux kernel mechanism that enables control over which processors a given process is allowed to run on. The memory policy (or "mempolicy") mechanism, instead, gives control over how a process's memory is allocated across the nodes of a NUMA system. As Vlastimil Babka explained during the memory-management track of the 2017 Linux Storage, Filesystem, and Memory-Management Summit, these two mechanisms do not always play well together, with some surprising and unfortunate consequences.

Cpusets are an administrator-controlled mechanism; unprivileged processes cannot normally change their CPU assignments. Mempolicies, instead, are under the control of the processes themselves. If both mechanisms are used together, one might logically expect that memory would be allocated on the set of nodes defined by the intersection of the cpuset and the mempolicy. If that intersection is empty, "then there is space for creativity". But that is not what actually happens.

Babka put up a pair of slides showing what can happen. Imagine a process running on a four-node system; initially both the cpuset and the mempolicy are set to nodes zero and one. In this case, memory will be allocated from either of those two nodes, as one might expect. If the cpuset is changed to nodes one and two, the memory allocations will follow to those two nodes. But, if the cpuset is first reduced to a single node (node two in the example), then restored to the original zero and one, the result will be allocations from node zero only; the kernel will have lost track of the fact that the mempolicy called for both nodes to be used.

This problem was understood and addressed in the 2.6.26 kernel through the addition of a couple of flags to the set_mempolicy() system call. If the process sets its mempolicy with the MPOL_F_STATIC_NODES flag, that policy will not change when the cpuset is changed. MPOL_F_RELATIVE_NODES, instead, causes the policy to move along with cpuset changes while remembering the original policy, so it will never exhibit the single-node allocation behavior described above.
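For example, a process that wants its allocations bound to physical nodes zero and one, regardless of later cpuset changes, would do something like this (link with -lnuma):

    #include <numaif.h>

    /* Bind allocations to nodes 0 and 1 and keep that binding fixed even if
     * the process's cpuset is later changed. */
    static int bind_to_nodes_0_and_1(void)
    {
        unsigned long nodemask = (1UL << 0) | (1UL << 1);

        return set_mempolicy(MPOL_BIND | MPOL_F_STATIC_NODES,
                             &nodemask, 8 * sizeof(nodemask));
    }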

What happens if there is no intersection between the cpuset and the mempolicy, as can happen, especially, with MPOL_F_STATIC_NODES? The answer is that the kernel will allocate memory from the cpuset's nodes. Kirill Shutemov suggested that perhaps allocations should fail instead in that circumstance, but that was deemed to be unfriendly behavior and an ABI break as well. It is better to allocate memory on the wrong node than to kill an otherwise working program, especially if that program did work on older kernels. In general, it was agreed, the set_mempolicy() interface is broken, but it is going to be hard to fix now.

One serious problem with the current implementation is its behavior when the cpuset is being changed, forcing the mempolicy to be changed as well. There is a period of time during that change when an empty node list causes the kernel to conclude that it is out of memory. That can lead to spurious invocations of the out-of-memory killer, an outcome that tends to get a cold reception in the user community.

Fixing that problem seems necessary and urgent. The mempolicy updates associated with cpuset changes have to be maintained, since the alternative is an ABI break. For the static case, the solution is straightforward, since the set of nodes will not change. In the relative case, instead, the remapping will need to be done on the fly; it is a solvable problem but the solution looks complex. There may be no workable fix for the default case.

The discussion focused on the details of how a fix might work for the remainder of the session. It may involve moving the list of allowed memory zones back into the cpuset itself, which is how mempolicies were once implemented many years ago. The plan, as is so often the case, is to wait for a patch to appear and see how the solution looks at that time.


Stack and driver testing

By Jake Edge
March 22, 2017

LSFMM 2017

In a combined storage and filesystem session at the 2017 Linux Storage, Filesystem, and Memory-Management Summit, Chaitanya Kulkarni led a discussion on testing for drivers for storage devices and other parts of the storage stack. He wanted to come up with a framework for testing drivers that is coupled tightly to the kernel. Right now, there are test suites that are scattered about; he wanted to get some ideas and feedback on a unified test framework.

Hannes Reinecke asked what kind of testing Kulkarni was targeting: functionality, performance, or something else. Kulkarni said that functional testing was the first step, but then moving toward a more complex set of test cases would be the goal.


One area that is not being verified is that the hardware and the kernel are both implementing cache flushing correctly, James Bottomley said. Making sure that the driver gets this right is one of the hardest things to test. But Jens Axboe said there are tools like his Flexible I/O Tester (fio) that can be used to record what has been written to a device; cutting the power and then verifying that all of the writes recorded have actually made it to the device can help. Chris Mason noted that it takes a large number of systems to do these kinds of tests, however.

Kulkarni wondered if there should be a way to test the entire block layer whenever a new commit is made to it. He asked: is there a way to do so and is that kind of testing needed? Bottomley channeled Dave Chinner and said that if that kind of testing is to be done it should be added to xfstests. The reason to do so is that the filesystem developers run it regularly, which provides a "pool of willing guinea pigs". In addition, attendees suggested a number of different tests and/or test suites that could be run as part of the unified testing.

But Mason suggested splitting things up a bit. While the filesystem developers do run xfstests "all the time", they don't have the hardware that Kulkarni (who works for Western Digital) is interested in. Using the device mapper to create emulators for hardware of interest that could be used by filesystem developers running xfstests would help with that. The developers will run those tests, so then it is just a matter of making sure that the emulator accurately reflects what the hardware does.

Ted Ts'o agreed, noting that shingled magnetic recording (SMR) devices are an area that needs testing, but that it is not easy for filesystem developers to test with them. If there were a device mapper emulator that provided the write pointer and other pieces of the SMR functionality, that would allow xfstests (and other tests) to be run. That would go a long way toward shaking out the filesystem-SMR interactions. "We can worry about edge cases later."

Testing with a device mapper emulator is fine for filesystems, but things like the block I/O scheduler need to be tested as well, one attendee said. Axboe said that those kinds of tests would exercise the multiqueue scheduler, but there are still tests needed on the block layer side. It could be something similar to xfstests that the block developers would run and would test the block and storage layer pieces of the stack. It could include tests for different device types like SMR as well as tests for kernel features like hotplug. If someone builds the framework, "tests will come", Axboe said.

In fact, he volunteered to help put the framework together. Josef Bacik also recommended adding it to xfstests, which Axboe said he wouldn't object to. The xfstests framework already has much of what would be needed, Bacik said, though Bottomley cautioned that changes would need to be made so that a filesystem, which xfstests currently requires, is not needed for some portion of the test suite.

There is also the need to show what code paths are actually being tested, an audience member said, but another noted that there is gcov support in the kernel, which can be used to look at the test coverage.

An attendee was concerned that no hotplug testing is being done, but Bacik said that some of the Btrfs tests use it. He is worried that there is no support for device-level testing in xfstests, so that would need to be added. Ts'o wants to ensure that whatever tests get created can be run by developers without needing access to special hardware, so device mapper emulators will need to be created.

Bottomley said that once xfstests is modified so that it doesn't require a filesystem to be mounted as part of the test, some subsystem-specific tests could be added for SCSI, NVMe, RDMA, and others. Bacik agreed; xfstests works well now for everything until you get to the block layer. The takeaway from the discussion, he said, is to first add block-layer tests to xfstests without requiring a filesystem, then add the subsystem-specific tests.

Mason added that some tests that don't require persistence should be part of the effort; things like device availability and enumeration. As new bugs arise, tests detecting those problems should be added as well, Axboe said. He doesn't want to make assumptions about what a test case will look like. There might be tests for specific kinds of hardware and so on. Whoever does the work will get to choose the form of the tests that can be run from the framework, he said.

Kulkarni wondered if some of the storage-subsystem-specific tests could be shared between the pieces like SCSI, NVMe, and others. Bottomley said that may be possible, but it is important to keep the focus on the features that the kernel cares about. Various hardware devices make guarantees and provide features that Linux does not use, so there is no value to the kernel community in testing those parts. A coverage map will help guide where more testing needs to be done. Time ran out on the session, but it appears there was strong agreement about the right path to take.


Patches and updates

Kernel trees

Linus Torvalds Linux 4.11-rc3 Mar 19
Greg KH Linux 4.10.5 Mar 22
Greg KH Linux 4.10.4 Mar 18
Greg KH Linux 4.9.17 Mar 22
Greg KH Linux 4.9.16 Mar 18
Greg KH Linux 4.4.56 Mar 22
Greg KH Linux 4.4.55 Mar 18
Steven Rostedt 4.4.53-rt66 Mar 16
Julia Cartwright 4.1.39-rt47 Mar 17
Ben Hutchings Linux 3.16.42 Mar 16
Jiri Slaby Linux 3.12.72 Mar 17
Steven Rostedt 3.12.71-rt96 Mar 16
Ben Hutchings Linux 3.2.87 Mar 16

Architecture-specific

Madhavan Srinivasan IMC Instrumentation Support Mar 16

Build system

Core kernel code

Device drivers

Elaine Zhang rk808: Add RK805 support Mar 16
Andrey Smirnov GPCv2 power gating driver Mar 16
M'boumba Cedric Madianga Add support for the STM32F7 I2C Mar 17
Eric Anholt ARM CLCD DRM driver Mar 17
sean.wang@mediatek.com leds: add leds-mt6323 support on MT7623 SoC Mar 20
Andrey Smirnov i.MX7 PCI support Mar 21
matthew.gerlach@linux.intel.com Altera Partial Reconfiguration IP Mar 21
Sebastian Reichel Nokia H4+ support Mar 21
Ramiro Oliveira Add support for Omnivision OV5647 Mar 22
Yannick Fertre STM32 Independant watchdog Mar 22

Device driver infrastructure

Documentation

Filesystems and block I/O

David Howells ext4: Add statx support Mar 16

Memory management

Virtualization and containers

Miscellaneous
