The slab and protected-memory allocators
Slab allocators
The kernel's slab allocator is charged with allocating (usually) small chunks of memory for the rest of the kernel; it sits behind interfaces like kmalloc() and kmem_cache_alloc(). Vlastimil Babka led a session to discuss a couple of issues that have come up in the slab allocators. The first of those has to do with reclaimable slabs, which are used to allocate kernel objects that can be freed on request to defragment memory.
The kernel's dentry cache (which caches the results of filesystem lookups)
is allocated from a reclaimable slab. But when a specific dentry refers to
a particularly long name, that name won't fit into the dentry structure
itself and must be allocated separately. That allocation, done with
kmalloc(), is not directly reclaimable. In theory that is not a huge
problem, since a call to the dentry shrinker will reclaim both pieces of
memory. But the kernel's accounting of how much memory is truly
reclaimable is thrown off by this allocation pattern; the kernel thinks
there is less reclaimable memory than there really is and goes needlessly
into the out-of-memory state.
A solution can be found in this patch set from Roman Gushchin. It creates a new counter (nr_indirectly_reclaimable) to track the memory used by objects that can be freed by shrinking a different object. Babka is not entirely happy with the patch set, though. The name of the counter forces users to be concerned with "indirectly" reclaimable memory, which they shouldn't have to do. It is an ad hoc solution, he said, that should not become a part of the kernel ABI.
A better solution, he said, would be to make a separate set of reclaimable slabs for those kmalloc() calls. That would keep the reclaimable objects together, which would be better from a fragmentation point of view. Memory would be allocated from these slabs by providing the GFP_RECLAIMABLE flag. Babka asked the group whether he should pursue this idea, and got a positive response.
That leaves open the problem of this new counter, though, which has been merged for 4.17. Michal Hocko suggested simply reverting the patch; this accounting has been broken for years, and can stay that way a little longer. But others questioned whether it was really an ABI issue at all; Johannes Weiner said that counters have been removed before without ill effect.
Babka's other topic was the provision of slab caches for objects that are larger than one page, but whose size is not a power of two. Such objects are not handled efficiently now; a request to allocate a 640KB object, for example, will return a 1MB region, which is wasteful. The memory waste can be addressed by using alloc_pages_exact(), but that adds complexity and can cause memory fragmentation.
Instead, he suggested, it would be useful if the slab allocators could use larger (2MB) blocks of memory for these objects. That would reduce the amount of internal fragmentation considerably. It was generally agreed that this could be done, but there would need to be some changes to some of the heuristics that are used. Generally, code allocating these objects has a fallback path should the allocation fail, so the allocator itself should not fall back to smaller regions should the 2MB allocation fail. But the GFP_NORETRY flag can reduce the chances of that 2MB allocation succeeding in the first place, so that isn't the solution either.
As the session came to an end, Christoph Lameter pointed out that there is a min_order parameter that can be used to force the use of larger slabs now. It significantly increases performance, but it also applies to all slabs, which is probably not wanted on most systems. The solution would be to turn it into a per-slab parameter, he said.
Protectable memory
Igor Stoppa's protectable memory patch set was examined on LWN in March. He ran an LSFMM session to make the case for this functionality and to get some feedback from the developers. Protectable memory can be made read-only once it has been initialized, making it harder for an attacker to change it as part of a system compromise. It is meant to solve problems like those found on Android systems, where users install questionable apps, some of which may try to exploit a kernel vulnerability to change important data in the kernel.
The usual sequence for this sort of attack, he said, is to start by taking
over an existing app, perhaps via a phishing attack. Then a kernel
vulnerability is used to gain write access to some kernel data. The
attacker must locate that data, which means defeating kernel address-space
layout randomization, but there are usually leaks that can be used for that
purpose. Once the attacker is able to make changes, the first order of
business is to disable SELinux, after which it becomes possible to escalate
to unconstrained root access in user space.
It is hard to close off all of the vulnerabilities that an attacker might try to use, Stoppa said. He can't fix user space or the various out-of-tree drivers that ship on such devices. But he might be able to prevent the disabling of SELinux. Most attacks on SELinux try to make changes to (or disable entirely) the policy database; making that data read-only should raise the bar considerably.
The existing mechanisms for creating read-only data in the kernel are not up to this task, though. Data can be marked read-only at the end of kernel initialization, but that is too soon for SELinux, since the policy must come from user space. It would be possible to use vmalloc() to allocate this database and change the page protections, but this approach would create a lot of fragmentation and TLB contention. So he created a new pmalloc() interface instead.
Dave Hansen asked for performance numbers relative to using vmalloc(). Stoppa does not have those numbers now; Hansen requested that he run some tests and provide them. Since this is a performance-oriented patch, the actual performance gain needs to be demonstrated.
A recent addition to pmalloc() is the "rare write" mechanism for cases when the data must be made writable again for a short period. That involves creating a new pool type for modifiable data. When modification happens, the new data is mapped into a different location, hopefully making it harder for an attacker to find.
Hugh Dickins asked about changing protected memory via the kernel's direct mapping. This mapping remains writable and, since it uses huge pages, it is hard to change the protections for individual (small) pages. Stoppa agreed that the direct mapping is a potential problem; it might be fixable on the x86 architecture, but not on ARM. Dickins responded that, in that case, one might as well just use kmalloc(). But Hansen disagreed, saying that it can be hard to find a specific object's location in the direct mapping, so there is some benefit to using pmalloc(). But he was unsure about how big that benefit is, and would like to hear what the security developers think.
This patch set has been through three rewrites so far. One problem is that
these patches add the mechanism but not do not add any users of it, which
makes merging harder. The problem here, Stoppa said, is that it is hard to
find simple use cases. Getting the more complex users (such as SELinux) is
hard without the API in the kernel, but getting the API merged is difficult
without the users.
| Index entries for this article | |
|---|---|
| Kernel | Memory management/Slab allocators |
| Kernel | Security/Kernel hardening |
| Conference | Storage, Filesystem, and Memory-Management Summit/2018 |
