
mm: reliable 1GB page allocation

From:  Rik van Riel <riel-AT-surriel.com>
To:  linux-kernel-AT-vger.kernel.org
Subject:  [RFC PATCH 00/40] mm: reliable 1GB page allocation
Date:  Wed, 20 May 2026 10:59:06 -0400
Message-ID:  <20260520150018.2491267-1-riel@surriel.com>
Cc:  kernel-team-AT-meta.com, linux-mm-AT-kvack.org, david-AT-kernel.org, willy-AT-infradead.org, surenb-AT-google.com, hannes-AT-cmpxchg.org, ljs-AT-kernel.org, ziy-AT-nvidia.com, usama.arif-AT-linux.dev, fvdl-AT-google.com
Archive-link:  Article


Some workloads see real performance benefits from using 1GB pages,
but allocating 1GB pages has often been limited to using hugetlb
pages that were set aside at boot time, or using CMA to keep a
fixed amount of system memory off limits to the kernel.

Neither of those is a great solution, given that modern servers
tend to be large and often run multiple workloads simultaneously,
each of which wants something different.

To address that issue, this patch series divides memory not just
into 2MB page blocks, but into PUD sized superpageblocks, and
aggressively tries to steer unmovable, reclaimable, and highatomic
allocations into those superpageblocks that have already been
"tainted" by such allocations.

The goal is to leave as many 1GB superpageblocks as possible
used by only movable allocations, so they can be easily
defragmented for either regular PMD sized huge pages, or
for PUD sized huge pages.
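
As a rough illustration of the sizes involved, the sketch below
assumes x86-64 with 4KB base pages, where a PMD maps 2MB and a PUD
maps 1GB. The macro and function names are hypothetical and are
not taken from the series; other architectures and page sizes
give different numbers.

```c
#include <assert.h>

/* Assumed x86-64 values with 4KB base pages (illustrative only). */
#define EX_PAGE_SHIFT	12	/* 4KB base page */
#define EX_PMD_SHIFT	21	/* 2MB pageblock */
#define EX_PUD_SHIFT	30	/* 1GB superpageblock */

static_assert(EX_PUD_SHIFT > EX_PMD_SHIFT, "PUD must be larger than PMD");

/* Number of 2MB pageblocks inside one PUD sized superpageblock. */
#define EX_PAGEBLOCKS_PER_SPB	(1UL << (EX_PUD_SHIFT - EX_PMD_SHIFT))

/* Number of base pages inside one PUD sized superpageblock. */
#define EX_PAGES_PER_SPB	(1UL << (EX_PUD_SHIFT - EX_PAGE_SHIFT))

/* Hypothetical helper: which superpageblock does a page frame fall in? */
static unsigned long ex_pfn_to_spb(unsigned long pfn)
{
	return pfn >> (EX_PUD_SHIFT - EX_PAGE_SHIFT);
}
```

Under these assumptions each superpageblock covers 512 pageblocks,
or 262144 base pages, so a single unmovable page in the wrong place
can block a 1GB allocation.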

Various strategies are used to accomplish this goal:
- unmovable and reclaimable allocations are preferentially
  done from 1GB blocks that have already been "tainted" by
  these allocations
- kernel allocations that can be done as one higher order
  allocation, or a number of smaller allocations (e.g. kvmalloc)
  will fall back to small pages, rather than taint a new
  1GB block
- movable allocations are preferentially done from clean 1GB
  blocks, which have only free and movable memory inside,
  starting with the fullest of these 1GB blocks
- 2MB allocations follow the same strategy
- 1GB allocations start with the emptiest clean 1GB block
- if a 1GB block is mixed, with some movable pageblocks,
  some free pageblocks, and some unmovable/reclaimable pageblocks,
  the system applies a free memory threshold to that 1GB block
- once free memory in that 1GB block drops below the threshold,
  no new movable allocations are allowed in it, while new
  unmovable/reclaimable allocations are still allowed
- when a 1GB block is below that threshold, use the migration
  code to evacuate enough movable memory from the 1GB block
  to bring free memory in that 1GB block back to the threshold
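
The placement rules above can be sketched in a few lines of C.
This is a simplified userspace model, not code from the series:
the structure, the function names, and the threshold value are
all made up for illustration, and the real allocator works on
free lists rather than scanning an array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical threshold, in free 2MB pageblocks per 1GB block. */
#define EX_FREE_THRESHOLD	64u

enum ex_alloc_kind { EX_UNMOVABLE, EX_MOVABLE, EX_1GB };

/* Hypothetical per-1GB-block summary, counted in 2MB pageblocks. */
struct ex_spb {
	unsigned int free_pb;		/* free pageblocks */
	unsigned int movable_pb;	/* pageblocks with movable memory */
	unsigned int tainted_pb;	/* unmovable/reclaimable/highatomic */
};

static bool ex_is_clean(const struct ex_spb *b)
{
	return b->tainted_pb == 0;
}

/* Mixed blocks below the threshold take no new movable allocations. */
static bool ex_allows_movable(const struct ex_spb *b)
{
	if (ex_is_clean(b))
		return b->free_pb > 0;
	return b->free_pb >= EX_FREE_THRESHOLD;
}

/* Pageblocks to migrate away to bring a mixed block back to threshold. */
static unsigned int ex_evacuation_target(const struct ex_spb *b)
{
	unsigned int deficit;

	if (ex_is_clean(b) || b->free_pb >= EX_FREE_THRESHOLD)
		return 0;
	deficit = EX_FREE_THRESHOLD - b->free_pb;
	return deficit < b->movable_pb ? deficit : b->movable_pb;
}

/* Return the index of the preferred 1GB block, or -1 if none fits. */
static int ex_pick(const struct ex_spb *blk, size_t n, enum ex_alloc_kind kind)
{
	int best = -1;

	assert(blk != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		const struct ex_spb *b = &blk[i];

		if (b->free_pb == 0)
			continue;
		switch (kind) {
		case EX_UNMOVABLE:
			/* Reuse the first already tainted block with room. */
			if (!ex_is_clean(b))
				return (int)i;
			break;
		case EX_MOVABLE:
			/* Fullest clean block: fewest free pageblocks. */
			if (ex_is_clean(b) &&
			    (best < 0 || b->free_pb < blk[best].free_pb))
				best = (int)i;
			break;
		case EX_1GB:
			/* Emptiest clean block: most free pageblocks. */
			if (ex_is_clean(b) &&
			    (best < 0 || b->free_pb > blk[best].free_pb))
				best = (int)i;
			break;
		}
	}
	return best;
}
```

In this model a return value of -1 for an unmovable request means
no tainted block has room, at which point the caller would either
fall back to smaller pages or taint a new 1GB block. Packing
movable allocations into the fullest clean block while taking 1GB
pages from the emptiest one keeps the two from competing for the
same blocks.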

These strategies together serve to concentrate unmovable and
reclaimable allocations in as few 1GB blocks as possible,
leaving as many 1GB blocks as possible available for movable
allocations.

That enables both more extensive use of 2MB THPs and mTHPs,
and reliable allocation of 1GB pages.

The above strategies also make the core page allocator
more complicated, and slower. To offset that cost,
the series is built on top of Johannes's PCPBuddy series,
which has the goal of reducing how often CPUs need to get
pages from the zone free lists, instead relying on CPUs
giving back pages to each other, based on page block ownership.

TODO:
- compaction "always" succeeds, with a success rate of 99.96% seen
  in traces; this sounds great, but it also results in compaction
  never being throttled, and compaction blowing out everybody's
  PCP through lru_add_drain() calls. This needs some sort of solution.
- replace the superpageblock name with something Matthew and David
  both like
- find more corner cases, and fix them

Based on e1914add2799
