From: Tao Cui <cuitao-AT-kylinos.cn>
To: tj-AT-kernel.org, hannes-AT-cmpxchg.org, mkoutny-AT-suse.com, leon-AT-kernel.org, jgg-AT-ziepe.ca
Subject: [RFC PATCH rdma-next 0/5] cgroup/rdma: add per-type resource accounting for QP, MR and MR memory
Date: Mon, 25 May 2026 13:55:01 +0800
Message-ID: <20260525055506.2002985-1-cuitao@kylinos.cn>
Cc: linux-rdma-AT-vger.kernel.org, cgroups-AT-vger.kernel.org, Tao Cui <cuitao-AT-kylinos.cn>
Currently the RDMA cgroup only tracks two aggregate counters:
hca_handle and hca_object. This is too coarse for real-world
deployment: a tenant can exhaust all HCA objects by creating nothing
but QPs, while the administrator has no way to impose separate limits
on QP count, MR count, or the cumulative memory registered through
MRs.
This RFC series adds per-type resource counters for three new
resource types on top of the existing hca_handle / hca_object:
  - qp      - Queue Pair count
  - mr      - Memory Region count
  - mr_mem  - Cumulative MR memory size in bytes
After this series, an administrator can set limits such as:
echo "mlx5_0 qp=100 mr=500 mr_mem=1073741824" > rdma.max
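To make the 64-bit requirement concrete, here is a minimal userspace sketch of
parsing such a limit line. It is not the code from the series: parse_s64() and
parse_limits() are hypothetical names, strtoll() stands in for the kernel's
kstrtoll(), and accepting "max" mirrors the existing rdma.max syntax. Keys that
are absent from the line are left untouched.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the three new per-type limits. */
struct limits {
	char dev[64];
	int64_t qp, mr, mr_mem;
};

/* Userspace stand-in for a match_s64-style helper (assumption). */
static int parse_s64(const char *s, int64_t *out)
{
	char *end;
	long long v;

	if (strcmp(s, "max") == 0) {
		*out = INT64_MAX;
		return 0;
	}
	errno = 0;
	v = strtoll(s, &end, 10);
	/* Reject out-of-range, empty, trailing-garbage and negative values. */
	if (errno == ERANGE || end == s || *end != '\0' || v < 0)
		return -EINVAL;
	*out = v;
	return 0;
}

/* Parse "<device> key=value ..." into lim; returns 0 or -EINVAL. */
int parse_limits(const char *line, struct limits *lim)
{
	char buf[256], *tok;

	if (strlen(line) >= sizeof(buf))
		return -EINVAL;
	strcpy(buf, line);

	/* strtok() is not reentrant; good enough for a single-threaded sketch. */
	tok = strtok(buf, " ");
	if (!tok || strlen(tok) >= sizeof(lim->dev))
		return -EINVAL;
	strcpy(lim->dev, tok);

	while ((tok = strtok(NULL, " ")) != NULL) {
		char *eq = strchr(tok, '=');
		int64_t *dst;

		if (!eq)
			return -EINVAL;
		*eq = '\0';
		if (strcmp(tok, "qp") == 0)
			dst = &lim->qp;
		else if (strcmp(tok, "mr") == 0)
			dst = &lim->mr;
		else if (strcmp(tok, "mr_mem") == 0)
			dst = &lim->mr_mem;
		else
			return -EINVAL;
		if (parse_s64(eq + 1, dst))
			return -EINVAL;
	}
	return 0;
}
```

A byte limit such as mr_mem=3000000000 does not fit in a 32-bit int, which is
why a byte-based counter needs 64-bit parsing and accounting.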
Design decisions that I would appreciate feedback on:
1. Dual charging: the existing hca_object charge is retained for
QP and MR objects. The per-type counter is charged in addition.
This keeps backward compatibility - existing deployments that rely
on hca_object limits continue to work. An alternative would be
to replace hca_object with per-type counters entirely, but that
breaks the ABI.
2. MR memory is byte-based: unlike the qp and mr counters, which are
simple counts, mr_mem tracks the actual length parameter passed at
MR registration time (both ioctl and legacy verbs paths). This
required changing the internal accounting from int to s64. The
match_int parser is replaced with a match_s64 helper using
kstrtoll.
3. Charging point for mr_mem: the byte charge happens after the
MR is created but before the uobject is finalized, so that the
error path can deregister the MR cleanly. The charged byte count
is stored in uobj->rdmacg_mr_mem_bytes so that the generic
destroy / abort paths can uncharge without knowing the MR length.
4. Overflow protection: the s64 addition in rdmacg_try_charge()
checks for both overflow (new < old) and limit exceedance.
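
The four points above can be illustrated with a small userspace model. This is
only a sketch under stated assumptions, not the code from the series: all
names (res_pool, try_charge, mr_create, mr_obj) are hypothetical, the errno
values are choices made for the sketch, and the GCC/Clang builtin
__builtin_add_overflow() is used for the overflow check instead of the
(new < old) comparison. It shows dual charging with rollback, s64 amounts, and
remembering the charged byte count so that destroy can uncharge without
knowing the MR length.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

enum res_type { RES_HCA_OBJECT, RES_QP, RES_MR, RES_MR_MEM, RES_NR };

/* Hypothetical per-cgroup, per-device pool of counters and limits. */
struct res_pool {
	int64_t usage[RES_NR];
	int64_t max[RES_NR];
};

/* Charge one counter; fail on s64 overflow or when the limit is exceeded. */
static int try_charge(struct res_pool *p, enum res_type t, int64_t amount)
{
	int64_t new_usage;

	if (amount < 0)
		return -EINVAL;
	/* GCC/Clang builtin: reports wraparound without undefined behavior. */
	if (__builtin_add_overflow(p->usage[t], amount, &new_usage))
		return -EOVERFLOW;
	if (new_usage > p->max[t])
		return -EAGAIN;
	p->usage[t] = new_usage;
	return 0;
}

static void uncharge(struct res_pool *p, enum res_type t, int64_t amount)
{
	assert(amount >= 0 && p->usage[t] >= amount);
	p->usage[t] -= amount;
}

/* Stand-in for the uobject: remembers how many bytes were charged. */
struct mr_obj {
	int64_t mr_mem_bytes;
};

/* Dual charging: hca_object plus the per-type mr and mr_mem counters. */
int mr_create(struct res_pool *p, struct mr_obj *mr, int64_t length)
{
	int ret;

	ret = try_charge(p, RES_HCA_OBJECT, 1);
	if (ret)
		return ret;
	ret = try_charge(p, RES_MR, 1);
	if (ret)
		goto err_obj;
	/* In the series the MR itself exists by now; that is not modelled. */
	ret = try_charge(p, RES_MR_MEM, length);
	if (ret)
		goto err_mr;
	mr->mr_mem_bytes = length;
	return 0;

err_mr:
	uncharge(p, RES_MR, 1);
err_obj:
	uncharge(p, RES_HCA_OBJECT, 1);
	return ret;
}

/* Destroy uses the stored byte count, not the MR length. */
void mr_destroy(struct res_pool *p, struct mr_obj *mr)
{
	uncharge(p, RES_MR_MEM, mr->mr_mem_bytes);
	mr->mr_mem_bytes = 0;
	uncharge(p, RES_MR, 1);
	uncharge(p, RES_HCA_OBJECT, 1);
}
```

In this model a failed mr_mem charge leaves the hca_object and mr counters
exactly as they were before the call, which is the property the error path in
point 3 needs to preserve.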
Open questions:
- Should hca_object be deprecated in favor of the per-type counters,
or should we keep dual charging indefinitely?
- The mr_mem counter tracks the length requested by the user, not
the actual pinned pages. A process that registers a large MR but
only touches a subset still consumes the full quota. Is this the
right semantic, or should we instead track pinned_page_counts?
This is marked RFC because the cgroup ABI change (new resource types)
is hard to revoke once merged, and I want to make sure the above
design choices are aligned with the maintainers' expectations before
proceeding to a formal submission.
Tao Cui (5):
  cgroup/rdma: extend charge/uncharge API with s64 amount parameter
  cgroup/rdma: add QP per-type resource counting
  cgroup/rdma: add MR per-type resource counting
  cgroup/rdma: add MR memory size per-type resource counting
  cgroup/rdma: update cgroup resource list for QP, MR and MR_MEM
Documentation/admin-guide/cgroup-v2.rst | 19 ++-
drivers/infiniband/core/cgroup.c | 10 +-
drivers/infiniband/core/core_priv.h | 12 +-
drivers/infiniband/core/rdma_core.c | 48 +++++-
drivers/infiniband/core/uverbs_cmd.c | 16 +-
drivers/infiniband/core/uverbs_std_types_mr.c | 32 ++++
include/linux/cgroup_rdma.h | 10 +-
include/rdma/ib_verbs.h | 2 +
kernel/cgroup/rdma.c | 151 ++++++++++++++----
9 files changed, 243 insertions(+), 57 deletions(-)
--
2.43.0