| From: |
| Mikołaj Lenczewski <miko.lenczewski-AT-arm.com> |
| To: |
| catalin.marinas-AT-arm.com, will-AT-kernel.org, corbet-AT-lwn.net, maz-AT-kernel.org, oliver.upton-AT-linux.dev, joey.gouly-AT-arm.com, suzuki.poulose-AT-arm.com, yuzenghui-AT-huawei.com |
| Subject: |
| [RFC PATCH v1 0/5] Initial BBML2 support for contpte_convert() |
| Date: |
| Wed, 11 Dec 2024 15:45:01 +0000 |
| Message-ID: |
| <20241211154611.40395-1-miko.lenczewski@arm.com> |
| Cc: |
| Mikołaj Lenczewski <miko.lenczewski-AT-arm.com>, linux-arm-kernel-AT-lists.infradead.org, liunx-doc-AT-vger.kernel.org, linux-kernel-AT-vger.kernel.org, kvmarm-AT-vger.kernel.org |
Hi All,
This patch series seeks to gather feedback on adding initial support
for level 2 of the Break-Before-Make arm64 architectural feature,
specifically to contpte_convert().
This support reorders a TLB invalidation in contpte_convert() and
optionally elides that invalidation completely. Eliding it gives a 12%
improvement on a microbenchmark designed to force the pathological
path where contpte_convert() gets called, which represents an 80%
reduction in the cost of calling contpte_convert().
However, the elision of the invalidation is still pending review to
ensure it is architecturally valid. Even without the elision, the
reordering alone improves performance by reducing thread contention,
as there is a smaller time window in which racing threads can see an
invalid pagetable entry (especially if they already have a cached entry
in their TLB that they are working from).
This series is based on v6.13-rc2 (fac04efc5c79).
Break-Before-Make Level 2
=========================
Break-Before-Make (BBM) sequences ensure a consistent view of the
page tables. They avoid TLB multi-hits and ensure atomicity and
ordering guarantees. BBM level 0 simply defines the current use
of page tables. When you want to change certain bits in a pte,
you need to:
- clear the pte
- dsb()
- issue a tlbi for the pte
- dsb()
- repaint the pte
- dsb()
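The level 0 sequence above can be sketched as a small user-space model.
This is only an illustration of the ordering, not kernel code: struct
model, model_dsb(), model_tlbi(), model_set_pte() and
bbm_level0_update() are hypothetical names made up for this sketch, and
the "TLB" is reduced to a single flag.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
enum step { STEP_CLEAR, STEP_DSB, STEP_TLBI, STEP_WRITE };

#define MAX_STEPS 16

struct model {
	uint64_t pte;			/* the modelled page table entry */
	int tlb_valid;			/* 1 while a stale translation may be cached */
	enum step trace[MAX_STEPS];	/* order in which steps were executed */
	int nsteps;
};

/* Append one step to the trace so the ordering can be inspected. */
static void record(struct model *m, enum step s)
{
	assert(m->nsteps < MAX_STEPS);
	m->trace[m->nsteps++] = s;
}

/* Stand-in for a barrier: the model only records that it happened. */
static void model_dsb(struct model *m)
{
	record(m, STEP_DSB);
}

/* Stand-in for a tlbi: drops the modelled cached translation. */
static void model_tlbi(struct model *m)
{
	m->tlb_valid = 0;
	record(m, STEP_TLBI);
}

static void model_set_pte(struct model *m, uint64_t val, enum step s)
{
	m->pte = val;
	record(m, s);
}

/* BBM level 0: break (clear), invalidate, then make (repaint). */
static void bbm_level0_update(struct model *m, uint64_t new_pte)
{
	model_set_pte(m, 0, STEP_CLEAR);	/* clear the pte */
	model_dsb(m);
	model_tlbi(m);				/* issue a tlbi for the pte */
	model_dsb(m);
	model_set_pte(m, new_pte, STEP_WRITE);	/* repaint the pte */
	model_dsb(m);
}
```

Calling bbm_level0_update() on a model that starts with tlb_valid set
leaves six entries in the trace, in the same order as the list above,
with the new pte value installed and the modelled cached translation
gone.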
When changing block size, or toggling the contiguous bit, we
currently use this BBM level 0 sequence. With BBM level 2 support,
however, we can relax the BBM sequence and benefit from a performance
improvement. The hardware would then either automatically handle the
TLB invalidations, or would take a TLB Conflict Abort Exception.
This exception can be either a stage 1 or a stage 2 exception, depending
on whether stage 1 or stage 2 translations are in use. The architecture
currently mandates a worst-case invalidation of vmalle1 when stage 2
translation is not in use, or vmalls12e1 when it is in use.
Outstanding Questions and Remaining TODOs
=========================================
Patch 4 moves the tlbi so that the window where the pte is invalid is
significantly smaller. This reduces the chances of racing threads
accessing the memory during the window and taking a fault. This is
confirmed to be architecturally sound.
Patch 5 removes the tlbi entirely. This has the benefit of
significantly reducing the cost of contpte_convert(). While testing
has demonstrated that this works as expected on Arm-designed CPUs, we
are still in the process of confirming whether it is architecturally
correct. I am requesting review while that process is ongoing. Patch 5
would be dropped if it turns out to be architecturally unsound.
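To make the difference between the existing ordering, patch 4 and
patch 5 concrete, here is a rough user-space model. It is a sketch
under stated assumptions, not the code from the patches: it assumes
"delaying" the tlbi means issuing it after the new pte has been
written, it ignores barriers entirely, and enum tlbi_policy,
struct convert_stats and model_convert() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model, not kernel code. Barriers are deliberately omitted. */
enum tlbi_policy {
	TLBI_BEFORE_MAKE,	/* existing BBM level 0 ordering */
	TLBI_AFTER_MAKE,	/* assumed shape of patch 4: tlbi delayed */
	TLBI_ELIDED,		/* assumed shape of patch 5: no tlbi at all */
};

struct convert_stats {
	uint64_t pte;			/* modelled pte, 0 means invalid */
	int tlbi_count;			/* invalidations issued */
	int tlbi_while_invalid;		/* invalidations issued while pte == 0 */
};

/* Count the invalidation and note whether it widened the invalid window. */
static void model_tlbi(struct convert_stats *s)
{
	s->tlbi_count++;
	if (s->pte == 0)
		s->tlbi_while_invalid++;
}

static struct convert_stats model_convert(enum tlbi_policy policy,
					  uint64_t new_pte)
{
	struct convert_stats s = { .pte = 0x1001 };

	assert(new_pte != 0);		/* 0 is reserved for "invalid" here */

	s.pte = 0;			/* break: the pte becomes invalid */
	if (policy == TLBI_BEFORE_MAKE)
		model_tlbi(&s);		/* invalid window includes the tlbi */
	s.pte = new_pte;		/* make: the pte is valid again */
	if (policy == TLBI_AFTER_MAKE)
		model_tlbi(&s);		/* tlbi no longer inside the window */
	return s;
}
```

In this model only TLBI_BEFORE_MAKE performs the invalidation while the
pte is invalid. TLBI_AFTER_MAKE still pays for one tlbi but outside the
window, and TLBI_ELIDED pays for none, which matches the motivation
given for patches 4 and 5 respectively.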
Another note is that the stage 2 TLB conflict handling is included as
patch 1 of this series. This patch could (and probably should) be sent
separately as it may be useful outside this series, but is included for
reference.
Thanks,
Miko
Mikołaj Lenczewski (5):
  arm64: Add TLB Conflict Abort Exception handler to KVM
  arm64: Add BBM Level 2 cpu feature
  arm64: Add errata and workarounds for systems with broken BBML2
  arm64/mm: Delay tlbi in contpte_convert() under BBML2
  arm64/mm: Elide tlbi in contpte_convert() under BBML2
Documentation/arch/arm64/silicon-errata.rst | 32 ++++
arch/arm64/Kconfig | 164 ++++++++++++++++++++
arch/arm64/include/asm/cpufeature.h | 14 ++
arch/arm64/include/asm/esr.h | 8 +
arch/arm64/kernel/cpufeature.c | 37 +++++
arch/arm64/kvm/mmu.c | 6 +
arch/arm64/mm/contpte.c | 3 +-
arch/arm64/mm/fault.c | 27 +++-
arch/arm64/tools/cpucaps | 1 +
9 files changed, 290 insertions(+), 2 deletions(-)
--
2.45.2