From: Barry Song <21cnbao-AT-gmail.com>
To: Catalin Marinas <catalin.marinas-AT-arm.com>, Will Deacon <will-AT-kernel.org>, Marek Szyprowski <m.szyprowski-AT-samsung.com>, Robin Murphy <robin.murphy-AT-arm.com>
Subject: [RFC PATCH 0/5] dma-mapping: arm64: support batched cache sync
Date: Wed, 29 Oct 2025 10:31:10 +0800
Message-ID: <20251029023115.22809-1-21cnbao@gmail.com>
Cc: Barry Song <v-songbaohua-AT-oppo.com>, Ada Couprie Diaz <ada.coupriediaz-AT-arm.com>, Ard Biesheuvel <ardb-AT-kernel.org>, Marc Zyngier <maz-AT-kernel.org>, Anshuman Khandual <anshuman.khandual-AT-arm.com>, Ryan Roberts <ryan.roberts-AT-arm.com>, Suren Baghdasaryan <surenb-AT-google.com>, Tangquan Zheng <zhengtangquan-AT-oppo.com>, linux-arm-kernel-AT-lists.infradead.org, linux-kernel-AT-vger.kernel.org, iommu-AT-lists.linux.dev
From: Barry Song <v-songbaohua@oppo.com>
Many embedded ARM64 SoCs still lack hardware cache coherency for DMA, so
the CPU has to perform cache maintenance in software, and DMA mapping
operations show up as hotspots in on-CPU flame graphs. For an SG list
with *nents* entries, the current dma_map_sg()/dma_unmap_sg() and DMA
sync APIs perform cache maintenance one entry at a time: after each
entry, the implementation waits synchronously for that region's D-cache
operations to complete (on arm64, with a DSB barrier). On architectures
like arm64, efficiency can be improved by issuing the operations for all
entries first and then waiting for completion once, in a single batch.
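The difference between the two strategies can be modelled in a few lines
of C. This is a userspace sketch, not kernel code: issue_entry() stands in
for the per-line cache-maintenance instructions and wait_complete() for
the completion barrier, and both only bump counters. The struct names, the
64-byte cache line and the assumption that every entry is line-aligned are
illustrative assumptions, not taken from the series.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache-line size; real arm64 code reads it from CTR_EL0. */
#define CACHE_LINE 64u

/* Hypothetical counters: maintenance ops issued and barriers executed. */
struct cache_model {
	size_t line_ops;
	size_t waits;
};

/* Hypothetical scatterlist entry; assumed to start on a line boundary. */
struct sg_entry {
	size_t len;
};

/* Issue maintenance for every line of one entry, without waiting. */
static void issue_entry(struct cache_model *m, const struct sg_entry *e)
{
	assert(e->len > 0);
	m->line_ops += (e->len + CACHE_LINE - 1) / CACHE_LINE;
}

/* Wait for all previously issued maintenance to complete. */
static void wait_complete(struct cache_model *m)
{
	m->waits++;
}

/* Current behaviour: one wait after every entry. */
static void sync_per_entry(struct cache_model *m,
			   const struct sg_entry *sg, size_t nents)
{
	for (size_t i = 0; i < nents; i++) {
		issue_entry(m, &sg[i]);
		wait_complete(m);
	}
}

/* Batched behaviour: issue everything, then wait once. */
static void sync_batched(struct cache_model *m,
			 const struct sg_entry *sg, size_t nents)
{
	for (size_t i = 0; i < nents; i++)
		issue_entry(m, &sg[i]);
	if (nents)
		wait_complete(m);
}
```

For a 10 MB buffer split into 4 KB entries, both functions issue the same
163,840 line operations in this model, but sync_per_entry() waits 2560
times while sync_batched() waits once.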
Tangquan's initial results show that batched synchronization reduces
dma_map_sg() time by 64.61% and dma_unmap_sg() time by 66.60% on a
MediaTek (MTK) phone platform (Dimensity 9500). The tests pinned the task
to CPU7 with the CPU frequency fixed at 2.6 GHz, ran dma_map_sg() and
dma_unmap_sg() on 10 MB buffers (10 MB / 4 KB = 2560 SG entries per
buffer) for 200 iterations, and averaged the results.
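As a worked check of these numbers, assuming MB and KB are binary units
and that the percentages are reductions in average per-call time, the
entry count and the equivalent speedup factors are:

```latex
N = \frac{10\ \text{MB}}{4\ \text{KB}} = \frac{10 \cdot 2^{20}}{2^{12}} = 2560 \ \text{entries per buffer}

S_{\text{map}} = \frac{1}{1 - 0.6461} \approx 2.83, \qquad
S_{\text{unmap}} = \frac{1}{1 - 0.6660} \approx 2.99
```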
Barry Song (5):
arm64: Provide dcache_by_myline_op_nosync helper
arm64: Provide dcache_clean_poc_nosync helper
arm64: Provide dcache_inval_poc_nosync helper
arm64: Provide arch_sync_dma_ batched helpers
dma-mapping: Allow batched DMA sync operations if supported by the arch
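The last patch lets generic dma-direct code batch only when the
architecture opts in. One plausible shape for such an opt-in, sketched
below as userspace C, is for the generic loop to always call a "nosync"
hook per entry followed by a single flush. An architecture that does not
opt in gets a fallback where each hook waits by itself and the flush is a
no-op. HYPOTHETICAL_ARCH_HAS_BATCHED_DMA_SYNC, arch_clean_nosync(),
arch_batch_flush() and sync_sg_for_device() are hypothetical names, not
the identifiers used in the series.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the Kconfig symbol an arch selects to opt in
 * (1 models arm64 with this series; 0 models an arch without batching). */
#define HYPOTHETICAL_ARCH_HAS_BATCHED_DMA_SYNC 1

static size_t issued_entries; /* entries whose maintenance was issued */
static size_t barriers;       /* completion waits executed */

/* Model of issuing cache maintenance for one buffer without waiting. */
static void issue_clean(size_t len)
{
	assert(len > 0);
	issued_entries++;
}

/* Model of waiting for all outstanding maintenance. */
static void wait_for_maintenance(void)
{
	barriers++;
}

#if HYPOTHETICAL_ARCH_HAS_BATCHED_DMA_SYNC
/* Opted-in arch: the per-entry hook only issues; the flush waits once. */
static void arch_clean_nosync(size_t len) { issue_clean(len); }
static void arch_batch_flush(void) { wait_for_maintenance(); }
#else
/* Fallback: each hook is fully synchronous and the flush does nothing,
 * which keeps the existing one-wait-per-entry behaviour. */
static void arch_clean_nosync(size_t len)
{
	issue_clean(len);
	wait_for_maintenance();
}
static void arch_batch_flush(void) { }
#endif

/* Generic code: identical whether or not the arch opts in. */
static void sync_sg_for_device(const size_t *lens, size_t nents)
{
	for (size_t i = 0; i < nents; i++)
		arch_clean_nosync(lens[i]);
	if (nents)
		arch_batch_flush();
}
```

With the macro set to 1, calling sync_sg_for_device() on a three-entry
list issues maintenance for three entries and executes one barrier; with
it set to 0, the same call executes three barriers. The loop in generic
code stays the same, and the decision moves to the architecture.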
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/assembler.h | 79 +++++++++++++++++++-------
arch/arm64/include/asm/cacheflush.h | 2 +
arch/arm64/mm/cache.S | 58 +++++++++++++++----
arch/arm64/mm/dma-mapping.c | 24 ++++++++
include/linux/dma-map-ops.h | 8 +++
kernel/dma/Kconfig | 3 +
kernel/dma/direct.c | 53 ++++++++++++++++--
kernel/dma/direct.h | 86 +++++++++++++++++++++++++----
9 files changed, 267 insertions(+), 47 deletions(-)
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: iommu@lists.linux.dev
--
2.39.3 (Apple Git-146)