Defer throttle when task exits to user
From: Aaron Lu <ziqianlu-AT-bytedance.com>
To: Valentin Schneider <vschneid-AT-redhat.com>, Ben Segall <bsegall-AT-google.com>, K Prateek Nayak <kprateek.nayak-AT-amd.com>, Peter Zijlstra <peterz-AT-infradead.org>, Josh Don <joshdon-AT-google.com>, Ingo Molnar <mingo-AT-redhat.com>, Vincent Guittot <vincent.guittot-AT-linaro.org>, Xi Wang <xii-AT-google.com>
Subject: [RFC PATCH v2 0/7] Defer throttle when task exits to user
Date: Wed, 09 Apr 2025 20:07:39 +0800
Message-ID: <20250409120746.635476-1-ziqianlu@bytedance.com>
Cc: linux-kernel-AT-vger.kernel.org, Juri Lelli <juri.lelli-AT-redhat.com>, Dietmar Eggemann <dietmar.eggemann-AT-arm.com>, Steven Rostedt <rostedt-AT-goodmis.org>, Mel Gorman <mgorman-AT-suse.de>, Chengming Zhou <chengming.zhou-AT-linux.dev>, Chuyi Zhou <zhouchuyi-AT-bytedance.com>, Jan Kiszka <jan.kiszka-AT-siemens.com>
This is a continuation of Valentin Schneider's work posted here:

Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
https://lore.kernel.org/lkml/20240711130004.2157737-1-vsc...

Valentin described the problem very well in the above link. We also see task hung problems from time to time in our environment due to cfs quota. The problem is mostly visible with rwsem: when a reader is throttled while holding the lock, a writer that comes in has to wait, and that writer in turn makes all subsequent readers wait, causing priority inversion or even a whole-system hang.

To improve this situation, change the throttle model to be task based: when a cfs_rq is throttled, mark its throttled status but do not remove it from its cpu's rq. Instead, when tasks belonging to this cfs_rq get picked, add a task work to them so that they are dequeued when they return to user space. This way, throttled tasks do not hold any kernel resources. When the cfs_rq gets unthrottled, enqueue those throttled tasks back.
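To make the mechanism concrete, here is a minimal sketch of the two pieces involved, using the kernel's task_work API (init_task_work()/task_work_add() with TWA_RESUME, which runs the callback on the task's return to user space). The field names used here (sched_throttle_work, throttle_node, throttled_limbo_list) and the simplified dequeue step are assumptions for illustration, not necessarily what the patches use; locking and the check against re-arming the work are omitted:

  /* Runs via task_work when the task returns to user space. */
  static void throttle_cfs_rq_work(struct callback_head *work)
  {
          struct task_struct *p = container_of(work, struct task_struct,
                                               sched_throttle_work);
          struct rq *rq = task_rq(p);

          /* Take the task off the runqueue and park it on its cfs_rq's
           * limbo list until the cfs_rq is unthrottled. */
          dequeue_task(rq, p, DEQUEUE_SLEEP);
          list_add(&p->throttle_node,
                   &cfs_rq_of(&p->se)->throttled_limbo_list);
  }

  /* Called at pick time for a task in a throttled hierarchy. */
  static void task_throttle_setup_work(struct task_struct *p)
  {
          init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work);
          task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
  }

Because the work only runs at the user-space boundary, a task that is in the kernel keeps running until it drops back to user, which is what avoids the rwsem-style inversions described above.

There are consequences with this new throttle model. For example, consider a cfs_rq with 3 tasks attached: when 2 of the tasks have been throttled on their return-to-user path while one task is still running in kernel mode, this cfs_rq is in a partially throttled state:

- Should its pelt clock be frozen?
- Should this state be accounted into throttled_time?

For the pelt clock, I chose to keep the current behavior and freeze it at the cfs_rq's throttle time. The assumption is that tasks running in kernel mode should not last too long, and freezing the cfs_rq's pelt clock keeps its load and its corresponding sched_entity's weight stable. Hopefully, this results in a stable situation that lets the remaining running tasks quickly finish their jobs in kernel mode.

For throttle time accounting, I see several possibilities:

- Similar to the current behavior: start accounting when the cfs_rq gets throttled (if cfs_rq->nr_queued > 0) and stop accounting when it gets unthrottled. This has one drawback: if the cfs_rq has one task when it gets throttled and that task eventually blocks instead of returning to user, then the cfs_rq has no tasks on its throttled list but the time is still accounted as throttled. Patch2 and patch3 implement this accounting (simple, fewer code changes).
- Start accounting when the throttled cfs_rq has at least one task on its throttled list; stop accounting when it is unthrottled. This over-accounts throttled time because the partial throttle state is accounted.
- Start accounting when the throttled cfs_rq has no tasks left and its throttled list is not empty; stop accounting when the cfs_rq is unthrottled. This under-accounts throttled time because the partial throttle state is not accounted. Patch7 implements this accounting.

I do not have a strong feeling about which accounting is best; it's open for discussion.

There was also the concern about the increased duration of (un)throttle operations raised against v1. I've done some tests and, with a 2000 cgroups/20K runnable tasks setup on a 2-socket/384-cpu AMD server, the longest duration of distribute_cfs_runtime() is in the 2ms-4ms range. For details, please see:
https://lore.kernel.org/lkml/20250324085822.GA732629@byte...
For the throttle path, with Chengming's suggestion to move the "task work setup" from throttle time to pick time, it is not an issue anymore.

Patches:

Patch1 is preparation work; patch2-3 provide the main functionality.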
Patch2 deals with the throttle path: when a cfs_rq is to be throttled, mark the throttled status for this cfs_rq; when a task in the throttled hierarchy gets picked, add a task work to it so that when the task returns to user space, the task work can throttle it by dequeuing it and remembering this by adding the task to its cfs_rq's limbo list.

Patch3 deals with the unthrottle path: when a cfs_rq is to be unthrottled, enqueue back the tasks on its limbo list (a sketch of this step appears at the end of this message).

Patch4 deals with the dequeue path when a task changes group, sched class etc. A throttled task is dequeued in fair, but task->on_rq is still set, so when it changes task group or sched class, or has an affinity setting change, the core will first dequeue it. Since this task is already dequeued in the fair class, this patch handles that situation.

Patch5-6 are cleanups. Some code is obsolete after switching to the task based throttle mechanism.

Patch7 implements an alternative accounting mechanism for task based throttle.

Changes since v1:
- Move "add task work" from throttle time to pick time, suggested by Chengming Zhou;
- Use scope_guard() and cond_resched_tasks_rcu_qs() in throttle_cfs_rq_work(), suggested by K Prateek Nayak;
- Remove the now obsolete throttled_lb_pair(), suggested by K Prateek Nayak;
- Fix the cfs_rq->runtime_remaining condition check in unthrottle_cfs_rq(), suggested by K Prateek Nayak;
- Fix h_nr_runnable accounting for the delayed dequeue case when task based throttle is in use;
- Implement an alternative way of throttle time accounting, for discussion purposes;
- Make the !CONFIG_CFS_BANDWIDTH case build.

I hope I didn't omit any feedback I've received, but feel free to let me know if I did. As in v1, all change logs were written by me and if they read badly, it's my fault.

Comments are welcome.

Base commit: tip/sched/core, commit 6432e163ba1b ("sched/isolation: Make use of more than one housekeeping cpu").

Aaron Lu (4):
  sched/fair: Take care of group/affinity/sched_class change for
    throttled task
  sched/fair: get rid of throttled_lb_pair()
  sched/fair: fix h_nr_runnable accounting with per-task throttle
  sched/fair: alternative way of accounting throttle time

Valentin Schneider (3):
  sched/fair: Add related data structure for task based throttle
  sched/fair: Handle throttle path for task based throttle
  sched/fair: Handle unthrottle path for task based throttle

 include/linux/sched.h |   4 +
 kernel/sched/core.c   |   3 +
 kernel/sched/fair.c   | 449 ++++++++++++++++++++++--------------------
 kernel/sched/sched.h  |   7 +
 4 files changed, 248 insertions(+), 215 deletions(-)

--
2.39.5
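For reference, the enqueue-back step referred to in the patch3 description above could look roughly like the sketch below. As with the earlier sketch, the helper name (enqueue_throttled_tasks) and field names (throttled_limbo_list, throttle_node) are assumptions for illustration, and the surrounding unthrottle bookkeeping (runtime accounting, pelt unfreeze, locking) is omitted:

  /* On unthrottle, return every task that was parked at throttle
   * time to the runqueue. */
  static void enqueue_throttled_tasks(struct cfs_rq *cfs_rq)
  {
          struct task_struct *p, *tmp;

          list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list,
                                   throttle_node) {
                  list_del_init(&p->throttle_node);
                  enqueue_task(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
          }
  }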