summaryrefslogtreecommitdiffstats
path: root/tools/sched_ext
AgeCommit message (Collapse)AuthorLines
2026-03-02tools/sched_ext: Add -fms-extensions to bpf build flagsZhao Mengmeng-0/+2
Similar to commit 835a50753579 ("selftests/bpf: Add -fms-extensions to bpf build flags") and commit 639f58a0f480 ("bpftool: Fix build warnings due to MS extensions") The kernel is now built with -fms-extensions, therefore generated vmlinux.h contains types like: struct aes_key { struct aes_enckey; union aes_invkey_arch inv_k; }; struct ns_common { ... union { struct ns_tree; struct callback_head ns_rcu; }; }; Which raise warning like below when building scx scheduler: tools/sched_ext/build/include/vmlinux.h:50533:3: warning: declaration does not declare anything [-Wmissing-declarations] 50533 | struct ns_tree; | ^ Fix it by using -fms-extensions and -Wno-microsoft-anon-tag flags to build bpf programs that #include "vmlinux.h" Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-27tools/sched_ext: fix strtoul() misuse in scx_hotplug_seq()David Carlier-2/+5
scx_hotplug_seq() uses strtoul() but validates the result with a negative check (val < 0), which can never be true for an unsigned return value. Use the endptr mechanism to verify the entire string was consumed, and check errno == ERANGE for overflow detection. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-24tools/sched_ext: Add Kconfig to sync with upstreamCheng-Yang Chou-0/+61
Add the missing Kconfig file to tools/sched_ext/ as referenced in the README. Ref: https://github.com/sched-ext/scx/blob/main/kernel.config Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-24tools/sched_ext: Sync README.md Kconfig with upstream scxCheng-Yang Chou-6/+0
Sync the documentation with the upstream scx repository to reflect the current recommended configuration. Ref: https://github.com/sched-ext/scx/blob/main/README.md#build--install Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-23tools/sched_ext: scx_sdt: Remove unused '-f' optionCheng-Yang Chou-1/+1
The '-f' option is defined in getopt() but not handled in the switch statement or documented in the help text. Providing '-f' currently triggers the default error path. Remove it to sync the optstring with the actual implementation. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-23tools/sched_ext: scx_central: Remove unused '-p' optionCheng-Yang Chou-1/+1
The '-p' option is defined in getopt() but not handled in the switch statement or documented in the help text. Providing '-p' currently triggers the default error path. Remove it to sync the optstring with the actual implementation. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-20tools/sched_ext: fix getopt not re-parsed on restartDavid Carlier-0/+6
After goto restart, optind retains its advanced position from the previous getopt loop, causing getopt() to immediately return -1. This silently drops all command-line options on the restarted skeleton. Reset optind to 1 at the restart label so options are re-parsed. Affected schedulers: scx_simple, scx_central, scx_flatcg, scx_pair, scx_sdt, scx_cpu0. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-20tools/sched_ext: scx_userland: fix data races on shared countersDavid Carlier-13/+13
The stats thread reads nr_vruntime_enqueues, nr_vruntime_dispatches, nr_vruntime_failed, and nr_curr_enqueued concurrently with the main thread writing them, with no synchronization. Use __atomic builtins with relaxed ordering for all accesses to these counters to eliminate the data races. Only display accuracy is affected, not scheduling correctness. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-18tools/sched_ext: scx_pair: fix stride == 0 crash on single-CPU systemsDavid Carlier-1/+6
nr_cpu_ids / 2 produces stride 0 on a single-CPU system, which later causes SCX_BUG_ON(i == j) to fire. Validate stride after option parsing to also catch invalid user-supplied values via -S. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-18tools/sched_ext: scx_central: fix CPU_SET and skeleton leak on early exitDavid Carlier-1/+2
Use CPU_SET_S() instead of CPU_SET() on the dynamically allocated cpuset to avoid a potential out-of-bounds write when nr_cpu_ids exceeds CPU_SETSIZE. Also destroy the skeleton before returning on invalid central CPU ID to prevent a resource leak. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-16tools/sched_ext: scx_userland: fix stale data on restartDavid Carlier-0/+8
Reset all counters, tasks and vruntime_head list on restart. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-16tools/sched_ext: scx_flatcg: fix potential stack overflow from VLA in ↵David Carlier-4/+9
fcg_read_stats fcg_read_stats() had a VLA allocating 21 * nr_cpus * 8 bytes on the stack, risking stack overflow on large CPU counts (nr_cpus can be up to 512). Fix by using a single heap allocation with the correct size, reusing it across all stat indices, and freeing it at the end. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-12tools/sched_ext: scx_userland: fix restart and stats thread lifecycle bugsDavid Carlier-2/+3
Fix three issues in scx_userland's restart path: - exit_req is not reset on restart, causing sched_main_loop() to exit immediately without doing any scheduling work. - stats_printer thread handle is local to spawn_stats_thread(), making it impossible to join from main(). Promote it to file scope. - The stats thread continues reading skel->bss after the skeleton is destroyed on restart, causing a use-after-free. Join the stats thread before destroying the skeleton to ensure it has exited. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-12tools/sched_ext: scx_central: fix sched_setaffinity() call with the set sizeDavid Carlier-2/+4
The cpu set is dynamically allocated for nr_cpu_ids using CPU_ALLOC(), so the size passed to sched_setaffinity() should be CPU_ALLOC_SIZE() rather than sizeof(cpu_set_t). Valgrind flagged this as accessing unaddressable bytes past the allocation. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-12tools/sched_ext: scx_flatcg: zero-initialize stats counter arrayDavid Carlier-0/+1
The local cnts array in read_stats() is not initialized before being accumulated into per-CPU stats, which may lead to reading garbage values. Zero it out with memset alongside the existing stats array initialization. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-11Merge tag 'sched_ext-for-6.20' of ↵Linus Torvalds-8/+2567
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext updates from Tejun Heo: - Move C example schedulers back from the external scx repo to tools/sched_ext as the authoritative source. scx_userland and scx_pair are returning while scx_sdt (BPF arena-based task data management) is new. These schedulers will be dropped from the external repo. - Improve error reporting by adding scx_bpf_error() calls when DSQ creation fails across all in-tree schedulers - Avoid redundant irq_work_queue() calls in destroy_dsq() by only queueing when llist_add() indicates an empty list - Fix flaky init_enable_count selftest by properly synchronizing pre-forked children using a pipe instead of sleep() * tag 'sched_ext-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: selftests/sched_ext: Fix init_enable_count flakiness tools/sched_ext: Fix data header access during free in scx_sdt tools/sched_ext: Add error logging for dsq creation failures in remaining schedulers tools/sched_ext: add arena based scheduler tools/sched_ext: add scx_pair scheduler tools/sched_ext: add scx_userland scheduler sched_ext: Add error logging for dsq creation failures sched_ext: Avoid multiple irq_work_queue() calls in destroy_dsq()
2026-02-02tools/sched_ext: Fix data header access during free in scx_sdtEmil Tsalapatis-1/+1
Fix a pointer arithmetic error in scx_sdt during freeing that causes the allocator to use the wrong memory address for the allocation's data header. Fixes: 36929ebd17ae ("tools/sched_ext: add arena based scheduler") Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-01tools/sched_ext: Add error logging for dsq creation failures in remaining ↵zhidao su-4/+34
schedulers Add scx_bpf_error() calls when scx_bpf_create_dsq() fails in the remaining schedulers to improve debuggability: - scx_simple.bpf.c: simple_init() - scx_sdt.bpf.c: sdt_init() - scx_cpu0.bpf.c: cpu0_init() - scx_flatcg.bpf.c: fcg_init() This follows the same pattern established in commit 2f8d489897ae ("sched_ext: Add error logging for dsq creation failures") for other schedulers and ensures consistent error reporting across all schedulers. Signed-off-by: zhidao su <suzhidao@xiaomi.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-01-27tools/sched_ext: add arena based schedulerEmil Tsalapatis-1/+925
Add a scheduler that uses BPF arenas to manage task context data. Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-01-27tools/sched_ext: add scx_pair schedulerEmil Tsalapatis-1/+800
Add the scx_pair cgroup-based core scheduler. Cc: Tejun Heo <tj@kernel.org> Cc: David Vernet <dvernet@meta.com> Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-01-27tools/sched_ext: add scx_userland schedulerEmil Tsalapatis-1/+799
Add in the scx_userland scheduler that does vruntime-based scheduling in userspace code and communicates scheduling decisions to BPF by accessing and modifying globals through the skeleton. Cc: Tejun Heo <tj@kernel.org> Cc: David Vernet <dvernet@meta.com> Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-01-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc5Alexei Starovoitov-4/+6
Cross-merge BPF and other fixes after downstream PR. No conflicts. Adjacent: Auto-merging MAINTAINERS Auto-merging Makefile Auto-merging kernel/bpf/verifier.c Auto-merging kernel/sched/ext.c Auto-merging mm/memcontrol.c Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-12sched_ext: Add error logging for dsq creation failuresGeorge Guo-4/+12
Add scx_bpf_error() calls when scx_bpf_create_dsq() fails in multiple schedulers to improve debuggability: - scx_central.bpf.c: central_init() - scx_flatcg.bpf.c: fcg_cgroup_init() and fcg_init() - scx_qmap.bpf.c: qmap_init() Signed-off-by: George Guo <guodongtai@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-28tools/sched_ext: update scx_show_state.py for scx_aborting changeKohei Enju-2/+1
Commit a69040ed57f5 ("sched_ext: Simplify breather mechanism with scx_aborting flag") removed scx_in_softlockup and scx_breather_depth, replacing them with scx_aborting. Update the script accordingly. Fixes: a69040ed57f5 ("sched_ext: Simplify breather mechanism with scx_aborting flag") Signed-off-by: Kohei Enju <enjuk@amazon.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-28tools/sched_ext: fix scx_show_state.py for scx_root changeKohei Enju-2/+5
Commit 48e126777386 ("sched_ext: Introduce scx_sched") introduced scx_root and removed scx_ops, causing scx_show_state.py to fail when searching for the 'scx_ops' object. [1] Fix by using 'scx_root' instead, with NULL pointer handling. [1] # drgn -s vmlinux ./tools/sched_ext/scx_show_state.py Traceback (most recent call last): File "/root/.venv/bin/drgn", line 8, in <module> sys.exit(_main()) ~~~~~^^ File "/root/.venv/lib64/python3.14/site-packages/drgn/cli.py", line 625, in _main runpy.run_path( ~~~~~~~~~~~~~~^ script_path, init_globals={"prog": prog}, run_name="__main__" ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ) ^ File "<frozen runpy>", line 287, in run_path File "<frozen runpy>", line 98, in _run_module_code File "<frozen runpy>", line 88, in _run_code File "./tools/sched_ext/scx_show_state.py", line 30, in <module> ops = prog['scx_ops'] ~~~~^^^^^^^^^^^ _drgn.ObjectNotFoundError: could not find 'scx_ops' Fixes: 48e126777386 ("sched_ext: Introduce scx_sched") Signed-off-by: Kohei Enju <enjuk@amazon.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-19lib/Kconfig.debug: Set the minimum required pahole version to v1.22Ihor Solodrai-1/+0
Subsequent patches in the series change vmlinux linking scripts to unconditionally pass --btf_encode_detached to pahole, which was introduced in v1.22 [1][2]. This change allows to remove PAHOLE_HAS_SPLIT_BTF Kconfig option and other checks of older pahole versions. [1] https://github.com/acmel/dwarves/releases/tag/v1.22 [2] https://lore.kernel.org/bpf/cbafbf4e-9073-4383-8ee6-1353f9e5869c@oracle.com/ Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Nicolas Schier <nsc@kernel.org> Link: https://lore.kernel.org/bpf/20251219181825.1289460-1-ihor.solodrai@linux.dev
2025-11-20sched_ext: tools: Removing duplicate targets during non-cross compilationRong Tao-0/+2
When cross-compilation is not used, BPFOBJ and HOST_BPFOBJ are identical files, libbpf.a, and duplicate libbpf.a files should be removed. Signed-off-by: Rong Tao <rongtao@cestc.cn> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12sched_ext: Add scx_cpu0 example schedulerTejun Heo-1/+195
Add scx_cpu0, a simple scheduler that queues all tasks to a single DSQ and only dispatches them from CPU0 in FIFO order. This is useful for testing bypass behavior when many tasks are concentrated on a single CPU. If the load balancer doesn't work, bypass mode can trigger task hangs or RCU stalls as the queue is long and there's only one CPU working on it. v2: Check whether task is on CPU0 at enqueue using scx_bpf_task_cpu() instead of nr_cpus_allowed (Andrea Righi). Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Cc: Emil Tsalapatis <etsal@meta.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-29sched_ext/tools: Restore backward compat with v6.12 kernelsTejun Heo-8/+90
Commit 111a79800aed ("tools/sched_ext: Strip compatibility macros for cgroup and dispatch APIs") removed the compat layer for v6.12-v6.13 kfunc renaming, but v6.12 is the current LTS kernel and will remain supported through 2026. Restore backward compatibility so schedulers built with v6.19+ headers can run on v6.12 kernels. The restored compat differs from the original in two ways: 1. Uses ___new/___old suffixes instead of ___compat for clarity. The new macros check for v6.13+ names (scx_bpf_dsq_move*), fall back to v6.12 names (scx_bpf_dispatch_from_dsq*, scx_bpf_consume), then return safe no-ops for missing symbols. 2. Integrates with the args-struct-packing changes added in c0d630ba347c ("sched_ext: Wrap kfunc args in struct to prepare for aux__prog"). scx_bpf_dsq_insert_vtime() now tries __scx_bpf_dsq_insert_vtime (args struct), then scx_bpf_dsq_insert_vtime___compat (v6.13-v6.18), then scx_bpf_dispatch_vtime___compat (pre-v6.13). Forward compatibility is not restored - binaries built against v6.13 or earlier headers won't run on v6.19+ kernels, as the old kfunc names are not exported. This is acceptable since the priority is new binaries running on older kernels. Also add missing compat checks for ops.cgroup_set_bandwidth() (added v6.17) and ops.cgroup_set_idle() (added v6.19). These need to be NULLed out in userspace on older kernels. Reported-by: Andrea Righi <arighi@nvidia.com> Acked-and-tested-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-29sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhereTejun Heo-11/+51
The ops.cpu_acquire/release() callbacks miss events under multiple conditions. There are two distinct task dispatch gaps that can cause cpu_released flag desynchronization: 1. balance-to-pick_task gap: This is what was originally reported. balance_scx() can enqueue a task, but during consume_remote_task() when the rq lock is released, a higher priority task can be enqueued and ultimately picked while cpu_released remains false. This gap is closeable via RETRY_TASK handling. 2. ttwu-to-pick_task gap: ttwu() can directly dispatch a task to a CPU's local DSQ. By the time the sched path runs on the target CPU, higher class tasks may already be queued. In such cases, nothing on sched_ext side will be invoked, and the only solution would be a hook invoked regardless of sched class, which isn't desirable. Rather than adding invasive core hooks, BPF schedulers can use generic BPF mechanisms like tracepoints. From SCX scheduler's perspective, this is congruent with other mechanisms it already uses and doesn't add further friction. The main use case for cpu_release() was calling scx_bpf_reenqueue_local() when a CPU gets preempted by a higher priority scheduling class. However, the old scx_bpf_reenqueue_local() could only be called from cpu_release() context. Add a new version of scx_bpf_reenqueue_local() that can be called from any context by deferring the actual re-enqueue operation. This eliminates the need for cpu_acquire/release() ops entirely. Schedulers can now use standard BPF mechanisms like the sched_switch tracepoint to detect and handle CPU preemption. Update scx_qmap to demonstrate the new approach using sched_switch instead of cpu_release, with compat support for older kernels. Mark cpu_acquire/release() as deprecated. The old scx_bpf_reenqueue_local() variant will be removed in v6.23. Reported-by: Wen-Fang Liu <liuwenfang@honor.com> Link: https://lore.kernel.org/all/8d64c74118c6440f81bcf5a4ac6b9f00@honor.com/ Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-24sched_ext: Add ___compat suffix to scx_bpf_dsq_insert___v2 in compat.bpf.hTejun Heo-3/+5
2dbbdeda77a6 ("sched_ext: Fix scx_bpf_dsq_insert() backward binary compatibility") renamed the new bool-returning variant to scx_bpf_dsq_insert___v2 in the kernel. However, libbpf currently only strips ___SUFFIX on the BPF side, not on kernel symbols, so the compat wrapper couldn't match the kernel kfunc and would always fall back to the old variant even when the new one was available. Add an extra ___compat suffix as a workaround - libbpf strips one suffix on the BPF side leaving ___v2, which then matches the kernel kfunc directly. In the future when libbpf strips all suffixes on both sides, all suffixes can be dropped. Fixes: 2dbbdeda77a6 ("sched_ext: Fix scx_bpf_dsq_insert() backward binary compatibility") Cc: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-21sched_ext: Fix scx_bpf_dsq_insert() backward binary compatibilityTejun Heo-5/+5
cded46d97159 ("sched_ext: Make scx_bpf_dsq_insert*() return bool") introduced a new bool-returning scx_bpf_dsq_insert() and renamed the old void-returning version to scx_bpf_dsq_insert___compat, with the expectation that libbpf would match old binaries to the ___compat variant, maintaining backward binary compatibility. However, while libbpf ignores ___suffix on the BPF side when matching symbols, it doesn't do so for kernel-side symbols. Old binaries compiled with the original scx_bpf_dsq_insert() could no longer resolve the symbol. Fix by reversing the naming: Keep scx_bpf_dsq_insert() as the old void-returning interface and add ___v2 to the new bool-returning version. This allows old binaries to continue working while new code can use the ___v2 variant. Once libbpf is updated to ignore kernel-side ___SUFFIX, the ___v2 suffix can be dropped when the compat interface is removed. v2: Use ___v2 instead of ___new. Fixes: cded46d97159 ("sched_ext: Make scx_bpf_dsq_insert*() return bool") Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-15sched_ext: Add lockless peek operation for DSQsRyan Newton-0/+19
The builtin DSQ queue data structures are meant to be used by a wide range of different sched_ext schedulers with different demands on these data structures. They might be per-cpu with low-contention, or high-contention shared queues. Unfortunately, DSQs have a coarse-grained lock around the whole data structure. Without going all the way to a lock-free, more scalable implementation, a small step we can take to reduce lock contention is to allow a lockless, small-fixed-cost peek at the head of the queue. This change allows certain custom SCX schedulers to cheaply peek at queues, e.g. during load balancing, before locking them. But it represents a few extra memory operations to update the pointer each time the DSQ is modified, including a memory barrier on ARM so the write appears correctly ordered. This commit adds a first_task pointer field which is updated atomically when the DSQ is modified, and allows any thread to peek at the head of the queue without holding the lock. Signed-off-by: Ryan Newton <newton@meta.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-13sched_ext/tools: Add compat wrapper for scx_bpf_task_set_slice/dsq_vtime()Tejun Heo-2/+24
for sub-scheduler authority checks. Add compat wrappers which fall back to direct p->scx field writes on older kernels. Suggested-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-13sched_ext: Make scx_bpf_dsq_insert*() return boolTejun Heo-4/+22
In preparation for hierarchical schedulers, change scx_bpf_dsq_insert() and scx_bpf_dsq_insert_vtime() to return bool instead of void. With sub-schedulers, there will be no reliable way to guarantee a task is still owned by the sub-scheduler at insertion time (e.g., the task may have been migrated to another scheduler). The bool return value will enable sub-schedulers to detect and gracefully handle insertion failures. For the root scheduler, insertion failures will continue to trigger scheduler abort via scx_error(), so existing code doesn't need to check the return value. Backward compatibility is maintained through compat wrappers. Also update scx_bpf_dsq_move() documentation to clarify that it can return false for sub-schedulers when @dsq_id points to a disallowed local DSQ. Reviewed-by: Changwoo Min <changwoo@igalia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-13sched_ext: Wrap kfunc args in struct to prepare for aux__progTejun Heo-3/+75
scx_bpf_dsq_insert_vtime() and scx_bpf_select_cpu_and() currently have 5 parameters. An upcoming change will add aux__prog parameter which will exceed BPF's 5 argument limit. Prepare by adding new kfuncs __scx_bpf_dsq_insert_vtime() and __scx_bpf_select_cpu_and() that take args structs. The existing kfuncs are kept as compatibility wrappers. BPF programs use inline wrappers that detect kernel API version via bpf_core_type_exists() and use the new struct-based kfuncs when available, falling back to compat kfuncs otherwise. This allows BPF programs to work with both old and new kernels. Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-13sched_ext: Add scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime()Tejun Heo-0/+2
With the planned hierarchical scheduler support, sub-schedulers will need to be verified for authority before being allowed to modify task->scx.slice and task->scx.dsq_vtime. Add scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() which will perform the necessary permission checks. Root schedulers can still directly write to these fields, so this doesn't affect existing schedulers. Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-13tools/sched_ext: Strip compatibility macros for cgroup and dispatch APIsTejun Heo-120/+12
Enough time has passed since the introduction of scx_bpf_task_cgroup() and the scx_bpf_dispatch* -> scx_bpf_dsq* kfunc renaming. Strip the compatibility macros. Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23tools/sched_ext: scx_qmap: Make debug output quieter by defaultTejun Heo-39/+53
scx_qmap currently outputs verbose debug messages including cgroup operations and CPU online/offline events by default, which can be noisy during normal operation. While the existing -P option controls DSQ dumps and event statistics, there's no way to suppress the other debug messages. Split the debug output controls to make scx_qmap quieter by default. The -P option continues to control DSQ dumps and event statistics (print_dsqs_and_events), while a new -M option controls debug messages like cgroup operations and CPU events (print_msgs). This allows users to run scx_qmap with minimal output and selectively enable debug information as needed. Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23sched_ext: Make qmap dump operation non-destructiveTejun Heo-1/+17
The qmap dump operation was destructively consuming queue entries while displaying them. As dump can be triggered anytime, this can easily lead to stalls. Add a temporary dump_store queue and modify the dump logic to pop entries, display them, and then restore them back to the original queue. This allows dump operations to be performed without affecting the scheduler's queue state. Note that if racing against new enqueues during dump, ordering can get mixed up, but this is acceptable for debugging purposes. Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-03tools/sched_ext: Add compat helper for scx_bpf_cpu_curr()Andrea Righi-1/+18
Introduce a compatibility helper that allows BPF schedulers to use scx_bpf_cpu_curr() on older kernels. Fixes: 20b158094a1ad ("sched_ext: Introduce scx_bpf_cpu_curr()") Cc: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-03sched_ext: Introduce scx_bpf_cpu_curr()Christian Loehle-0/+1
Provide scx_bpf_cpu_curr() as a way for scx schedulers to check the curr task of a remote rq without assuming its lock is held. Many scx schedulers make use of scx_bpf_cpu_rq() to check a remote curr (e.g. to see if it should be preempted). This is problematic because scx_bpf_cpu_rq() provides access to all fields of struct rq, most of which aren't safe to use without holding the associated rq lock. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-03sched_ext: Introduce scx_bpf_locked_rq()Christian Loehle-0/+1
Most fields in scx_bpf_cpu_rq() assume that its rq_lock is held. Furthermore they become meaningless without rq lock, too. Make a safer version of scx_bpf_cpu_rq() that only returns a rq if we hold rq lock of that rq. Also mark the new scx_bpf_locked_rq() as returning NULL as scx_bpf_cpu_rq() should've been too. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-11tools/sched_ext: Receive updates from SCX repoAndrea Righi-60/+388
Receive tools/sched_ext updates form https://github.com/sched-ext/scx to sync userspace bits: - basic BPF arena allocator abstractions, - additional process flags definitions, - fixed is_migration_disabled() helper, - separate out user_exit_info BPF and user space code. This also fixes the following warning when building the selftests: tools/sched_ext/include/scx/common.bpf.h:550:9: warning: 'likely' macro redefined [-Wmacro-redefined] 550 | #define likely(x) __builtin_expect(!!(x), 1) | ^ Co-developed-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-20sched_ext: Add support for cgroup bandwidth control interfaceTejun Heo-0/+23
From 077814f57f8acce13f91dc34bbd2b7e4911fbf25 Mon Sep 17 00:00:00 2001 From: Tejun Heo <tj@kernel.org> Date: Fri, 13 Jun 2025 15:06:47 -1000 - Add CONFIG_GROUP_SCHED_BANDWIDTH which is selected by both CONFIG_CFS_BANDWIDTH and EXT_GROUP_SCHED. - Put bandwidth control interface files for both cgroup v1 and v2 under CONFIG_GROUP_SCHED_BANDWIDTH. - Update tg_bandwidth() to fetch configuration parameters from fair if CONFIG_CFS_BANDWIDTH, SCX otherwise. - Update tg_set_bandwidth() to update the parameters for both fair and SCX. - Add bandwidth control parameters to struct scx_cgroup_init_args. - Add sched_ext_ops.cgroup_set_bandwidth() which is invoked on bandwidth control parameter updates. - Update scx_qmap and maximal selftest to test the new feature. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-18sched_ext: change the variable name for slice refill eventHonglei Wang-2/+2
SCX_EV_ENQ_SLICE_DFL gives the impression that the event only occurs when the tasks were enqueued, which seems not accurate. What it actually means is the refilling with defalt slice, and this can occur either when enqueue or pick_task. Let's change the variable to SCX_EV_REFILL_SLICE_DFL. Signed-off-by: Honglei Wang <jameshongleiwang@126.com> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-14sched_ext: Improve cross-compilation support in Makefileyangsonghua-6/+17
Modify the tools/sched_ext/Makefile to better handle cross-compilation environments by: 1. Fix host tools build directory structure by separating obj/ from output (HOST_BUILD_DIR now points to $(OBJ_DIR)/host/obj) 2. Properly propagate CROSS_COMPILE to libbpf sub-make invocation 3. Add missing $(HOST_BPFOBJ) build rule with proper host toolchain flags (ARCH=, CROSS_COMPILE=, explicit HOSTCC/HOSTLD) 4. Consistently quote $(HOSTCC) in bpftool build rule 5. Change LDFLAGS assignment to += to allow external extensions The changes ensure proper cross-compilation behavior while maintaining backward compatibility with native builds. Host tools are now correctly built with the host toolchain while target binaries use the cross-toolchain. Signed-off-by: yangsonghua <yangsonghua@lixiang.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-08sched_ext: Merge branch 'for-6.15-fixes' into for-6.16Tejun Heo-1/+1
Pull for-6.15-fixes to receive: e776b26e3701 ("sched_ext: Remove cpu.weight / cpu.idle unimplemented warnings") which conflicts with: 1a7ff7216c8b ("sched_ext: Drop "ops" from scx_ops_enable_state and friends") The former removes code updated by the latter. Resolved by removing the updated section. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-08sched_ext: Mark SCX_OPS_HAS_CGROUP_WEIGHT for deprecationTejun Heo-1/+1
SCX_OPS_HAS_CGROUP_WEIGHT was only used to suppress the missing cgroup weight support warnings. Now that the warnings are removed, the flag doesn't do anything. Mark it for deprecation and remove its usage from scx_flatcg. v2: Actually include the scx_flatcg update. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-and-reviewed-by: Andrea Righi <arighi@nvidia.com>
2025-04-07sched_ext: idle: Introduce scx_bpf_select_cpu_and()Andrea Righi-0/+2
Provide a new kfunc, scx_bpf_select_cpu_and(), that can be used to apply the built-in idle CPU selection policy to a subset of allowed CPU. This new helper is basically an extension of scx_bpf_select_cpu_dfl(). However, when an idle CPU can't be found, it returns a negative value instead of @prev_cpu, aligning its behavior more closely with scx_bpf_pick_idle_cpu(). It also accepts %SCX_PICK_IDLE_* flags, which can be used to enforce strict selection to @prev_cpu's node (%SCX_PICK_IDLE_IN_NODE), or to request only a full-idle SMT core (%SCX_PICK_IDLE_CORE), while applying the built-in selection logic. With this helper, BPF schedulers can apply the built-in idle CPU selection policy restricted to any arbitrary subset of CPUs. Example usage ============= Possible usage in ops.select_cpu(): s32 BPF_STRUCT_OPS(foo_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags) { const struct cpumask *cpus = task_allowed_cpus(p) ?: p->cpus_ptr; s32 cpu; cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags, cpus, 0); if (cpu >= 0) { scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0); return cpu; } return prev_cpu; } Results ======= Load distribution on a 4 sockets, 4 cores per socket system, simulated using virtme-ng, running a modified version of scx_bpfland that uses scx_bpf_select_cpu_and() with 0xff00 as the allowed subset of CPUs: $ vng --cpu 16,sockets=4,cores=4,threads=1 ... $ stress-ng -c 16 ... $ htop ... 0[ 0.0%] 8[||||||||||||||||||||||||100.0%] 1[ 0.0%] 9[||||||||||||||||||||||||100.0%] 2[ 0.0%] 10[||||||||||||||||||||||||100.0%] 3[ 0.0%] 11[||||||||||||||||||||||||100.0%] 4[ 0.0%] 12[||||||||||||||||||||||||100.0%] 5[ 0.0%] 13[||||||||||||||||||||||||100.0%] 6[ 0.0%] 14[||||||||||||||||||||||||100.0%] 7[ 0.0%] 15[||||||||||||||||||||||||100.0%] With scx_bpf_select_cpu_dfl() tasks would be distributed evenly across all the available CPUs. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>