<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/kernel/sched/ext.c, branch v6.17</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v6.17</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v6.17'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2025-09-16T20:15:23Z</updated>
<entry>
<title>Revert "sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()"</title>
<updated>2025-09-16T20:15:23Z</updated>
<author>
<name>Andrea Righi</name>
<email>arighi@nvidia.com</email>
</author>
<published>2025-09-12T16:14:38Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=0b47b6c3543efd65f2e620e359b05f4938314fbd'/>
<id>urn:sha1:0b47b6c3543efd65f2e620e359b05f4938314fbd</id>
<content type='text'>
scx_bpf_reenqueue_local() can be called from ops.cpu_release() when a
CPU is taken by a higher scheduling class to give tasks queued to the
CPU's local DSQ a chance to be migrated somewhere else, instead of
waiting indefinitely for that CPU to become available again.

In doing so, we decided to skip migration-disabled tasks, under the
assumption that they cannot be migrated anyway.

However, when a higher scheduling class preempts a CPU, the running task
is always inserted at the head of the local DSQ as a migration-disabled
task. This means it is always skipped by scx_bpf_reenqueue_local(), and
ends up being confined to the same CPU even if that CPU is heavily
contended by other higher scheduling class tasks.

As an example, let's consider the following scenario:

 $ schedtool -a 0,1 -e yes &gt; /dev/null &amp;
 $ sudo schedtool -F -p 99 -a 0 -e \
   stress-ng -c 1 --cpu-load 99 --cpu-load-slice 1000

The first task (SCHED_EXT) can run on CPU0 or CPU1. The second task
(SCHED_FIFO) is pinned to CPU0 and consumes ~99% of it. If the SCHED_EXT
task initially runs on CPU0, it will remain there because it always sees
CPU0 as "idle" in the short gaps left by the RT task, resulting in ~1%
utilization while CPU1 stays idle:

    0[||||||||||||||||||||||100.0%]   8[                        0.0%]
    1[                        0.0%]   9[                        0.0%]
    2[                        0.0%]  10[                        0.0%]
    3[                        0.0%]  11[                        0.0%]
    4[                        0.0%]  12[                        0.0%]
    5[                        0.0%]  13[                        0.0%]
    6[                        0.0%]  14[                        0.0%]
    7[                        0.0%]  15[                        0.0%]
  PID USER       PRI  NI  S CPU  CPU%▽MEM%   TIME+  Command
 1067 root        RT   0  R   0  99.0  0.2  0:31.16 stress-ng-cpu [run]
  975 arighi      20   0  R   0   1.0  0.0  0:26.32 yes

By allowing scx_bpf_reenqueue_local() to re-enqueue migration-disabled
tasks, the scheduler can choose to migrate them to other CPUs (CPU1 in
this case) via ops.enqueue(), leading to better CPU utilization:

    0[||||||||||||||||||||||100.0%]   8[                        0.0%]
    1[||||||||||||||||||||||100.0%]   9[                        0.0%]
    2[                        0.0%]  10[                        0.0%]
    3[                        0.0%]  11[                        0.0%]
    4[                        0.0%]  12[                        0.0%]
    5[                        0.0%]  13[                        0.0%]
    6[                        0.0%]  14[                        0.0%]
    7[                        0.0%]  15[                        0.0%]
  PID USER       PRI  NI  S CPU  CPU%▽MEM%   TIME+  Command
  577 root        RT   0  R   0 100.0  0.2  0:23.17 stress-ng-cpu [run]
  555 arighi      20   0  R   1 100.0  0.0  0:28.67 yes

It's debatable whether per-CPU tasks should be re-enqueued as well, but
doing so is probably safer: the scheduler can recognize re-enqueued
tasks through the %SCX_ENQ_REENQ flag, reassess their placement, and
either put them back at the head of the local DSQ or let another task
attempt to take the CPU.

This also prevents giving per-CPU tasks an implicit priority boost,
which would otherwise make them more likely to reclaim CPUs preempted by
higher scheduling classes.

Fixes: 97e13ecb02668 ("sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()")
Cc: stable@vger.kernel.org # v6.15+
Signed-off-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Acked-by: Changwoo Min &lt;changwoo@igalia.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/ext: Fix invalid task state transitions on class switch</title>
<updated>2025-08-11T16:56:37Z</updated>
<author>
<name>Andrea Righi</name>
<email>arighi@nvidia.com</email>
</author>
<published>2025-08-05T08:59:11Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=ddf7233fcab6c247379d0928d46cc316ee122229'/>
<id>urn:sha1:ddf7233fcab6c247379d0928d46cc316ee122229</id>
<content type='text'>
When enabling a sched_ext scheduler, we may trigger invalid task state
transitions, resulting in warnings like the following (which can be
easily reproduced by running the hotplug selftest in a loop):

 sched_ext: Invalid task state transition 0 -&gt; 3 for fish[770]
 WARNING: CPU: 18 PID: 787 at kernel/sched/ext.c:3862 scx_set_task_state+0x7c/0xc0
 ...
 RIP: 0010:scx_set_task_state+0x7c/0xc0
 ...
 Call Trace:
  &lt;TASK&gt;
  scx_enable_task+0x11f/0x2e0
  switching_to_scx+0x24/0x110
  scx_enable.isra.0+0xd14/0x13d0
  bpf_struct_ops_link_create+0x136/0x1a0
  __sys_bpf+0x1edd/0x2c30
  __x64_sys_bpf+0x21/0x30
  do_syscall_64+0xbb/0x370
  entry_SYSCALL_64_after_hwframe+0x77/0x7f

This happens because we skip initialization for tasks that are already
dead (with their usage counter set to zero), but we don't exclude them
during the scheduling class transition phase.

Fix this by also skipping dead tasks during class switching, preventing
invalid task state transitions.
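
The shape of the fix, as a sketch (identifier names follow
kernel/sched/ext.c, but this is not the literal diff):

 /* in scx_enable(), while switching tasks to the SCX class */
 while ((p = scx_task_iter_next_locked(&amp;sti))) {
 	/*
 	 * Dead tasks were skipped at init time and never reached
 	 * SCX_TASK_READY; enabling them here would attempt the invalid
 	 * 0 -&gt; 3 (SCX_TASK_NONE -&gt; SCX_TASK_ENABLED) transition above.
 	 */
 	if (scx_get_task_state(p) == SCX_TASK_NONE)
 		continue;

 	/* ... proceed with the class switch as before ... */
 }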

Fixes: a8532fac7b5d2 ("sched_ext: TASK_DEAD tasks must be switched into SCX on ops_enable")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'sched_ext-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext</title>
<updated>2025-07-31T23:29:46Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2025-07-31T23:29:46Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=6a68cec16b647791d448102376a7eec2820e874f'/>
<id>urn:sha1:6a68cec16b647791d448102376a7eec2820e874f</id>
<content type='text'>
Pull sched_ext updates from Tejun Heo:

 - Add support for cgroup "cpu.max" interface

 - Code organization cleanup so that ext_idle.c doesn't depend on the
   source-file-inclusion build method of sched/

 - Drop UP paths in accordance with sched core changes

 - Documentation and other misc changes

* tag 'sched_ext-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Fix scx_bpf_reenqueue_local() reference
  sched_ext: Drop kfuncs marked for removal in 6.15
  sched_ext, rcu: Eject BPF scheduler on RCU CPU stall panic
  kernel/sched/ext.c: fix typo "occured" -&gt; "occurred" in comments
  sched_ext: Add support for cgroup bandwidth control interface
  sched_ext, sched/core: Factor out struct scx_task_group
  sched_ext: Return NULL in llc_span
  sched_ext: Always use SMP versions in kernel/sched/ext_idle.h
  sched_ext: Always use SMP versions in kernel/sched/ext_idle.c
  sched_ext: Always use SMP versions in kernel/sched/ext.h
  sched_ext: Always use SMP versions in kernel/sched/ext.c
  sched_ext: Documentation: Clarify time slice handling in task lifecycle
  sched_ext: Make scx_locked_rq() inline
  sched_ext: Make scx_rq_bypassing() inline
  sched_ext: idle: Make local functions static in ext_idle.c
  sched_ext: idle: Remove unnecessary ifdef in scx_bpf_cpu_node()
</content>
</entry>
<entry>
<title>sched_ext: Fix scx_bpf_reenqueue_local() reference</title>
<updated>2025-07-17T18:17:26Z</updated>
<author>
<name>Christian Loehle</name>
<email>christian.loehle@arm.com</email>
</author>
<published>2025-07-08T16:12:51Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=ae96bba1ca0000ebb3f3ced64c9367e2a223d69e'/>
<id>urn:sha1:ae96bba1ca0000ebb3f3ced64c9367e2a223d69e</id>
<content type='text'>
The comment mentions bpf_scx_reenqueue_local(), but the function is
provided for the BPF program implementing scx; as such, the naming
convention is scx_bpf_reenqueue_local(). Fix the comment.

Signed-off-by: Christian Loehle &lt;christian.loehle@arm.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/ext: Prevent update_locked_rq() calls with NULL rq</title>
<updated>2025-07-17T01:02:12Z</updated>
<author>
<name>Breno Leitao</name>
<email>leitao@debian.org</email>
</author>
<published>2025-07-16T17:38:48Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=e14fd98c6d66cb76694b12c05768e4f9e8c95664'/>
<id>urn:sha1:e14fd98c6d66cb76694b12c05768e4f9e8c95664</id>
<content type='text'>
Avoid invoking update_locked_rq() when the runqueue (rq) pointer is NULL
in the SCX_CALL_OP and SCX_CALL_OP_RET macros.

Previously, calling update_locked_rq(NULL) with preemption enabled could
trigger the following warning:

    BUG: using __this_cpu_write() in preemptible [00000000]

This happens because __this_cpu_write() is unsafe to use in preemptible
context.

rq is NULL when an ops is invoked from an unlocked context. In such
cases, we don't need to store any rq, since the value should already be
NULL (unlocked). Ensure that update_locked_rq() is only called when rq
is non-NULL, preventing __this_cpu_write() from being called in
preemptible context.
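
Concretely, the macros now guard both the store and the clear
(simplified sketch, not the verbatim diff):

 /* in SCX_CALL_OP() and SCX_CALL_OP_RET() */
 if (rq)
 	update_locked_rq(rq);
 /* ... invoke the op ... */
 if (rq)
 	update_locked_rq(NULL);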

Suggested-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Fixes: 18853ba782bef ("sched_ext: Track currently locked rq")
Signed-off-by: Breno Leitao &lt;leitao@debian.org&gt;
Acked-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: stable@vger.kernel.org # v6.15
</content>
</entry>
<entry>
<title>sched_ext: Drop kfuncs marked for removal in 6.15</title>
<updated>2025-06-25T23:13:20Z</updated>
<author>
<name>Jake Hillion</name>
<email>jake@hillion.co.uk</email>
</author>
<published>2025-06-25T17:05:46Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=4ecf83741401c70d4420588ee1f3b1ca04ef58d5'/>
<id>urn:sha1:4ecf83741401c70d4420588ee1f3b1ca04ef58d5</id>
<content type='text'>
sched_ext performed a kfunc renaming pass in 6.13 and kept the old names
around for compatibility with old binaries. These were scheduled for
cleanup in 6.15 but were missed. Submitting for cleanup in for-next.

Removed the kfuncs, their flags, and any references I could find to them
in doc comments. Left the entries in include/scx/compat.bpf.h as they're
still useful to make new binaries compatible with old kernels.
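
Those compat entries follow the usual ksym-existence pattern, roughly
(illustrative macro, not a literal copy of include/scx/compat.bpf.h):

 /*
  * Call the renamed kfunc when the running kernel provides it, fall
  * back to the legacy name (declared __weak) on older kernels.
  */
 #define compat_dsq_insert(p, dsq, slice, flags)			\
 	(bpf_ksym_exists(scx_bpf_dsq_insert) ?				\
 	 scx_bpf_dsq_insert((p), (dsq), (slice), (flags)) :	\
 	 scx_bpf_dispatch((p), (dsq), (slice), (flags)))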

Tested by applying to my kernel. It builds and a modern version of
scx_lavd loads fine.

Signed-off-by: Jake Hillion &lt;jake@hillion.co.uk&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched_ext, rcu: Eject BPF scheduler on RCU CPU stall panic</title>
<updated>2025-06-24T23:05:26Z</updated>
<author>
<name>David Dai</name>
<email>david.dai@linux.dev</email>
</author>
<published>2025-06-24T22:49:06Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=cb444006a625c60e6d4dd3753863c3c74f96aac3'/>
<id>urn:sha1:cb444006a625c60e6d4dd3753863c3c74f96aac3</id>
<content type='text'>
For systems running a sched_ext scheduler with panic_on_rcu_stall
enabled, try kicking out the current scheduler before issuing a panic.

While there are numerous reasons for RCU CPU stalls that are not
directly attributable to the scheduler, deferring the panic gives
sched_ext an opportunity to provide additional debug info when ejecting
the current scheduler. Also, handling the event more gracefully allows
us to potentially recover the system instead of incurring additional
downtime.
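
The hook sits in the RCU stall path, roughly like this (illustrative:
the helper name and exact condition are assumptions, and the
surrounding RCU code is abridged):

 /* before panicking on a stall, let sched_ext eject the scheduler */
 if (sysctl_panic_on_rcu_stall &amp;&amp; !scx_rcu_cpu_stall())
 	panic("RCU Stall\n");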

Suggested-by: Tejun Heo &lt;tj@kernel.org&gt;
Reviewed-by: Paul E. McKenney &lt;paulmck@kernel.org&gt;
Signed-off-by: David Dai &lt;david.dai@linux.dev&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>kernel/sched/ext.c: fix typo "occured" -&gt; "occurred" in comments</title>
<updated>2025-06-23T18:11:16Z</updated>
<author>
<name>Ke Ma</name>
<email>makebit1999@gmail.com</email>
</author>
<published>2025-06-19T21:11:28Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=e2a37c277c64078d5439693963fb9813fa1e6e9c'/>
<id>urn:sha1:e2a37c277c64078d5439693963fb9813fa1e6e9c</id>
<content type='text'>
Fixes a minor spelling mistake in two comment lines.

Signed-off-by: Ke Ma &lt;makebit1999@gmail.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched_ext: Add support for cgroup bandwidth control interface</title>
<updated>2025-06-21T03:03:51Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2025-06-14T01:34:22Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=ddceadce63d9cb752c2472e220ded05cabaf7971'/>
<id>urn:sha1:ddceadce63d9cb752c2472e220ded05cabaf7971</id>
<content type='text'>
- Add CONFIG_GROUP_SCHED_BANDWIDTH which is selected by both
  CONFIG_CFS_BANDWIDTH and EXT_GROUP_SCHED.

- Put bandwidth control interface files for both cgroup v1 and v2 under
  CONFIG_GROUP_SCHED_BANDWIDTH.

- Update tg_bandwidth() to fetch configuration parameters from fair if
  CONFIG_CFS_BANDWIDTH, SCX otherwise.

- Update tg_set_bandwidth() to update the parameters for both fair and SCX.

- Add bandwidth control parameters to struct scx_cgroup_init_args.

- Add sched_ext_ops.cgroup_set_bandwidth() which is invoked on bandwidth
  control parameter updates (a sketch of its shape follows this list).

- Update scx_qmap and maximal selftest to test the new feature.
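
The new callback has roughly the following shape (illustrative sketch;
see the tree for the exact prototype):

 /* invoked whenever a cgroup's bandwidth parameters change */
 void (*cgroup_set_bandwidth)(struct cgroup *cgrp,
 			      u64 period_us, u64 quota_us, u64 burst_us);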

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched_ext, sched/core: Factor out struct scx_task_group</title>
<updated>2025-06-21T03:03:27Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2025-06-14T01:33:10Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=6e6558a6bc418f1478c5dc8609d03805364e0cb9'/>
<id>urn:sha1:6e6558a6bc418f1478c5dc8609d03805364e0cb9</id>
<content type='text'>
More sched_ext fields will be added to struct task_group. In preparation,
factor out sched_ext fields into struct scx_task_group to reduce clutter in
the common header. No functional changes.
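
A sketch of the result (fields illustrative; the point is that the SCX
per-task-group state now lives behind a single member):

 /* in the sched_ext header (illustrative) */
 struct scx_task_group {
 #ifdef CONFIG_EXT_GROUP_SCHED
 	u32	flags;		/* SCX_TG_* */
 	u32	weight;
 #endif
 };

 /* struct task_group then carries just: */
 struct scx_task_group scx;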

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
</feed>
