<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/kernel/sched/ext.c, branch v6.14</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v6.14</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v6.14'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2025-03-03T17:55:48Z</updated>
<entry>
<title>sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()</title>
<updated>2025-03-03T17:55:48Z</updated>
<author>
<name>Andrea Righi</name>
<email>arighi@nvidia.com</email>
</author>
<published>2025-03-03T17:51:59Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=9360dfe4cbd62ff1eb8217b815964931523b75b3'/>
<id>urn:sha1:9360dfe4cbd62ff1eb8217b815964931523b75b3</id>
<content type='text'>
If a BPF scheduler provides an invalid CPU (outside the nr_cpu_ids
range) as prev_cpu to scx_bpf_select_cpu_dfl() it can cause a kernel
crash.

To prevent this, validate prev_cpu in scx_bpf_select_cpu_dfl() and
trigger an scx error if an invalid CPU is specified.

Fixes: f0e1a0643a59b ("sched_ext: Implement BPF extensible scheduler class")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched_ext: Fix pick_task_scx() picking non-queued tasks when it's called without balance()</title>
<updated>2025-02-25T18:28:52Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2025-02-25T16:02:23Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=8fef0a3b17bb258130a4fcbcb5addf94b25e9ec5'/>
<id>urn:sha1:8fef0a3b17bb258130a4fcbcb5addf94b25e9ec5</id>
<content type='text'>
a6250aa251ea ("sched_ext: Handle cases where pick_task_scx() is called
without preceding balance_scx()") added a workaround to handle the cases
where pick_task_scx() is called without prececing balance_scx() which is due
to a fair class bug where pick_taks_fair() may return NULL after a true
return from balance_fair().

The workaround detects when pick_task_scx() is called without preceding
balance_scx() and emulates SCX_RQ_BAL_KEEP and triggers kicking to avoid
stalling. Unfortunately, the workaround code was testing whether @prev was
on SCX to decide whether to keep the task running. This is incorrect as the
task may be on SCX but no longer runnable.

This could lead to a non-runnable task to be returned from pick_task_scx()
which cause interesting confusions and failures. e.g. A common failure mode
is the task ending up with (!on_rq &amp;&amp; on_cpu) state which can cause
potential wakers to busy loop, which can easily lead to deadlocks.

Fix it by testing whether @prev has SCX_TASK_QUEUED set. This makes
@prev_on_scx only used in one place. Open code the usage and improve the
comment while at it.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reported-by: Pat Cody &lt;patcody@meta.com&gt;
Fixes: a6250aa251ea ("sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()")
Cc: stable@vger.kernel.org # v6.12+
Acked-by: Andrea Righi &lt;arighi@nvidia.com&gt;
</content>
</entry>
<entry>
<title>sched_ext: Use SCX_CALL_OP_TASK in task_tick_scx</title>
<updated>2025-02-13T16:57:33Z</updated>
<author>
<name>Chuyi Zhou</name>
<email>zhouchuyi@bytedance.com</email>
</author>
<published>2025-02-12T13:09:35Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=f5717c93a1b999970f3a64d771a1a9ee68cc37d0'/>
<id>urn:sha1:f5717c93a1b999970f3a64d771a1a9ee68cc37d0</id>
<content type='text'>
Now when we use scx_bpf_task_cgroup() in ops.tick() to get the cgroup of
the current task, the following error will occur:

scx_foo[3795244] triggered exit kind 1024:
  runtime error (called on a task not being operated on)

The reason is that we are using SCX_CALL_OP() instead of SCX_CALL_OP_TASK()
when calling ops.tick(), which triggers the error during the subsequent
scx_kf_allowed_on_arg_tasks() check.

SCX_CALL_OP_TASK() was first introduced in commit 36454023f50b ("sched_ext:
Track tasks that are subjects of the in-flight SCX operation") to ensure
task's rq lock is held when accessing task's sched_group. Since ops.tick()
is marked as SCX_KF_TERMINAL and task_tick_scx() is protected by the rq
lock, we can use SCX_CALL_OP_TASK() to avoid the above issue. Similarly,
the same changes should be made for ops.disable() and ops.exit_task(), as
they are also protected by task_rq_lock() and it's safe to access the
task's task_group.

Fixes: 36454023f50b ("sched_ext: Track tasks that are subjects of the in-flight SCX operation")
Signed-off-by: Chuyi Zhou &lt;zhouchuyi@bytedance.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched_ext: Fix incorrect assumption about migration disabled tasks in task_can_run_on_remote_rq()</title>
<updated>2025-02-10T20:40:47Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2025-02-10T19:27:09Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=f3f08c3acfb8860e07a22814a344e83c99ad7398'/>
<id>urn:sha1:f3f08c3acfb8860e07a22814a344e83c99ad7398</id>
<content type='text'>
While fixing migration disabled task handling, 32966821574c ("sched_ext: Fix
migration disabled handling in targeted dispatches") assumed that a
migration disabled task's -&gt;cpus_ptr would only have the pinned CPU. While
this is eventually true for migration disabled tasks that are switched out,
-&gt;cpus_ptr update is performed by migrate_disable_switch() which is called
right before context_switch() in __scheduler(). However, the task is
enqueued earlier during pick_next_task() via put_prev_task_scx(), so there
is a race window where another CPU can see the task on a DSQ.

If the CPU tries to dispatch the migration disabled task while in that
window, task_allowed_on_cpu() will succeed and task_can_run_on_remote_rq()
will subsequently trigger SCHED_WARN(is_migration_disabled()).

  WARNING: CPU: 8 PID: 1837 at kernel/sched/ext.c:2466 task_can_run_on_remote_rq+0x12e/0x140
  Sched_ext: layered (enabled+all), task: runnable_at=-10ms
  RIP: 0010:task_can_run_on_remote_rq+0x12e/0x140
  ...
   &lt;TASK&gt;
   consume_dispatch_q+0xab/0x220
   scx_bpf_dsq_move_to_local+0x58/0xd0
   bpf_prog_84dd17b0654b6cf0_layered_dispatch+0x290/0x1cfa
   bpf__sched_ext_ops_dispatch+0x4b/0xab
   balance_one+0x1fe/0x3b0
   balance_scx+0x61/0x1d0
   prev_balance+0x46/0xc0
   __pick_next_task+0x73/0x1c0
   __schedule+0x206/0x1730
   schedule+0x3a/0x160
   __do_sys_sched_yield+0xe/0x20
   do_syscall_64+0xbb/0x1e0
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fix it by converting the SCHED_WARN() back to a regular failure path. Also,
perform the migration disabled test before task_allowed_on_cpu() test so
that BPF schedulers which fail to handle migration disabled tasks can be
noticed easily.

While at it, adjust scx_ops_error() message for !task_allowed_on_cpu() case
for brevity and consistency.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Fixes: 32966821574c ("sched_ext: Fix migration disabled handling in targeted dispatches")
Acked-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Reported-by: Jake Hillion &lt;jakehillion@meta.com&gt;
</content>
</entry>
<entry>
<title>sched_ext: Fix migration disabled handling in targeted dispatches</title>
<updated>2025-02-09T06:33:25Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2025-02-07T20:59:06Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=32966821574cd2917bd60f2554f435fe527f4702'/>
<id>urn:sha1:32966821574cd2917bd60f2554f435fe527f4702</id>
<content type='text'>
A dispatch operation that can target a specific local DSQ -
scx_bpf_dsq_move_to_local() or scx_bpf_dsq_move() - checks whether the task
can be migrated to the target CPU using task_can_run_on_remote_rq(). If the
task can't be migrated to the targeted CPU, it is bounced through a global
DSQ.

task_can_run_on_remote_rq() assumes that the task is on a CPU that's
different from the targeted CPU but the callers doesn't uphold the
assumption and may call the function when the task is already on the target
CPU. When such task has migration disabled, task_can_run_on_remote_rq() ends
up returning %false incorrectly unnecessarily bouncing the task to a global
DSQ.

Fix it by updating the callers to only call task_can_run_on_remote_rq() when
the task is on a different CPU than the target CPU. As this is a bit subtle,
for clarity and documentation:

- Make task_can_run_on_remote_rq() trigger SCHED_WARN_ON() if the task is on
  the same CPU as the target CPU.

- is_migration_disabled() test in task_can_run_on_remote_rq() cannot trigger
  if the task is on a different CPU than the target CPU as the preceding
  task_allowed_on_cpu() test should fail beforehand. Convert the test into
  SCHED_WARN_ON().

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Fixes: 4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
Fixes: 0366017e0973 ("sched_ext: Use task_can_run_on_remote_rq() test in dispatch_to_local_dsq()")
Cc: stable@vger.kernel.org # v6.12+
</content>
</entry>
<entry>
<title>sched_ext: Implement auto local dispatching of migration disabled tasks</title>
<updated>2025-02-09T06:32:54Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2025-02-07T20:58:23Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=2fa0fbeb69edd367b7c44f484e8dc5a5a1a311ef'/>
<id>urn:sha1:2fa0fbeb69edd367b7c44f484e8dc5a5a1a311ef</id>
<content type='text'>
Migration disabled tasks are special and pinned to their previous CPUs. They
tripped up some unsuspecting BPF schedulers as their -&gt;nr_cpus_allowed may
not agree with the bits set in -&gt;cpus_ptr. Make it easier for BPF schedulers
by automatically dispatching them to the pinned local DSQs by default. If a
BPF scheduler wants to handle migration disabled tasks explicitly, it can
set SCX_OPS_ENQ_MIGRATION_DISABLED.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Acked-by: Andrea Righi &lt;arighi@nvidia.com&gt;
</content>
</entry>
<entry>
<title>sched_ext: Fix lock imbalance in dispatch_to_local_dsq()</title>
<updated>2025-01-27T22:41:12Z</updated>
<author>
<name>Andrea Righi</name>
<email>arighi@nvidia.com</email>
</author>
<published>2025-01-27T22:06:16Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=1626e5ef0b00386a4fd083fa7c46c8edbd75f9b4'/>
<id>urn:sha1:1626e5ef0b00386a4fd083fa7c46c8edbd75f9b4</id>
<content type='text'>
While performing the rq locking dance in dispatch_to_local_dsq(), we may
trigger the following lock imbalance condition, in particular when
multiple tasks are rapidly changing CPU affinity (i.e., running a
`stress-ng --race-sched 0`):

[   13.413579] =====================================
[   13.413660] WARNING: bad unlock balance detected!
[   13.413729] 6.13.0-virtme #15 Not tainted
[   13.413792] -------------------------------------
[   13.413859] kworker/1:1/80 is trying to release lock (&amp;rq-&gt;__lock) at:
[   13.413954] [&lt;ffffffff873c6c48&gt;] dispatch_to_local_dsq+0x108/0x1a0
[   13.414111] but there are no more locks to release!
[   13.414176]
[   13.414176] other info that might help us debug this:
[   13.414258] 1 lock held by kworker/1:1/80:
[   13.414318]  #0: ffff8b66feb41698 (&amp;rq-&gt;__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x20/0x90
[   13.414612]
[   13.414612] stack backtrace:
[   13.415255] CPU: 1 UID: 0 PID: 80 Comm: kworker/1:1 Not tainted 6.13.0-virtme #15
[   13.415505] Workqueue:  0x0 (events)
[   13.415567] Sched_ext: dsp_local_on (enabled+all), task: runnable_at=-2ms
[   13.415570] Call Trace:
[   13.415700]  &lt;TASK&gt;
[   13.415744]  dump_stack_lvl+0x78/0xe0
[   13.415806]  ? dispatch_to_local_dsq+0x108/0x1a0
[   13.415884]  print_unlock_imbalance_bug+0x11b/0x130
[   13.415965]  ? dispatch_to_local_dsq+0x108/0x1a0
[   13.416226]  lock_release+0x231/0x2c0
[   13.416326]  _raw_spin_unlock+0x1b/0x40
[   13.416422]  dispatch_to_local_dsq+0x108/0x1a0
[   13.416554]  flush_dispatch_buf+0x199/0x1d0
[   13.416652]  balance_one+0x194/0x370
[   13.416751]  balance_scx+0x61/0x1e0
[   13.416848]  prev_balance+0x43/0xb0
[   13.416947]  __pick_next_task+0x6b/0x1b0
[   13.417052]  __schedule+0x20d/0x1740

This happens because dispatch_to_local_dsq() is racing with
dispatch_dequeue() and, when the latter wins, we incorrectly assume that
the task has been moved to dst_rq.

Fix by properly tracking the currently locked rq.

Fixes: 4d3ca89bdd31 ("sched_ext: Refactor consume_remote_task()")
Signed-off-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched_ext: Fix incorrect autogroup migration detection</title>
<updated>2025-01-27T18:31:50Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2025-01-24T22:22:12Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=d6f3e7d564b2309e1f17e709a70eca78d7ca2bb8'/>
<id>urn:sha1:d6f3e7d564b2309e1f17e709a70eca78d7ca2bb8</id>
<content type='text'>
scx_move_task() is called from sched_move_task() and tells the BPF scheduler
that cgroup migration is being committed. sched_move_task() is used by both
cgroup and autogroup migrations and scx_move_task() tried to filter out
autogroup migrations by testing the destination cgroup and PF_EXITING but
this is not enough. In fact, without explicitly tagging the thread which is
doing the cgroup migration, there is no good way to tell apart
scx_move_task() invocations for racing migration to the root cgroup and an
autogroup migration.

This led to scx_move_task() incorrectly ignoring a migration from non-root
cgroup to an autogroup of the root cgroup triggering the following warning:

  WARNING: CPU: 7 PID: 1 at kernel/sched/ext.c:3725 scx_cgroup_can_attach+0x196/0x340
  ...
  Call Trace:
  &lt;TASK&gt;
    cgroup_migrate_execute+0x5b1/0x700
    cgroup_attach_task+0x296/0x400
    __cgroup_procs_write+0x128/0x140
    cgroup_procs_write+0x17/0x30
    kernfs_fop_write_iter+0x141/0x1f0
    vfs_write+0x31d/0x4a0
    __x64_sys_write+0x72/0xf0
    do_syscall_64+0x82/0x160
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fix it by adding an argument to sched_move_task() that indicates whether the
moving is for a cgroup or autogroup migration. After the change,
scx_move_task() is called only for cgroup migrations and renamed to
scx_cgroup_move_task().

Link: https://github.com/sched-ext/scx/issues/370
Fixes: 819513666966 ("sched_ext: Add cgroup support")
Cc: stable@vger.kernel.org # v6.12+
Acked-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched_ext: Include task weight in the error state dump</title>
<updated>2025-01-24T18:07:42Z</updated>
<author>
<name>Andrea Righi</name>
<email>arighi@nvidia.com</email>
</author>
<published>2025-01-22T09:05:25Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=2279563e3a8cac367b267b09c15cf1e39c06c5cc'/>
<id>urn:sha1:2279563e3a8cac367b267b09c15cf1e39c06c5cc</id>
<content type='text'>
Report the task weight when dumping the task state during an error exit.
Moreover, adjust the output format to display dsq_vtime, slice, and
weight on the same line.

This can help identify whether certain tasks were excessively
prioritized or de-prioritized due to large niceness gaps.

Signed-off-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched_ext: Fixes typos in comments</title>
<updated>2025-01-24T18:06:05Z</updated>
<author>
<name>Atul Kumar Pant</name>
<email>atulpant.linux@gmail.com</email>
</author>
<published>2025-01-18T08:39:27Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=be8ee18152b0523752f3a44900363838bd1573bb'/>
<id>urn:sha1:be8ee18152b0523752f3a44900363838bd1573bb</id>
<content type='text'>
Fixes some spelling errors in the comments.

Signed-off-by: Atul Kumar Pant &lt;atulpant.linux@gmail.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
</feed>
