<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/kernel/sched, branch v6.13</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v6.13</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v6.13'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2025-01-19T17:01:17Z</updated>
<entry>
<title>Merge tag 'sched_urgent_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2025-01-19T17:01:17Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2025-01-19T17:01:17Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=8ff6d472ab35d5cb9a3941a1fcd5b7cbc9338c7f'/>
<id>urn:sha1:8ff6d472ab35d5cb9a3941a1fcd5b7cbc9338c7f</id>
<content type='text'>
Pull scheduler fixes from Borislav Petkov:

 - Do not adjust the weight of empty group entities and avoid
   scheduling artifacts

 - Avoid scheduling lag by computing lag properly and thus address
   an EEVDF entity placement issue

* tag 'sched_urgent_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE
  sched/fair: Fix EEVDF entity placement bug causing scheduling lag
</content>
</entry>
<entry>
<title>sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE</title>
<updated>2025-01-13T12:50:56Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2025-01-13T12:50:11Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=66951e4860d3c688bfa550ea4a19635b57e00eca'/>
<id>urn:sha1:66951e4860d3c688bfa550ea4a19635b57e00eca</id>
<content type='text'>
Normally dequeue_entities() will continue to dequeue an empty group entity;
except DELAY_DEQUEUE changes things -- it retains empty entities such that they
might continue to compete and burn off some lag.

However, doing this results in update_cfs_group() re-computing the cgroup
weight 'slice' for an empty group, which it (rightly) figures isn't much at
all. This in turn means that the delayed entity is not competing at the
expected weight. Worse, the very low weight causes its lag to be inflated,
which combined with avg_vruntime() using scale_load_down(), leads to artifacts.

As such, don't adjust the weight for empty group entities and let them compete
at their original weight.

Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lkml.kernel.org/r/20250110115720.GA17405@noisy.programming.kicks-ass.net
</content>
</entry>
<entry>
<title>Merge tag 'sched_ext-for-6.13-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext</title>
<updated>2025-01-10T23:11:58Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2025-01-10T23:11:58Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=2e3f3090bd8bf61633b36ae3b13d4a6e777f182a'/>
<id>urn:sha1:2e3f3090bd8bf61633b36ae3b13d4a6e777f182a</id>
<content type='text'>
Pull sched_ext fixes from Tejun Heo:

 - Fix corner case bug where ops.dispatch() couldn't extend the
   execution of the current task if SCX_OPS_ENQ_LAST is set.

 - Fix ops.cpu_release() not being called when a SCX task is preempted
   by a higher priority sched class task.

 - Fix buitin idle mask being incorrectly left as busy after an idle CPU
   is picked and kicked.

 - scx_ops_bypass() was unnecessarily using rq_lock() which comes with
   rq pinning related sanity checks which could trigger spuriously.
   Switch to raw_spin_rq_lock().

* tag 'sched_ext-for-6.13-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: idle: Refresh idle masks during idle-to-idle transitions
  sched_ext: switch class when preempted by higher priority scheduler
  sched_ext: Replace rq_lock() to raw_spin_rq_lock() in scx_ops_bypass()
  sched_ext: keep running prev when prev-&gt;scx.slice != 0
</content>
</entry>
<entry>
<title>sched_ext: idle: Refresh idle masks during idle-to-idle transitions</title>
<updated>2025-01-10T22:40:42Z</updated>
<author>
<name>Andrea Righi</name>
<email>arighi@nvidia.com</email>
</author>
<published>2025-01-10T22:16:31Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=a2a3374c47c428c0edb0bbc693638d4783f81e31'/>
<id>urn:sha1:a2a3374c47c428c0edb0bbc693638d4783f81e31</id>
<content type='text'>
With the consolidation of put_prev_task/set_next_task(), see
commit 436f3eed5c69 ("sched: Combine the last put_prev_task() and the
first set_next_task()"), we are now skipping the transition between
these two functions when the previous and the next tasks are the same.

As a result, the scx idle state of a CPU is updated only when
transitioning to or from the idle thread. While this is generally
correct, it can lead to uneven and inefficient core utilization in
certain scenarios [1].

A typical scenario involves proactive wake-ups: scx_bpf_pick_idle_cpu()
selects and marks an idle CPU as busy, followed by a wake-up via
scx_bpf_kick_cpu(), without dispatching any tasks. In this case, the CPU
continues running the idle thread, returns to idle, but remains marked
as busy, preventing it from being selected again as an idle CPU (until a
task eventually runs on it and releases the CPU).

For example, running a workload that uses 20% of each CPU, combined with
an scx scheduler using proactive wake-ups, results in the following core
utilization:

 CPU 0: 25.7%
 CPU 1: 29.3%
 CPU 2: 26.5%
 CPU 3: 25.5%
 CPU 4:  0.0%
 CPU 5: 25.5%
 CPU 6:  0.0%
 CPU 7: 10.5%

To address this, refresh the idle state also in pick_task_idle(), during
idle-to-idle transitions, but only trigger ops.update_idle() on actual
state changes to prevent unnecessary updates to the scx scheduler and
maintain balanced state transitions.

With this change in place, the core utilization in the previous example
becomes the following:

 CPU 0: 18.8%
 CPU 1: 19.4%
 CPU 2: 18.0%
 CPU 3: 18.7%
 CPU 4: 19.3%
 CPU 5: 18.9%
 CPU 6: 18.7%
 CPU 7: 19.3%

[1] https://github.com/sched-ext/scx/pull/1139

Fixes: 7c65ae81ea86 ("sched_ext: Don't call put_prev_task_scx() before picking the next task")
Signed-off-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/fair: Fix EEVDF entity placement bug causing scheduling lag</title>
<updated>2025-01-09T11:55:27Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2025-01-09T10:59:59Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=6d71a9c6160479899ee744d2c6d6602a191deb1f'/>
<id>urn:sha1:6d71a9c6160479899ee744d2c6d6602a191deb1f</id>
<content type='text'>
I noticed this in my traces today:

       turbostat-1222    [006] d..2.   311.935649: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6)
                               { weight: 1048576 avg_vruntime: 3184159639071 vruntime: 3184159640194 (-1123) deadline: 3184162621107 } -&gt;
                               { weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 }
       turbostat-1222    [006] d..2.   311.935651: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6)
                               { weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 } -&gt;
                               { weight: 1048576 avg_vruntime: 3184176414812 vruntime: 3184177464419 (-1049607) deadline: 3184180445332 }

Which is a weight transition: 1048576 -&gt; 2 -&gt; 1048576.

One would expect the lag to shoot out *AND* come back, notably:

  -1123*1048576/2 = -588775424
  -588775424*2/1048576 = -1123

Except the trace shows it is all off. Worse, subsequent cycles shoot it
out further and further.

This made me have a very hard look at reweight_entity(), and
specifically the -&gt;on_rq case, which is more prominent with
DELAY_DEQUEUE.

And indeed, it is all sorts of broken. While the computation of the new
lag is correct, the computation for the new vruntime, using the new lag
is broken for it does not consider the logic set out in place_entity().

With the below patch, I now see things like:

    migration/12-55      [012] d..3.   309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12)
                               { weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475 } -&gt;
                               { weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203 }
    migration/14-62      [014] d..3.   309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15)
                               { weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline: 6316614641111 } -&gt;
                               { weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650 }

Which isn't perfect yet, but much closer.

Reported-by: Doug Smythies &lt;dsmythies@telus.net&gt;
Reported-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Tested-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight")
Link: https://lore.kernel.org/r/20250109105959.GA2981@noisy.programming.kicks-ass.net
</content>
</entry>
<entry>
<title>sched_ext: switch class when preempted by higher priority scheduler</title>
<updated>2025-01-08T16:51:40Z</updated>
<author>
<name>Honglei Wang</name>
<email>jameshongleiwang@126.com</email>
</author>
<published>2025-01-08T02:33:28Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=68e449d849fd50bd5e61d8bd32b3458dbd3a3df6'/>
<id>urn:sha1:68e449d849fd50bd5e61d8bd32b3458dbd3a3df6</id>
<content type='text'>
ops.cpu_release() function, if defined, must be invoked when preempted by
a higher priority scheduler class task. This scenario was skipped in
commit f422316d7466 ("sched_ext: Remove switch_class_scx()"). Let's fix
it.

Fixes: f422316d7466 ("sched_ext: Remove switch_class_scx()")
Signed-off-by: Honglei Wang &lt;jameshongleiwang@126.com&gt;
Acked-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched_ext: Replace rq_lock() to raw_spin_rq_lock() in scx_ops_bypass()</title>
<updated>2025-01-08T16:48:53Z</updated>
<author>
<name>Changwoo Min</name>
<email>changwoo@igalia.com</email>
</author>
<published>2025-01-08T15:08:06Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=6268d5bc10354fc2ab8d44a0cd3b042d49a0417e'/>
<id>urn:sha1:6268d5bc10354fc2ab8d44a0cd3b042d49a0417e</id>
<content type='text'>
scx_ops_bypass() iterates all CPUs to re-enqueue all the scx tasks.
For each CPU, it acquires a lock using rq_lock() regardless of whether
a CPU is offline or the CPU is currently running a task in a higher
scheduler class (e.g., deadline). The rq_lock() is supposed to be used
for online CPUs, and the use of rq_lock() may trigger an unnecessary
warning in rq_pin_lock(). Therefore, replace rq_lock() to
raw_spin_rq_lock() in scx_ops_bypass().

Without this change, we observe the following warning:

===== START =====
[    6.615205] rq-&gt;balance_callback &amp;&amp; rq-&gt;balance_callback != &amp;balance_push_callback
[    6.615208] WARNING: CPU: 2 PID: 0 at kernel/sched/sched.h:1730 __schedule+0x1130/0x1c90
=====  END  =====

Fixes: 0e7ffff1b811 ("scx: Fix raciness in scx_ops_bypass()")
Signed-off-by: Changwoo Min &lt;changwoo@igalia.com&gt;
Acked-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched_ext: keep running prev when prev-&gt;scx.slice != 0</title>
<updated>2025-01-08T16:48:33Z</updated>
<author>
<name>Henry Huang</name>
<email>henry.hj@antgroup.com</email>
</author>
<published>2025-01-08T08:47:10Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=30dd3b13f9de612ef7328ccffcf1a07d0d40ab51'/>
<id>urn:sha1:30dd3b13f9de612ef7328ccffcf1a07d0d40ab51</id>
<content type='text'>
When %SCX_OPS_ENQ_LAST is set and prev-&gt;scx.slice != 0,
@prev will be dispacthed into the local DSQ in put_prev_task_scx().
However, pick_task_scx() is executed before put_prev_task_scx(),
so it will not pick @prev.
Set %SCX_RQ_BAL_KEEP in balance_one() to ensure that pick_task_scx()
can pick @prev.

Signed-off-by: Henry Huang &lt;henry.hj@antgroup.com&gt;
Acked-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'sched_ext-for-6.13-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext</title>
<updated>2025-01-03T23:09:12Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2025-01-03T23:09:12Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=63676eefb7a026d04b51dcb7aaf54f358517a2ec'/>
<id>urn:sha1:63676eefb7a026d04b51dcb7aaf54f358517a2ec</id>
<content type='text'>
Pull sched_ext fixes from Tejun Heo:

 - Fix a bug where bpf_iter_scx_dsq_new() was not initializing the
   iterator's flags and could inadvertently enable e.g. reverse
   iteration

 - Fix a bug where scx_ops_bypass() could call irq_restore twice

 - Add Andrea and Changwoo as maintainers for better review coverage

 - selftests and tools/sched_ext build and other fixes

* tag 'sched_ext-for-6.13-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Fix dsq_local_on selftest
  sched_ext: initialize kit-&gt;cursor.flags
  sched_ext: Fix invalid irq restore in scx_ops_bypass()
  MAINTAINERS: add me as reviewer for sched_ext
  MAINTAINERS: add self as reviewer for sched_ext
  scx: Fix maximal BPF selftest prog
  sched_ext: fix application of sizeof to pointer
  selftests/sched_ext: fix build after renames in sched_ext API
  sched_ext: Add __weak to fix the build errors
</content>
</entry>
<entry>
<title>sched_ext: initialize kit-&gt;cursor.flags</title>
<updated>2024-12-24T20:56:08Z</updated>
<author>
<name>Henry Huang</name>
<email>henry.hj@antgroup.com</email>
</author>
<published>2024-12-22T15:43:16Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=35bf430e08a18fdab6eb94492a06d9ad14c6179b'/>
<id>urn:sha1:35bf430e08a18fdab6eb94492a06d9ad14c6179b</id>
<content type='text'>
struct bpf_iter_scx_dsq *it maybe not initialized.
If we didn't call scx_bpf_dsq_move_set_vtime and scx_bpf_dsq_move_set_slice
before scx_bpf_dsq_move, it would cause unexpected behaviors:
1. Assign a huge slice into p-&gt;scx.slice
2. Assign a invalid vtime into p-&gt;scx.dsq_vtime

Signed-off-by: Henry Huang &lt;henry.hj@antgroup.com&gt;
Fixes: 6462dd53a260 ("sched_ext: Compact struct bpf_iter_scx_dsq_kern")
Cc: stable@vger.kernel.org # v6.12
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
</feed>
