<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/kernel/sched, branch v6.16</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/</subtitle>
<id>https://git.shady.money/linux/atom?h=v6.16</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v6.16'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2025-07-20T18:08:51Z</updated>
<entry>
<title>Merge tag 'sched-urgent-2025-07-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2025-07-20T18:08:51Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2025-07-20T18:08:51Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=62347e279092ae704877467abdc8533e914f945e'/>
<id>urn:sha1:62347e279092ae704877467abdc8533e914f945e</id>
<content type='text'>
Pull scheduler fix from Thomas Gleixner:
 "A single fix for the scheduler.

  A recent commit changed the runqueue counter nr_uninterruptible to an
  unsigned int. Because the counters are not updated when an
  uninterruptible task migrates to a different CPU, they can exceed
  INT_MAX.

  The counter is cast to long in the load average calculation, so
  values above INT_MAX wrap into negative space, resulting in bogus
  load average values.

  Convert it back to unsigned long to fix this"

* tag 'sched-urgent-2025-07-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Change nr_uninterruptible type to unsigned long
</content>
</entry>
<entry>
<title>Merge tag 'sched_ext-for-6.16-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext</title>
<updated>2025-07-19T17:40:30Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2025-07-19T17:40:30Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=bf61759db409ce21a8f2a5bb442b7c35905a713d'/>
<id>urn:sha1:bf61759db409ce21a8f2a5bb442b7c35905a713d</id>
<content type='text'>
Pull sched_ext fixes from Tejun Heo:

 - Fix handling of migration disabled tasks in default idle selection

 - update_locked_rq() called __this_cpu_write() spuriously with NULL
   when @rq was not locked. As the writes were spurious, it didn't break
   anything directly. However, the function could be called in a
   preemptible context, leading to a context warning in
   __this_cpu_write(). Skip the spurious NULL writes.

 - Selftest fix on UP

* tag 'sched_ext-for-6.16-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: idle: Handle migration-disabled tasks in idle selection
  sched/ext: Prevent update_locked_rq() calls with NULL rq
  selftests/sched_ext: Fix exit selftest hang on UP
</content>
</entry>
<entry>
<title>sched_ext: idle: Handle migration-disabled tasks in idle selection</title>
<updated>2025-07-17T18:19:38Z</updated>
<author>
<name>Andrea Righi</name>
<email>arighi@nvidia.com</email>
</author>
<published>2025-07-05T05:43:51Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=06efc9fe0b8deeb83b47fd7c5451fe1a60c8a761'/>
<id>urn:sha1:06efc9fe0b8deeb83b47fd7c5451fe1a60c8a761</id>
<content type='text'>
When SCX_OPS_ENQ_MIGRATION_DISABLED is enabled, migration-disabled tasks
are also routed to ops.enqueue(). A scheduler may attempt to dispatch
such tasks directly to an idle CPU using the default idle selection
policy via scx_bpf_select_cpu_and() or scx_bpf_select_cpu_dfl().

This scenario must be properly handled by the built-in idle policy to
avoid returning an idle CPU where the target task isn't allowed to run.
Otherwise, it can lead to errors such as:

 EXIT: runtime error (SCX_DSQ_LOCAL[_ON] cannot move migration disabled Chrome_ChildIOT[291646] from CPU 3 to 14)

Prevent this by explicitly handling migration-disabled tasks in the
built-in idle selection logic, maintaining their CPU affinity.
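
Roughly, the shape of the fix (a sketch only; is_migration_disabled()
exists in the kernel, but the helper name and return convention below
are assumptions, not the actual kfunc internals):

    /* treat a migration-disabled task like a single-CPU task */
    if (p-&gt;nr_cpus_allowed == 1 || is_migration_disabled(p)) {
        if (scx_idle_test_and_clear_cpu(prev_cpu))
            return prev_cpu;  /* previous CPU is idle and allowed */
        return -EBUSY;        /* no idle CPU the task may run on */
    }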

Fixes: a730e3f7a48bc ("sched_ext: idle: Consolidate default idle CPU selection kfuncs")
Signed-off-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/ext: Prevent update_locked_rq() calls with NULL rq</title>
<updated>2025-07-17T01:02:12Z</updated>
<author>
<name>Breno Leitao</name>
<email>leitao@debian.org</email>
</author>
<published>2025-07-16T17:38:48Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=e14fd98c6d66cb76694b12c05768e4f9e8c95664'/>
<id>urn:sha1:e14fd98c6d66cb76694b12c05768e4f9e8c95664</id>
<content type='text'>
Avoid invoking update_locked_rq() when the runqueue (rq) pointer is NULL
in the SCX_CALL_OP and SCX_CALL_OP_RET macros.

Previously, calling update_locked_rq(NULL) with preemption enabled could
trigger the following warning:

    BUG: using __this_cpu_write() in preemptible [00000000]

This happens because __this_cpu_write() is unsafe to use in preemptible
context.

rq is NULL when an ops is invoked from an unlocked context. In such
cases, we don't need to store any rq, since the value should already be
NULL (unlocked). Ensure that update_locked_rq() is only called when rq
is non-NULL, preventing __this_cpu_write() from being called in
preemptible context.
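
A sketch of the guard (the real change lives in the SCX_CALL_OP*()
macros; the op invocation below is a placeholder, not the actual macro
body):

    /* only track the rq when one is actually locked */
    if (rq)
        update_locked_rq(rq);
    ret = ops-&gt;some_op(args);    /* hypothetical op invocation */
    if (rq)
        update_locked_rq(NULL);  /* clear the tracking afterwards */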

Suggested-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Fixes: 18853ba782bef ("sched_ext: Track currently locked rq")
Signed-off-by: Breno Leitao &lt;leitao@debian.org&gt;
Acked-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: stable@vger.kernel.org # v6.15
</content>
</entry>
<entry>
<title>sched: Change nr_uninterruptible type to unsigned long</title>
<updated>2025-07-14T08:59:31Z</updated>
<author>
<name>Aruna Ramakrishna</name>
<email>aruna.ramakrishna@oracle.com</email>
</author>
<published>2025-07-09T17:33:28Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=36569780b0d64de283f9d6c2195fd1a43e221ee8'/>
<id>urn:sha1:36569780b0d64de283f9d6c2195fd1a43e221ee8</id>
<content type='text'>
The commit e6fe3f422be1 ("sched: Make multiple runqueue task counters
32-bit") changed nr_uninterruptible to an unsigned int. But the
nr_uninterruptible values for each of the CPU runqueues can grow to
large numbers, sometimes exceeding INT_MAX. This is valid if, over
time, a large number of tasks are migrated off one CPU after going
into an uninterruptible state. Only the sum of all nr_uninterruptible
values across all CPUs yields the correct result, as explained in a
comment in kernel/sched/loadavg.c.

Change the type of nr_uninterruptible back to unsigned long to prevent
overflows, and thus the miscalculation of load average.
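
A userspace sketch of the overflow (illustrative only; in the kernel
the fold happens in calc_load_fold_active(), which effectively
truncated the counter through a signed 32-bit value before this fix):

    #include &lt;limits.h&gt;
    #include &lt;stdio.h&gt;

    int main(void)
    {
        /* a per-CPU counter that legitimately grew past INT_MAX */
        unsigned int nr_uninterruptible = (unsigned int)INT_MAX + 123u;

        long broken = (int)nr_uninterruptible;  /* wraps negative */
        long fixed  = (long)(unsigned long)nr_uninterruptible;

        printf("broken=%ld fixed=%ld\n", broken, fixed);
        return 0;
    }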

Fixes: e6fe3f422be1 ("sched: Make multiple runqueue task counters 32-bit")

Signed-off-by: Aruna Ramakrishna &lt;aruna.ramakrishna@oracle.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lkml.kernel.org/r/20250709173328.606794-1-aruna.ramakrishna@oracle.com
</content>
</entry>
<entry>
<title>Revert "sched/numa: add statistics of numa balance task"</title>
<updated>2025-07-10T04:07:56Z</updated>
<author>
<name>Chen Yu</name>
<email>yu.c.chen@intel.com</email>
</author>
<published>2025-07-04T13:56:20Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=db6cc3f4ac2e6cdc898fc9cbc8b32ae1bf56bdad'/>
<id>urn:sha1:db6cc3f4ac2e6cdc898fc9cbc8b32ae1bf56bdad</id>
<content type='text'>
This reverts commit ad6b26b6a0a79166b53209df2ca1cf8636296382.

This commit introduced per-memcg/task NUMA balance statistics, but
unfortunately it also introduced a NULL pointer exception due to the
following race condition: after a swap task candidate was chosen, its
mm_struct pointer was set to NULL due to task exit.  Later, when
performing the actual task swapping, dereferencing the now-NULL p-&gt;mm
caused the problem.

CPU0                                   CPU1
:
...
task_numa_migrate
     task_numa_find_cpu
      task_numa_compare
        # a normal task p is chosen
        env-&gt;best_task = p

                                          # p exit:
                                          exit_signals(p);
                                             p-&gt;flags |= PF_EXITING
                                          exit_mm
                                             p-&gt;mm = NULL;

      migrate_swap_stop
         __migrate_swap_task(arg-&gt;src_task, arg-&gt;dst_cpu)
          count_memcg_event_mm(p-&gt;mm, NUMA_TASK_SWAP) # p-&gt;mm is NULL

task_lock() should be held and the PF_EXITING flag needs to be checked to
prevent this from happening.  After discussion, the conclusion was that
adding a lock is not worthwhile for some statistics calculations.  Revert
the change and rely on the tracepoint for this purpose.
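
For illustration, a sketch of the guard that was discussed and deemed
too heavy just for statistics (count_memcg_event_mm() and
NUMA_TASK_SWAP are from the reverted patch):

    /* take task_lock() and re-check PF_EXITING before touching mm */
    task_lock(p);
    if (!(p-&gt;flags &amp; PF_EXITING) &amp;&amp; p-&gt;mm)
        count_memcg_event_mm(p-&gt;mm, NUMA_TASK_SWAP);
    task_unlock(p);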

Link: https://lkml.kernel.org/r/20250704135620.685752-1-yu.c.chen@intel.com
Link: https://lkml.kernel.org/r/20250708064917.BBD13C4CEED@smtp.kernel.org
Fixes: ad6b26b6a0a7 ("sched/numa: add statistics of numa balance task")
Signed-off-by: Chen Yu &lt;yu.c.chen@intel.com&gt;
Reported-by: Jirka Hladky &lt;jhladky@redhat.com&gt;
Closes: https://lore.kernel.org/all/CAE4VaGBLJxpd=NeRJXpSCuw=REhC5LWJpC29kDy-Zh2ZDyzQZA@mail.gmail.com/
Reported-by: Srikanth Aithal &lt;Srikanth.Aithal@amd.com&gt;
Reported-by: Suneeth D &lt;Suneeth.D@amd.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Borislav Petkov &lt;bp@alien8.de&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: Jiri Hladky &lt;jhladky@redhat.com&gt;
Cc: Libo Chen &lt;libo.chen@oracle.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>sched/deadline: Fix dl_server runtime calculation formula</title>
<updated>2025-07-04T08:35:56Z</updated>
<author>
<name>kuyo chang</name>
<email>kuyo.chang@mediatek.com</email>
</author>
<published>2025-07-02T02:12:25Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=fc975cfb36393db1db517fbbe366e550bcdcff14'/>
<id>urn:sha1:fc975cfb36393db1db517fbbe366e550bcdcff14</id>
<content type='text'>
In our testing with 6.12 based kernel on a big.LITTLE system, we were
seeing instances of RT tasks being blocked from running on the LITTLE
cpus for multiple seconds of time, apparently by the dl_server. This
far exceeds the default configured 50ms per second runtime.

This is due to the fair dl_server runtime calculation being scaled
for frequency &amp; capacity of the cpu.

Consider the following case under a big.LITTLE architecture.
Assume the runtime is 50,000,000 ns and the frequency/capacity
scale-invariance factors (applied with a fixed-point shift of 10) are:
Frequency scale-invariance: 100
Capacity scale-invariance: 50
First, frequency scale-invariance scales the runtime to
50,000,000 * 100 &gt;&gt; 10 = 4,882,812 ns.
Then capacity scale-invariance further scales it to
4,882,812 * 50 &gt;&gt; 10 = 238,418 ns.
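
The same arithmetic as a runnable userspace sketch (the factor values
are the assumptions above, not real hardware numbers):

    #include &lt;stdio.h&gt;

    int main(void)
    {
        unsigned long long delta = 50000000ULL; /* 50ms wall-clock */
        unsigned long long freq_scale = 100;    /* assumed factors, */
        unsigned long long cap_scale  = 50;     /* fixed-point shift 10 */

        delta = (delta * freq_scale) &gt;&gt; 10;     /* 4,882,812 ns */
        delta = (delta * cap_scale)  &gt;&gt; 10;     /* 238,418 ns */

        printf("accounted %llu ns per 50ms of real time\n", delta);
        return 0;
    }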

This smaller "accounted runtime" value is what ends up being
subtracted against the fair-server's runtime for the current period.
Thus after 50ms of real time, we've only accounted ~238us against the
fair servers runtime. This 209:1 ratio in this example means that on
the smaller cpu the fair server is allowed to continue running,
blocking RT tasks, for over 10 seconds before it exhausts its supposed
50ms of runtime.  And on other hardware configurations it can be even
worse.

For the fair deadline_server, to prevent realtime tasks from being
unexpectedly delayed, we really do want to use fixed time, and not
scaled time for smaller capacity/frequency cpus. So remove the scaling
from the fair server's accounting to fix this.

Fixes: a110a81c52a9 ("sched/deadline: Deferrable dl server")
Suggested-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Suggested-by: John Stultz &lt;jstultz@google.com&gt;
Signed-off-by: kuyo chang &lt;kuyo.chang@mediatek.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Juri Lelli &lt;juri.lelli@redhat.com&gt;
Acked-by: John Stultz &lt;jstultz@google.com&gt;
Tested-by: John Stultz &lt;jstultz@google.com&gt;
Link: https://lore.kernel.org/r/20250702021440.2594736-1-kuyo.chang@mediatek.com
</content>
</entry>
<entry>
<title>sched/core: Fix migrate_swap() vs. hotplug</title>
<updated>2025-07-01T13:02:03Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2025-06-05T10:00:09Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=009836b4fa52f92cba33618e773b1094affa8cd2'/>
<id>urn:sha1:009836b4fa52f92cba33618e773b1094affa8cd2</id>
<content type='text'>
On Mon, Jun 02, 2025 at 03:22:13PM +0800, Kuyo Chang wrote:

&gt; So, the potential race scenario is:
&gt;
&gt; 	CPU0							CPU1
&gt; 	// doing migrate_swap(cpu0/cpu1)
&gt; 	stop_two_cpus()
&gt; 							  ...
&gt; 							 // doing _cpu_down()
&gt; 							      sched_cpu_deactivate()
&gt; 								set_cpu_active(cpu, false);
&gt; 								balance_push_set(cpu, true);
&gt; 	cpu_stop_queue_two_works
&gt; 	    __cpu_stop_queue_work(stopper1,...);
&gt; 	    __cpu_stop_queue_work(stopper2,..);
&gt; 	stop_cpus_in_progress -&gt; true
&gt; 		preempt_enable();
&gt; 								...
&gt; 							1st balance_push
&gt; 							stop_one_cpu_nowait
&gt; 							cpu_stop_queue_work
&gt; 							__cpu_stop_queue_work
&gt; 							list_add_tail  -&gt; 1st add push_work
&gt; 							wake_up_q(&amp;wakeq);  -&gt; "wakeq is empty.
&gt; 										This implies that the stopper is at wakeq@migrate_swap."
&gt; 	preempt_disable
&gt; 	wake_up_q(&amp;wakeq);
&gt; 	        wake_up_process // wakeup migrate/0
&gt; 		    try_to_wake_up
&gt; 		        ttwu_queue
&gt; 		            ttwu_queue_cond -&gt;meet below case
&gt; 		                if (cpu == smp_processor_id())
&gt; 			         return false;
&gt; 			ttwu_do_activate
&gt; 			//migrate/0 wakeup done
&gt; 		wake_up_process // wakeup migrate/1
&gt; 	           try_to_wake_up
&gt; 		    ttwu_queue
&gt; 			ttwu_queue_cond
&gt; 		        ttwu_queue_wakelist
&gt; 			__ttwu_queue_wakelist
&gt; 			__smp_call_single_queue
&gt; 	preempt_enable();
&gt;
&gt; 							2nd balance_push
&gt; 							stop_one_cpu_nowait
&gt; 							cpu_stop_queue_work
&gt; 							__cpu_stop_queue_work
&gt; 							list_add_tail  -&gt; 2nd add push_work, so the double list add is detected
&gt; 							...
&gt; 							...
&gt; 							cpu1 get ipi, do sched_ttwu_pending, wakeup migrate/1
&gt;

So this balance_push() is part of schedule(), and schedule() is supposed
to switch to the stopper task, but because of this race condition, the
stopper task is stuck in WAKING state and not actually visible to be
picked.

Therefore CPU1 can do another schedule() and end up doing another
balance_push() even though the last one hasn't been done yet.

This is a confluence of fail, where both wake_q and ttwu_wakelist can
cause crucial wakeups to be delayed, resulting in the malfunction of
balance_push.

Since there is only a single stopper thread to be woken, the wake_q
doesn't really add anything here, and can be removed in favour of
direct wakeups of the stopper thread.

Then add a clause to ttwu_queue_cond() to ensure the stopper threads
are never queued / delayed.
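
A sketch of the added clause (the exact predicate in ttwu_queue_cond()
may be spelled differently):

    /* never defer a stopper wakeup through the wakelist */
    if (p-&gt;sched_class == &amp;stop_sched_class)
        return false;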

Of all 3 moving parts, the last addition was the balance_push()
machinery, so pick that as the point the bug was introduced.

Fixes: 2558aacff858 ("sched/hotplug: Ensure only per-cpu kthreads run during hotplug")
Reported-by: Kuyo Chang &lt;kuyo.chang@mediatek.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Tested-by: Kuyo Chang &lt;kuyo.chang@mediatek.com&gt;
Link: https://lkml.kernel.org/r/20250605100009.GO39944@noisy.programming.kicks-ass.net
</content>
</entry>
<entry>
<title>sched: Fix preemption string of preempt_dynamic_none</title>
<updated>2025-07-01T13:02:02Z</updated>
<author>
<name>Thomas Weißschuh</name>
<email>thomas.weissschuh@linutronix.de</email>
</author>
<published>2025-06-26T09:23:44Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=3ebb1b6522392f64902b4e96954e35927354aa27'/>
<id>urn:sha1:3ebb1b6522392f64902b4e96954e35927354aa27</id>
<content type='text'>
Zero is a valid value for "preempt_dynamic_mode", namely
"preempt_dynamic_none".

Fix the off-by-one in preempt_model_str(), so that "preempt_dynamic_none"
is correctly formatted as PREEMPT(none) instead of PREEMPT(undef).
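
A sketch of the off-by-one (the exact expression in preempt_model_str()
may differ; preempt_modes[] is the mode-name table): mode 0 is valid,
so the guard must accept it:

    /* before (sketch): mode 0 fell through to "undef" */
    mode_str = preempt_dynamic_mode &gt; 0 ?
               preempt_modes[preempt_dynamic_mode] : "undef";
    /* after (sketch): &gt;= 0 accepts preempt_dynamic_none */
    mode_str = preempt_dynamic_mode &gt;= 0 ?
               preempt_modes[preempt_dynamic_mode] : "undef";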

Fixes: 8bdc5daaa01e ("sched: Add a generic function to return the preemption string")
Signed-off-by: Thomas Weißschuh &lt;thomas.weissschuh@linutronix.de&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Sebastian Andrzej Siewior &lt;bigeasy@linutronix.de&gt;
Tested-by: Shrikanth Hegde &lt;sshegde@linux.ibm.com&gt;
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20250626-preempt-str-none-v2-1-526213b70a89@linutronix.de
</content>
</entry>
<entry>
<title>sched_ext, sched/core: Don't call scx_group_set_weight() prematurely from sched_create_group()</title>
<updated>2025-06-17T18:19:55Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2025-06-16T20:13:25Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=33796b91871ad4010c8188372dd1faf97cf0f1c0'/>
<id>urn:sha1:33796b91871ad4010c8188372dd1faf97cf0f1c0</id>
<content type='text'>
During task_group creation, sched_create_group() calls
scx_group_set_weight() with CGROUP_WEIGHT_DFL to initialize the sched_ext
portion. This is premature and ends up calling ops.cgroup_set_weight() with
an incorrect @cgrp before ops.cgroup_init() is called.

sched_create_group() should just initialize SCX related fields in the new
task_group. Fix it by factoring out scx_tg_init() from sched_init() and
making sched_create_group() call that function instead of
scx_group_set_weight().
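
A sketch of the factored-out initializer (the field name below is
illustrative, not the exact task_group layout):

    /* set the default weight without calling into ops */
    static void scx_tg_init(struct task_group *tg)
    {
        tg-&gt;scx_weight = CGROUP_WEIGHT_DFL;  /* field name assumed */
    }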

v2: Retain CONFIG_EXT_GROUP_SCHED ifdef in sched_init() as removing it leads
    to build failures on !CONFIG_GROUP_SCHED configs.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Fixes: 819513666966 ("sched_ext: Add cgroup support")
Cc: stable@vger.kernel.org # v6.12+
</content>
</entry>
</feed>
