<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/kernel/sched, branch v6.19</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v6.19</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v6.19'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2026-02-07T17:10:42Z</updated>
<entry>
<title>Merge tag 'sched-urgent-2026-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2026-02-07T17:10:42Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-02-07T17:10:42Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=dda5df9823630a26ed24ca9150b33a7f56ba4546'/>
<id>urn:sha1:dda5df9823630a26ed24ca9150b33a7f56ba4546</id>
<content type='text'>
Pull scheduler fixes from Ingo Molnar:
 "Miscellaneous MMCID fixes to address bugs and performance regressions
  in the recent rewrite of the SCHED_MM_CID management code:

   - Fix livelock triggered by BPF CI testing

   - Fix hard lockup on weakly ordered systems

   - Simplify the dropping of CIDs in the exit path by removing an
     unintended transition phase

   - Fix performance/scalability regression on a thread-pool benchmark
     by optimizing transitional CIDs when scheduling out"

* tag 'sched-urgent-2026-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/mmcid: Optimize transitional CIDs when scheduling out
  sched/mmcid: Drop per CPU CID immediately when switching to per task mode
  sched/mmcid: Protect transition on weakly ordered systems
  sched/mmcid: Prevent live lock on task to CPU mode transition
</content>
</entry>
<entry>
<title>Merge tag 'sched_ext-for-6.19-rc8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext</title>
<updated>2026-02-04T23:11:24Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-02-04T23:11:24Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=3c7b4d1994f63d6fa3984d7d5ad06dbaad96f167'/>
<id>urn:sha1:3c7b4d1994f63d6fa3984d7d5ad06dbaad96f167</id>
<content type='text'>
Pull sched_ext fix from Tejun Heo:

 - Fix race where sched_class operations (sched_setscheduler() and
   friends) could be invoked on dead tasks after sched_ext_dead()
   already ran, causing invalid SCX task state transitions and NULL
   pointer dereferences.

   This was a regression from the cgroup exit ordering fix which
   moved sched_ext_free() to finish_task_switch().

* tag 'sched_ext-for-6.19-rc8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Short-circuit sched_class operations on dead tasks
</content>
</entry>
<entry>
<title>sched_ext: Short-circuit sched_class operations on dead tasks</title>
<updated>2026-02-04T22:22:11Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2026-02-04T20:07:55Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=0eca95cba2b7bf7b7b4f2fa90734a85fcaa72782'/>
<id>urn:sha1:0eca95cba2b7bf7b7b4f2fa90734a85fcaa72782</id>
<content type='text'>
7900aa699c34 ("sched_ext: Fix cgroup exit ordering by moving sched_ext_free()
to finish_task_switch()") moved sched_ext_free() to finish_task_switch() and
renamed it to sched_ext_dead() to fix cgroup exit ordering issues. However,
this created a race window where certain sched_class ops may be invoked on
dead tasks leading to failures - e.g. sched_setscheduler() may try to switch a
task which finished sched_ext_dead() back into SCX triggering invalid SCX task
state transitions.

Add task_dead_and_done() which tests whether a task is TASK_DEAD and has
completed its final context switch, and use it to short-circuit sched_class
operations which may be called on dead tasks.

Fixes: 7900aa699c34 ("sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()")
Reported-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Link: http://lkml.kernel.org/r/20260202151341.796959-1-arighi@nvidia.com
Reviewed-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/mmcid: Optimize transitional CIDs when scheduling out</title>
<updated>2026-02-04T11:21:12Z</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@kernel.org</email>
</author>
<published>2026-02-02T09:39:55Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=4463c7aa11a6e67169ae48c6804968960c4bffea'/>
<id>urn:sha1:4463c7aa11a6e67169ae48c6804968960c4bffea</id>
<content type='text'>
During the investigation of the various transition mode issues
instrumentation revealed that the amount of bitmap operations can be
significantly reduced when a task with a transitional CID schedules out
after the fixup function completed and disabled the transition mode.

At that point the mode is stable and therefore it is not required to drop
the transitional CID back into the pool. As the fixup is complete the
potential exhaustion of the CID pool is not longer possible, so the CID can
be transferred to the scheduling out task or to the CPU depending on the
current ownership mode.

The racy snapshot of mm_cid::mode which contains both the ownership state
and the transition bit is valid because runqueue lock is held and the fixup
function of a concurrent mode switch is serialized.

Assigning the ownership right there not only spares the bitmap access for
dropping the CID it also avoids it when the task is scheduled back in as it
directly hits the fast path in both modes when the CID is within the
optimal range. If it's outside the range the next schedule in will need to
converge so dropping it right away is sensible. In the good case this also
allows to go into the fast path on the next schedule in operation.

With a thread pool benchmark which is configured to cross the mode switch
boundaries frequently this reduces the number of bitmap operations by about
30% and increases the fastpath utilization in the low single digit
percentage range.

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Mathieu Desnoyers &lt;mathieu.desnoyers@efficios.com&gt;
Link: https://patch.msgid.link/20260201192835.100194627@kernel.org
</content>
</entry>
<entry>
<title>sched/mmcid: Drop per CPU CID immediately when switching to per task mode</title>
<updated>2026-02-04T11:21:12Z</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@kernel.org</email>
</author>
<published>2026-02-02T09:39:50Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=007d84287c7466ca68a5809b616338214dc5b77b'/>
<id>urn:sha1:007d84287c7466ca68a5809b616338214dc5b77b</id>
<content type='text'>
When a exiting task initiates the switch from per CPU back to per task
mode, it has already dropped its CID and marked itself inactive. But a
leftover from an earlier iteration of the rework then reassigns the per
CPU CID to the exiting task with the transition bit set.

That's wrong as the task is already marked CID inactive, which means it is
inconsistent state. It's harmless because the CID is marked in transit and
therefore dropped back into the pool when the exiting task schedules out
either through preemption or the final schedule().

Simply drop the per CPU CID when the exiting task triggered the transition.

Fixes: fbd0e71dc370 ("sched/mmcid: Provide CID ownership mode fixup functions")
Signed-off-by: Thomas Gleixner &lt;tglx@kernel.org&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Mathieu Desnoyers &lt;mathieu.desnoyers@efficios.com&gt;
Link: https://patch.msgid.link/20260201192835.032221009@kernel.org
</content>
</entry>
<entry>
<title>sched/mmcid: Protect transition on weakly ordered systems</title>
<updated>2026-02-04T11:21:12Z</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@kernel.org</email>
</author>
<published>2026-02-02T09:39:45Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=47ee94efccf6732e4ef1a815c451aacaf1464757'/>
<id>urn:sha1:47ee94efccf6732e4ef1a815c451aacaf1464757</id>
<content type='text'>
Shrikanth reported a hard lockup which he observed once. The stack trace
shows the following CID related participants:

  watchdog: CPU 23 self-detected hard LOCKUP @ mm_get_cid+0xe8/0x188
  NIP: mm_get_cid+0xe8/0x188
  LR:  mm_get_cid+0x108/0x188
   mm_cid_switch_to+0x3c4/0x52c
   __schedule+0x47c/0x700
   schedule_idle+0x3c/0x64
   do_idle+0x160/0x1b0
   cpu_startup_entry+0x48/0x50
   start_secondary+0x284/0x288
   start_secondary_prolog+0x10/0x14

  watchdog: CPU 11 self-detected hard LOCKUP @ plpar_hcall_norets_notrace+0x18/0x2c
  NIP: plpar_hcall_norets_notrace+0x18/0x2c
  LR:  queued_spin_lock_slowpath+0xd88/0x15d0
   _raw_spin_lock+0x80/0xa0
   raw_spin_rq_lock_nested+0x3c/0xf8
   mm_cid_fixup_cpus_to_tasks+0xc8/0x28c
   sched_mm_cid_exit+0x108/0x22c
   do_exit+0xf4/0x5d0
   make_task_dead+0x0/0x178
   system_call_exception+0x128/0x390
   system_call_vectored_common+0x15c/0x2ec

The task on CPU11 is running the CID ownership mode change fixup function
and is stuck on a runqueue lock. The task on CPU23 is trying to get a CID
from the pool with the same runqueue lock held, but the pool is empty.

After decoding a similar issue in the opposite direction switching from per
task to per CPU mode the tool which models the possible scenarios failed to
come up with a similar loop hole.

This showed up only once, was not reproducible and according to tooling not
related to a overlooked scheduling scenario permutation. But the fact that
it was observed on a PowerPC system gave the right hint: PowerPC is a
weakly ordered architecture.

The transition mechanism does:

    WRITE_ONCE(mm-&gt;mm_cid.transit, MM_CID_TRANSIT);
    WRITE_ONCE(mm-&gt;mm_cid.percpu, new_mode);

    fixup()

    WRITE_ONCE(mm-&gt;mm_cid.transit, 0);

mm_cid_schedin() does:

    if (!READ_ONCE(mm-&gt;mm_cid.percpu))
       ...
       cid |= READ_ONCE(mm-&gt;mm_cid.transit);

so weakly ordered systems can observe percpu == false and transit == 0 even
if the fixup function has not yet completed. As a consequence the task will
not drop the CID when scheduling out before the fixup is completed, which
means the CID space can be exhausted and the next task scheduling in will
loop in mm_get_cid() and the fixup thread can livelock on the held runqueue
lock as above.

This could obviously be solved by using:
     smp_store_release(&amp;mm-&gt;mm_cid.percpu, true);
and
     smp_load_acquire(&amp;mm-&gt;mm_cid.percpu);

but that brings a memory barrier back into the scheduler hotpath, which was
just designed out by the CID rewrite.

That can be completely avoided by combining the per CPU mode and the
transit storage into a single mm_cid::mode member and ordering the stores
against the fixup functions to prevent the CPU from reordering them.

That makes the update of both states atomic and a concurrent read observes
always consistent state.

The price is an additional AND operation in mm_cid_schedin() to evaluate
the per CPU or the per task path, but that's in the noise even on strongly
ordered architectures as the actual load can be significantly more
expensive and the conditional branch evaluation is there anyway.

Fixes: fbd0e71dc370 ("sched/mmcid: Provide CID ownership mode fixup functions")
Closes: https://lore.kernel.org/bdfea828-4585-40e8-8835-247c6a8a76b0@linux.ibm.com
Reported-by: Shrikanth Hegde &lt;sshegde@linux.ibm.com&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@kernel.org&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Mathieu Desnoyers &lt;mathieu.desnoyers@efficios.com&gt;
Link: https://patch.msgid.link/20260201192834.965217106@kernel.org
</content>
</entry>
<entry>
<title>sched/mmcid: Prevent live lock on task to CPU mode transition</title>
<updated>2026-02-04T11:21:11Z</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@kernel.org</email>
</author>
<published>2026-02-02T09:39:40Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=4327fb13fa47770183c4850c35382c30ba5f939d'/>
<id>urn:sha1:4327fb13fa47770183c4850c35382c30ba5f939d</id>
<content type='text'>
Ihor reported a BPF CI failure which turned out to be a live lock in the
MM_CID management. The scenario is:

A test program creates the 5th thread, which means the MM_CID users become
more than the number of CPUs (four in this example), so it switches to per
CPU ownership mode.

At this point each live task of the program has a CID associated. Assume
thread creation order assignment for simplicity.

   T0     CID0  runs fork() and creates T4
   T1 	  CID1
   T2 	  CID2
   T3 	  CID3
   T4       ---   not visible yet

T0 sets mm_cid::percpu = true and transfers its own CID to CPU0 where it
runs on and then starts the fixup which walks through the threads to
transfer the per task CIDs either to the CPU the task is running on or drop
it back into the pool if the task is not on a CPU.

During that T1 - T3 are free to schedule in and out before the fixup caught
up with them. Going through all possible permutations with a python script
revealed a few problematic cases. The most trivial one is:

   T1 schedules in on CPU1 and observes percpu == true, so it transfers
      its CID to CPU1

   T1 is migrated to CPU2 and schedule in observes percpu == true, but
      CPU2 does not have a CID associated and T1 transferred its own to
      CPU1

      So it has to allocate one with CPU2 runqueue lock held, but the
      pool is empty, so it keeps looping in mm_get_cid().

Now T0 reaches T1 in the thread walk and tries to lock the corresponding
runqueue lock, which is held causing a full live lock.

There is a similar scenario in the reverse direction of switching from per
CPU to task mode which is way more obvious and got therefore addressed by
an intermediate mode. In this mode the CIDs are marked with MM_CID_TRANSIT,
which means that they are neither owned by the CPU nor by the task. When a
task schedules out with a transit CID it drops the CID back into the pool
making it available for others to use temporarily. Once the task which
initiated the mode switch finished the fixup it clears the transit mode and
the process goes back into per task ownership mode.

Unfortunately this insight was not mapped back to the task to CPU mode
switch as the above described scenario was not considered in the analysis.

Apply the same transit mechanism to the task to CPU mode switch to handle
these problematic cases correctly.

As with the CPU to task transition this results in a potential temporary
contention on the CID bitmap, but that's only for the time it takes to
complete the transition. After that it stays in steady mode which does not
touch the bitmap at all.

Fixes: fbd0e71dc370 ("sched/mmcid: Provide CID ownership mode fixup functions")
Closes: https://lore.kernel.org/2b7463d7-0f58-4e34-9775-6e2115cfb971@linux.dev
Reported-by: Ihor Solodrai &lt;ihor.solodrai@linux.dev&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@kernel.org&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Mathieu Desnoyers &lt;mathieu.desnoyers@efficios.com&gt;
Link: https://patch.msgid.link/20260201192834.897115238@kernel.org
</content>
</entry>
<entry>
<title>sched/deadline: Fix 'stuck' dl_server</title>
<updated>2026-01-30T22:06:06Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2026-01-30T12:41:00Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=115135422562e2f791e98a6f55ec57b2da3b3a95'/>
<id>urn:sha1:115135422562e2f791e98a6f55ec57b2da3b3a95</id>
<content type='text'>
Andrea reported the dl_server getting stuck for him. He tracked it
down to a state where dl_server_start() saw dl_defer_running==1, but
the dl_server's job is no longer valid at the time of
dl_server_start().

In the state diagram this corresponds to [4] D-&gt;A (or dl_server_stop()
due to no more runnable tasks) followed by [1], which in case of a
lapsed deadline must then be A-&gt;B.

Now our A has dl_defer_running==1, while B demands
dl_defer_running==0, therefore it must get cleared when the CBS wakeup
rules demand a replenish.

Fixes: a110a81c52a9 ("sched/deadline: Deferrable dl server")
Reported-by: Andrea Righi arighi@nvidia.com
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Juri Lelli &lt;juri.lelli@redhat.com&gt;
Tested-by: Andrea Righi arighi@nvidia.com
Link: https://lkml.kernel.org/r/20260123161645.2181752-1-arighi@nvidia.com
Link: https://patch.msgid.link/20260130124100.GC1079264@noisy.programming.kicks-ass.net
</content>
</entry>
<entry>
<title>sched/fair: Revert force wakeup preemption</title>
<updated>2026-01-23T10:53:20Z</updated>
<author>
<name>Vincent Guittot</name>
<email>vincent.guittot@linaro.org</email>
</author>
<published>2026-01-23T10:28:58Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=15257cc2f905dbf5813c0bfdd3c15885f28093c4'/>
<id>urn:sha1:15257cc2f905dbf5813c0bfdd3c15885f28093c4</id>
<content type='text'>
This agressively bypasses run_to_parity and slice protection with the
assumpiton that this is what waker wants but there is no garantee that
the wakee will be the next to run. It is a better choice to use
yield_to_task or WF_SYNC in such case.

This increases the number of resched and preemption because a task becomes
quickly "ineligible" when it runs; We update the task vruntime periodically
and before the task exhausted its slice or at least quantum.

Example:
2 tasks A and B wake up simultaneously with lag = 0. Both are
eligible. Task A runs 1st and wakes up task C. Scheduler updates task
A's vruntime which becomes greater than average runtime as all others
have a lag == 0 and didn't run yet. Now task A is ineligible because
it received more runtime than the other task but it has not yet
exhausted its slice nor a min quantum. We force preemption, disable
protection but Task B will run 1st not task C.

Sidenote, DELAY_ZERO increases this effect by clearing positive lag at
wake up.

Fixes: e837456fdca8 ("sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals")
Signed-off-by: Vincent Guittot &lt;vincent.guittot@linaro.org&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://patch.msgid.link/20260123102858.52428-1-vincent.guittot@linaro.org
</content>
</entry>
<entry>
<title>sched/fair: Disable scheduler feature NEXT_BUDDY</title>
<updated>2026-01-23T10:53:19Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@techsingularity.net</email>
</author>
<published>2026-01-20T11:33:35Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=4f70f106bca1a56bd66d00830ac91680bd754974'/>
<id>urn:sha1:4f70f106bca1a56bd66d00830ac91680bd754974</id>
<content type='text'>
NEXT_BUDDY was disabled with the introduction of EEVDF and enabled again
after NEXT_BUDDY was rewritten for EEVDF by commit e837456fdca8 ("sched/fair:
Reimplement NEXT_BUDDY to align with EEVDF goals"). It was not expected
that this would be a universal win without a crystal ball instruction
but the reported regressions are a concern [1][2] even if gains were
also reported. Specifically;

o mysql with client/server running on different servers regresses
o specjbb reports lower peak metrics
o daytrader regresses

The mysql is realistic and a concern. It needs to be confirmed if
specjbb is simply shifting the point where peak performance is measured
but still a concern. daytrader is considered to be representative of a
real workload.

Access to test machines is currently problematic for verifying any fix to
this problem. Disable NEXT_BUDDY for now by default until the root causes
are addressed.

Signed-off-by: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Tested-by: Madadi Vineeth Reddy &lt;vineethr@linux.ibm.com&gt;
Link: https://lore.kernel.org/lkml/4b96909a-f1ac-49eb-b814-97b8adda6229@arm.com [1]
Link: https://lore.kernel.org/lkml/ec3ea66f-3a0d-4b5a-ab36-ce778f159b5b@linux.ibm.com [2]
Link: https://patch.msgid.link/fyqsk63pkoxpeaclyqsm5nwtz3dyejplr7rg6p74xwemfzdzuu@7m7xhs5aqpqw
</content>
</entry>
</feed>
