<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/kernel/sched/rt.c, branch v6.16</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v6.16</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v6.16'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2025-04-08T18:55:55Z</updated>
<entry>
<title>sched/rt: Fix race in push_rt_task</title>
<updated>2025-04-08T18:55:55Z</updated>
<author>
<name>Harshit Agarwal</name>
<email>harshit@nutanix.com</email>
</author>
<published>2025-02-25T18:05:53Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=690e47d1403e90b7f2366f03b52ed3304194c793'/>
<id>urn:sha1:690e47d1403e90b7f2366f03b52ed3304194c793</id>
<content type='text'>
Overview
========
When a CPU chooses to call push_rt_task and picks a task to push to
another CPU's runqueue then it will call find_lock_lowest_rq method
which would take a double lock on both CPUs' runqueues. If one of the
locks aren't readily available, it may lead to dropping the current
runqueue lock and reacquiring both the locks at once. During this window
it is possible that the task is already migrated and is running on some
other CPU. These cases are already handled. However, if the task is
migrated and has already been executed and another CPU is now trying to
wake it up (ttwu) such that it is queued again on the runqeue
(on_rq is 1) and also if the task was run by the same CPU, then the
current checks will pass even though the task was migrated out and is no
longer in the pushable tasks list.

Crashes
=======
This bug resulted in quite a few flavors of crashes triggering kernel
panics with various crash signatures such as assert failures, page
faults, null pointer dereferences, and queue corruption errors all
coming from scheduler itself.

Some of the crashes:
-&gt; kernel BUG at kernel/sched/rt.c:1616! BUG_ON(idx &gt;= MAX_RT_PRIO)
   Call Trace:
   ? __die_body+0x1a/0x60
   ? die+0x2a/0x50
   ? do_trap+0x85/0x100
   ? pick_next_task_rt+0x6e/0x1d0
   ? do_error_trap+0x64/0xa0
   ? pick_next_task_rt+0x6e/0x1d0
   ? exc_invalid_op+0x4c/0x60
   ? pick_next_task_rt+0x6e/0x1d0
   ? asm_exc_invalid_op+0x12/0x20
   ? pick_next_task_rt+0x6e/0x1d0
   __schedule+0x5cb/0x790
   ? update_ts_time_stats+0x55/0x70
   schedule_idle+0x1e/0x40
   do_idle+0x15e/0x200
   cpu_startup_entry+0x19/0x20
   start_secondary+0x117/0x160
   secondary_startup_64_no_verify+0xb0/0xbb

-&gt; BUG: kernel NULL pointer dereference, address: 00000000000000c0
   Call Trace:
   ? __die_body+0x1a/0x60
   ? no_context+0x183/0x350
   ? __warn+0x8a/0xe0
   ? exc_page_fault+0x3d6/0x520
   ? asm_exc_page_fault+0x1e/0x30
   ? pick_next_task_rt+0xb5/0x1d0
   ? pick_next_task_rt+0x8c/0x1d0
   __schedule+0x583/0x7e0
   ? update_ts_time_stats+0x55/0x70
   schedule_idle+0x1e/0x40
   do_idle+0x15e/0x200
   cpu_startup_entry+0x19/0x20
   start_secondary+0x117/0x160
   secondary_startup_64_no_verify+0xb0/0xbb

-&gt; BUG: unable to handle page fault for address: ffff9464daea5900
   kernel BUG at kernel/sched/rt.c:1861! BUG_ON(rq-&gt;cpu != task_cpu(p))

-&gt; kernel BUG at kernel/sched/rt.c:1055! BUG_ON(!rq-&gt;nr_running)
   Call Trace:
   ? __die_body+0x1a/0x60
   ? die+0x2a/0x50
   ? do_trap+0x85/0x100
   ? dequeue_top_rt_rq+0xa2/0xb0
   ? do_error_trap+0x64/0xa0
   ? dequeue_top_rt_rq+0xa2/0xb0
   ? exc_invalid_op+0x4c/0x60
   ? dequeue_top_rt_rq+0xa2/0xb0
   ? asm_exc_invalid_op+0x12/0x20
   ? dequeue_top_rt_rq+0xa2/0xb0
   dequeue_rt_entity+0x1f/0x70
   dequeue_task_rt+0x2d/0x70
   __schedule+0x1a8/0x7e0
   ? blk_finish_plug+0x25/0x40
   schedule+0x3c/0xb0
   futex_wait_queue_me+0xb6/0x120
   futex_wait+0xd9/0x240
   do_futex+0x344/0xa90
   ? get_mm_exe_file+0x30/0x60
   ? audit_exe_compare+0x58/0x70
   ? audit_filter_rules.constprop.26+0x65e/0x1220
   __x64_sys_futex+0x148/0x1f0
   do_syscall_64+0x30/0x80
   entry_SYSCALL_64_after_hwframe+0x62/0xc7

-&gt; BUG: unable to handle page fault for address: ffff8cf3608bc2c0
   Call Trace:
   ? __die_body+0x1a/0x60
   ? no_context+0x183/0x350
   ? spurious_kernel_fault+0x171/0x1c0
   ? exc_page_fault+0x3b6/0x520
   ? plist_check_list+0x15/0x40
   ? plist_check_list+0x2e/0x40
   ? asm_exc_page_fault+0x1e/0x30
   ? _cond_resched+0x15/0x30
   ? futex_wait_queue_me+0xc8/0x120
   ? futex_wait+0xd9/0x240
   ? try_to_wake_up+0x1b8/0x490
   ? futex_wake+0x78/0x160
   ? do_futex+0xcd/0xa90
   ? plist_check_list+0x15/0x40
   ? plist_check_list+0x2e/0x40
   ? plist_del+0x6a/0xd0
   ? plist_check_list+0x15/0x40
   ? plist_check_list+0x2e/0x40
   ? dequeue_pushable_task+0x20/0x70
   ? __schedule+0x382/0x7e0
   ? asm_sysvec_reschedule_ipi+0xa/0x20
   ? schedule+0x3c/0xb0
   ? exit_to_user_mode_prepare+0x9e/0x150
   ? irqentry_exit_to_user_mode+0x5/0x30
   ? asm_sysvec_reschedule_ipi+0x12/0x20

Above are some of the common examples of the crashes that were observed
due to this issue.

Details
=======
Let's look at the following scenario to understand this race.

1) CPU A enters push_rt_task
  a) CPU A has chosen next_task = task p.
  b) CPU A calls find_lock_lowest_rq(Task p, CPU Z’s rq).
  c) CPU A identifies CPU X as a destination CPU (X &lt; Z).
  d) CPU A enters double_lock_balance(CPU Z’s rq, CPU X’s rq).
  e) Since X is lower than Z, CPU A unlocks CPU Z’s rq. Someone else has
     locked CPU X’s rq, and thus, CPU A must wait.

2) At CPU Z
  a) Previous task has completed execution and thus, CPU Z enters
     schedule, locks its own rq after CPU A releases it.
  b) CPU Z dequeues previous task and begins executing task p.
  c) CPU Z unlocks its rq.
  d) Task p yields the CPU (ex. by doing IO or waiting to acquire a
     lock) which triggers the schedule function on CPU Z.
  e) CPU Z enters schedule again, locks its own rq, and dequeues task p.
  f) As part of dequeue, it sets p.on_rq = 0 and unlocks its rq.

3) At CPU B
  a) CPU B enters try_to_wake_up with input task p.
  b) Since CPU Z dequeued task p, p.on_rq = 0, and CPU B updates
     B.state = WAKING.
  c) CPU B via select_task_rq determines CPU Y as the target CPU.

4) The race
  a) CPU A acquires CPU X’s lock and relocks CPU Z.
  b) CPU A reads task p.cpu = Z and incorrectly concludes task p is
     still on CPU Z.
  c) CPU A failed to notice task p had been dequeued from CPU Z while
     CPU A was waiting for locks in double_lock_balance. If CPU A knew
     that task p had been dequeued, it would return NULL forcing
     push_rt_task to give up the task p's migration.
  d) CPU B updates task p.cpu = Y and calls ttwu_queue.
  e) CPU B locks Ys rq. CPU B enqueues task p onto Y and sets task
     p.on_rq = 1.
  f) CPU B unlocks CPU Y, triggering memory synchronization.
  g) CPU A reads task p.on_rq = 1, cementing its assumption that task p
     has not migrated.
  h) CPU A decides to migrate p to CPU X.

This leads to A dequeuing p from Y's queue and various crashes down the
line.

Solution
========
The solution here is fairly simple. After obtaining the lock (at 4a),
the check is enhanced to make sure that the task is still at the head of
the pushable tasks list. If not, then it is anyway not suitable for
being pushed out.

Testing
=======
The fix is tested on a cluster of 3 nodes, where the panics due to this
are hit every couple of days. A fix similar to this was deployed on such
cluster and was stable for more than 30 days.

Co-developed-by: Jon Kohler &lt;jon@nutanix.com&gt;
Signed-off-by: Jon Kohler &lt;jon@nutanix.com&gt;
Co-developed-by: Gauri Patwardhan &lt;gauri.patwardhan@nutanix.com&gt;
Signed-off-by: Gauri Patwardhan &lt;gauri.patwardhan@nutanix.com&gt;
Co-developed-by: Rahul Chunduru &lt;rahul.chunduru@nutanix.com&gt;
Signed-off-by: Rahul Chunduru &lt;rahul.chunduru@nutanix.com&gt;
Signed-off-by: Harshit Agarwal &lt;harshit@nutanix.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: "Steven Rostedt (Google)" &lt;rostedt@goodmis.org&gt;
Reviewed-by: Phil Auld &lt;pauld@redhat.com&gt;
Tested-by: Will Ton &lt;william.ton@nutanix.com&gt;
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250225180553.167995-1-harshit@nutanix.com
</content>
</entry>
<entry>
<title>sched: Add RT_GROUP WARN checks for non-root task_groups</title>
<updated>2025-04-08T18:55:54Z</updated>
<author>
<name>Michal Koutný</name>
<email>mkoutny@suse.com</email>
</author>
<published>2025-03-10T17:04:40Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=87f1fb77d87a6dac9968a321bb10799ae6d2039c'/>
<id>urn:sha1:87f1fb77d87a6dac9968a321bb10799ae6d2039c</id>
<content type='text'>
With CONFIG_RT_GROUP_SCHED but runtime disabling of RT_GROUPs we expect
the existence of the root task_group only and all rt_sched_entity'ies
should be queued on root's rt_rq.

If we get a non-root RT_GROUP something went wrong.

Signed-off-by: Michal Koutný &lt;mkoutny@suse.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lkml.kernel.org/r/20250310170442.504716-9-mkoutny@suse.com
</content>
</entry>
<entry>
<title>sched: Do not construct nor expose RT_GROUP_SCHED structures if disabled</title>
<updated>2025-04-08T18:55:54Z</updated>
<author>
<name>Michal Koutný</name>
<email>mkoutny@suse.com</email>
</author>
<published>2025-03-10T17:04:39Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=d6809c2f606c14f9e95be87d75a576901d2fa050'/>
<id>urn:sha1:d6809c2f606c14f9e95be87d75a576901d2fa050</id>
<content type='text'>
Thanks to kernel cmdline being available early, before any
cgroup hierarchy exists, we can achieve the RT_GROUP_SCHED boottime
disabling goal by simply skipping any creation (and destruction) of
RT_GROUP data and its exposure via RT attributes.

We can do this thanks to previously placed runtime guards that would
redirect all operations to root_task_group's data when RT_GROUP_SCHED
disabled.

Signed-off-by: Michal Koutný &lt;mkoutny@suse.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lkml.kernel.org/r/20250310170442.504716-8-mkoutny@suse.com
</content>
</entry>
<entry>
<title>sched: Bypass bandwitdh checks with runtime disabled RT_GROUP_SCHED</title>
<updated>2025-04-08T18:55:54Z</updated>
<author>
<name>Michal Koutný</name>
<email>mkoutny@suse.com</email>
</author>
<published>2025-03-10T17:04:38Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=277e0909754e9f3c82def97150d2f3ea700098f1'/>
<id>urn:sha1:277e0909754e9f3c82def97150d2f3ea700098f1</id>
<content type='text'>
When RT_GROUPs are compiled but not exposed, their bandwidth cannot
be configured (and it is not initialized for non-root task_groups neither).
Therefore bypass any checks of task vs task_group bandwidth.

This will achieve behavior very similar to setups that have
!CONFIG_RT_GROUP_SCHED and attach cpu controller to cgroup v2 hierarchy.
(On a related note, this may allow having RT tasks with
CONFIG_RT_GROUP_SCHED and cgroup v2 hierarchy.)

Signed-off-by: Michal Koutný &lt;mkoutny@suse.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lkml.kernel.org/r/20250310170442.504716-7-mkoutny@suse.com
</content>
</entry>
<entry>
<title>sched: Skip non-root task_groups with disabled RT_GROUP_SCHED</title>
<updated>2025-04-08T18:55:53Z</updated>
<author>
<name>Michal Koutný</name>
<email>mkoutny@suse.com</email>
</author>
<published>2025-03-10T17:04:37Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=61d3164fec2ed283645dc17fcc51959e8f361e18'/>
<id>urn:sha1:61d3164fec2ed283645dc17fcc51959e8f361e18</id>
<content type='text'>
First, we want to prevent placement of RT tasks on non-root rt_rqs which
we achieve in the task migration code that'd fall back to
root_task_group's rt_rq.

Second, we want to work with only root_task_group's rt_rq when iterating
all "real" rt_rqs when RT_GROUP is disabled. To achieve this we keep
root_task_group as the first one on the task_groups and break out
quickly.

Signed-off-by: Michal Koutný &lt;mkoutny@suse.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lkml.kernel.org/r/20250310170442.504716-6-mkoutny@suse.com
</content>
</entry>
<entry>
<title>sched: Always initialize rt_rq's task_group</title>
<updated>2025-04-08T18:55:53Z</updated>
<author>
<name>Michal Koutný</name>
<email>mkoutny@suse.com</email>
</author>
<published>2025-03-10T17:04:35Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=a5a25b32c08a31c03258ec4960bec26caaf76e9a'/>
<id>urn:sha1:a5a25b32c08a31c03258ec4960bec26caaf76e9a</id>
<content type='text'>
rt_rq-&gt;tg may be NULL which denotes the root task_group.
Store the pointer to root_task_group directly so that callers may use
rt_rq-&gt;tg homogenously.

root_task_group exists always with CONFIG_CGROUPS_SCHED,
CONFIG_RT_GROUP_SCHED depends on that.

This changes root level rt_rq's default limit from infinity to the
value of (originally) global RT throttling.

Signed-off-by: Michal Koutný &lt;mkoutny@suse.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lkml.kernel.org/r/20250310170442.504716-4-mkoutny@suse.com
</content>
</entry>
<entry>
<title>sched: Remove unneeed macro wrap</title>
<updated>2025-04-08T18:55:53Z</updated>
<author>
<name>Michal Koutný</name>
<email>mkoutny@suse.com</email>
</author>
<published>2025-03-10T17:04:34Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=e285313f0848157cc3c6827d233a2510167b50cf'/>
<id>urn:sha1:e285313f0848157cc3c6827d233a2510167b50cf</id>
<content type='text'>
rt_entity_is_task has split definitions based on CONFIG_RT_GROUP_SCHED,
therefore we can use it always. No functional change intended.

Signed-off-by: Michal Koutný &lt;mkoutny@suse.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lkml.kernel.org/r/20250310170442.504716-3-mkoutny@suse.com
</content>
</entry>
<entry>
<title>sched: Convert CONFIG_RT_GROUP_SCHED macros to code conditions</title>
<updated>2025-04-08T18:55:52Z</updated>
<author>
<name>Michal Koutný</name>
<email>mkoutny@suse.com</email>
</author>
<published>2025-03-10T17:04:33Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=433bce5dadb4ec3d5eda99c5125926c045b79005'/>
<id>urn:sha1:433bce5dadb4ec3d5eda99c5125926c045b79005</id>
<content type='text'>
Convert the blocks guarded by macros to regular code so that the RT
group code gets more compile validation. Reasoning is in
Documentation/process/coding-style.rst 21) Conditional Compilation.
With that, no functional change is expected.

Signed-off-by: Michal Koutný &lt;mkoutny@suse.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lkml.kernel.org/r/20250310170442.504716-2-mkoutny@suse.com
</content>
</entry>
<entry>
<title>Merge tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2025-03-25T17:54:15Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2025-03-25T17:54:15Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=a50b4fe095fb98e0b7da03b0a42fd1247284868e'/>
<id>urn:sha1:a50b4fe095fb98e0b7da03b0a42fd1247284868e</id>
<content type='text'>
Pull timer cleanups from Thomas Gleixner:
 "A treewide hrtimer timer cleanup

  hrtimers are initialized with hrtimer_init() and a subsequent store to
  the callback pointer. This turned out to be suboptimal for the
  upcoming Rust integration and is obviously a silly implementation to
  begin with.

  This cleanup replaces the hrtimer_init(T); T-&gt;function = cb; sequence
  with hrtimer_setup(T, cb);

  The conversion was done with Coccinelle and a few manual fixups.

  Once the conversion has completely landed in mainline, hrtimer_init()
  will be removed and the hrtimer::function becomes a private member"

* tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
  wifi: rt2x00: Switch to use hrtimer_update_function()
  io_uring: Use helper function hrtimer_update_function()
  serial: xilinx_uartps: Use helper function hrtimer_update_function()
  ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup()
  RDMA: Switch to use hrtimer_setup()
  virtio: mem: Switch to use hrtimer_setup()
  drm/vmwgfx: Switch to use hrtimer_setup()
  drm/xe/oa: Switch to use hrtimer_setup()
  drm/vkms: Switch to use hrtimer_setup()
  drm/msm: Switch to use hrtimer_setup()
  drm/i915/request: Switch to use hrtimer_setup()
  drm/i915/uncore: Switch to use hrtimer_setup()
  drm/i915/pmu: Switch to use hrtimer_setup()
  drm/i915/perf: Switch to use hrtimer_setup()
  drm/i915/gvt: Switch to use hrtimer_setup()
  drm/i915/huc: Switch to use hrtimer_setup()
  drm/amdgpu: Switch to use hrtimer_setup()
  stm class: heartbeat: Switch to use hrtimer_setup()
  i2c: Switch to use hrtimer_setup()
  iio: Switch to use hrtimer_setup()
  ...
</content>
</entry>
<entry>
<title>sched/debug: Make CONFIG_SCHED_DEBUG functionality unconditional</title>
<updated>2025-03-19T21:20:53Z</updated>
<author>
<name>Ingo Molnar</name>
<email>mingo@kernel.org</email>
</author>
<published>2025-03-17T10:42:54Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=dd5bdaf2b72da81d57f4f99e518af80002b6562e'/>
<id>urn:sha1:dd5bdaf2b72da81d57f4f99e518af80002b6562e</id>
<content type='text'>
All the big Linux distros enable CONFIG_SCHED_DEBUG, because
the various features it provides help not just with kernel
development, but with system administration and user-space
software development as well.

Reflect this reality and enable this functionality
unconditionally.

Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Tested-by: Shrikanth Hegde &lt;sshegde@linux.ibm.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Juri Lelli &lt;juri.lelli@redhat.com&gt;
Cc: Vincent Guittot &lt;vincent.guittot@linaro.org&gt;
Cc: Dietmar Eggemann &lt;dietmar.eggemann@arm.com&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Ben Segall &lt;bsegall@google.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Valentin Schneider &lt;vschneid@redhat.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: https://lore.kernel.org/r/20250317104257.3496611-4-mingo@kernel.org
</content>
</entry>
</feed>
