<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/kernel/sched, branch v6.14</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v6.14</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v6.14'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2025-03-21T15:48:40Z</updated>
<entry>
<title>Merge tag 'sched-urgent-2025-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2025-03-21T15:48:40Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2025-03-21T15:48:40Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=cb90c8df91d08aebb62ef77bd1c7f41a31bdc924'/>
<id>urn:sha1:cb90c8df91d08aebb62ef77bd1c7f41a31bdc924</id>
<content type='text'>
Pull scheduler fix from Ingo Molnar:
 "Revert a scheduler performance optimization that regressed other
  workloads"

* tag 'sched-urgent-2025-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Revert "sched/core: Reduce cost of sched_move_task when config autogroup"
</content>
</entry>
<entry>
<title>Revert "sched/core: Reduce cost of sched_move_task when config autogroup"</title>
<updated>2025-03-15T09:34:27Z</updated>
<author>
<name>Dietmar Eggemann</name>
<email>dietmar.eggemann@arm.com</email>
</author>
<published>2025-03-14T15:13:45Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=76f970ce51c80f625eb6ddbb24e9cb51b977b598'/>
<id>urn:sha1:76f970ce51c80f625eb6ddbb24e9cb51b977b598</id>
<content type='text'>
This reverts commit eff6c8ce8d4d7faef75f66614dd20bb50595d261.

Hazem reported a 30% drop in UnixBench spawn test with commit
eff6c8ce8d4d ("sched/core: Reduce cost of sched_move_task when config
autogroup") on a m6g.xlarge AWS EC2 instance with 4 vCPUs and 16 GiB RAM
(aarch64) (single level MC sched domain):

  https://lkml.kernel.org/r/20250205151026.13061-1-hagarhem@amazon.com

There is an early bail from sched_move_task() if p-&gt;sched_task_group is
equal to p's 'cpu cgroup' (sched_get_task_group()). E.g. both are
pointing to taskgroup '/user.slice/user-1000.slice/session-1.scope'
(Ubuntu '22.04.5 LTS').

So in:

  do_exit()

    sched_autogroup_exit_task()

      sched_move_task()

        if sched_get_task_group(p) == p-&gt;sched_task_group
          return

        /* p is enqueued */
        dequeue_task()              \
        sched_change_group()        |
          task_change_group_fair()  |
            detach_task_cfs_rq()    |                              (1)
            set_task_rq()           |
            attach_task_cfs_rq()    |
        enqueue_task()              /

(1) isn't called for p anymore.

Turns out that the regression is related to sgs-&gt;group_util in
group_is_overloaded() and group_has_capacity(). If (1) isn't called for
all the 'spawn' tasks then sgs-&gt;group_util is ~900 and
sgs-&gt;group_capacity = 1024 (single CPU sched domain) and this leads to
group_is_overloaded() returning true (2) and group_has_capacity() false
(3) much more often compared to the case when (1) is called.

I.e. there are much more cases of 'group_is_overloaded' and
'group_fully_busy' in WF_FORK wakeup sched_balance_find_dst_cpu() which
then returns much more often a CPU != smp_processor_id() (5).

This isn't good for these extremely short running tasks (FORK + EXIT)
and also involves calling sched_balance_find_dst_group_cpu() unnecessary
(single CPU sched domain).

Instead if (1) is called for 'p-&gt;flags &amp; PF_EXITING' then the path
(4),(6) is taken much more often.

  select_task_rq_fair(..., wake_flags = WF_FORK)

    cpu = smp_processor_id()

    new_cpu = sched_balance_find_dst_cpu(..., cpu, ...)

      group = sched_balance_find_dst_group(..., cpu)

        do {

          update_sg_wakeup_stats()

            sgs-&gt;group_type = group_classify()

              if group_is_overloaded()                             (2)
                return group_overloaded

              if !group_has_capacity()                             (3)
                return group_fully_busy

              return group_has_spare                               (4)

        } while group

        if local_sgs.group_type &gt; idlest_sgs.group_type
          return idlest                                            (5)

        case group_has_spare:

          if local_sgs.idle_cpus &gt;= idlest_sgs.idle_cpus
            return NULL                                            (6)

Unixbench Tests './Run -c 4 spawn' on:

(a) VM AWS instance (m7gd.16xlarge) with v6.13 ('maxcpus=4 nr_cpus=4')
    and Ubuntu 22.04.5 LTS (aarch64).

    Shell &amp; test run in '/user.slice/user-1000.slice/session-1.scope'.

    w/o patch	w/ patch
    21005	27120

(b) i7-13700K with tip/sched/core ('nosmt maxcpus=8 nr_cpus=8') and
    Ubuntu 22.04.5 LTS (x86_64).

    Shell &amp; test run in '/A'.

    w/o patch	w/ patch
    67675	88806

CONFIG_SCHED_AUTOGROUP=y &amp; /sys/proc/kernel/sched_autogroup_enabled equal
0 or 1.

Reported-by: Hazem Mohamed Abuelfotoh &lt;abuehaze@amazon.com&gt;
Signed-off-by: Dietmar Eggemann &lt;dietmar.eggemann@arm.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Reviewed-by: Vincent Guittot &lt;vincent.guittot@linaro.org&gt;
Tested-by: Hagar Hemdan &lt;hagarhem@amazon.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: https://lore.kernel.org/r/20250314151345.275739-1-dietmar.eggemann@arm.com
</content>
</entry>
<entry>
<title>Merge tag 'sched-urgent-2025-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2025-03-14T19:56:46Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2025-03-14T19:56:46Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=a22ea738f4539afc13aa34a34a631e5aae96bcf8'/>
<id>urn:sha1:a22ea738f4539afc13aa34a34a631e5aae96bcf8</id>
<content type='text'>
Pull scheduler fix from Ingo Molnar:
 "Fix a sleeping-while-atomic bug caused by a recent optimization
  utilizing static keys that didn't consider that the
  static_key_disable() call could be triggered in atomic context.

  Revert the optimization"

* tag 'sched-urgent-2025-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/clock: Don't define sched_clock_irqtime as static key
</content>
</entry>
<entry>
<title>Merge tag 'sched_ext-for-6.14-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext</title>
<updated>2025-03-12T21:52:04Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2025-03-12T21:52:04Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=b7f94fcf55469ad3ef8a74c35b488dbfa314d1bb'/>
<id>urn:sha1:b7f94fcf55469ad3ef8a74c35b488dbfa314d1bb</id>
<content type='text'>
Pull sched_ext fix from Tejun Heo:
 "BPF schedulers could trigger a crash by passing in an invalid CPU to
  the scx_bpf_select_cpu_dfl() helper.

  Fix it by verifying input validity"

* tag 'sched_ext-for-6.14-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()
</content>
</entry>
<entry>
<title>sched/clock: Don't define sched_clock_irqtime as static key</title>
<updated>2025-03-10T13:22:58Z</updated>
<author>
<name>Yafang Shao</name>
<email>laoar.shao@gmail.com</email>
</author>
<published>2025-02-05T03:24:38Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=f3fa0e40df175acd60b71036b9a1fd62310aec03'/>
<id>urn:sha1:f3fa0e40df175acd60b71036b9a1fd62310aec03</id>
<content type='text'>
The sched_clock_irqtime was defined as a static key in:

  8722903cbb8f ("sched: Define sched_clock_irqtime as static key")

However, this change introduces a 'sleeping in atomic context' warning:

	arch/x86/kernel/tsc.c:1214 mark_tsc_unstable()
	warn: sleeping in atomic context

As analyzed by Dan, the affected code path is as follows:

vcpu_load() &lt;- disables preempt
-&gt; kvm_arch_vcpu_load()
   -&gt; mark_tsc_unstable() &lt;- sleeps

virt/kvm/kvm_main.c
   166  void vcpu_load(struct kvm_vcpu *vcpu)
   167  {
   168          int cpu = get_cpu();
                          ^^^^^^^^^^
This get_cpu() disables preemption.

   169
   170          __this_cpu_write(kvm_running_vcpu, vcpu);
   171          preempt_notifier_register(&amp;vcpu-&gt;preempt_notifier);
   172          kvm_arch_vcpu_load(vcpu, cpu);
   173          put_cpu();
   174  }

arch/x86/kvm/x86.c
  4979          if (unlikely(vcpu-&gt;cpu != cpu) || kvm_check_tsc_unstable()) {
  4980                  s64 tsc_delta = !vcpu-&gt;arch.last_host_tsc ? 0 :
  4981                                  rdtsc() - vcpu-&gt;arch.last_host_tsc;
  4982                  if (tsc_delta &lt; 0)
  4983                          mark_tsc_unstable("KVM discovered backwards TSC");

arch/x86/kernel/tsc.c
    1206 void mark_tsc_unstable(char *reason)
    1207 {
    1208         if (tsc_unstable)
    1209                 return;
    1210
    1211         tsc_unstable = 1;
    1212         if (using_native_sched_clock())
    1213                 clear_sched_clock_stable();
--&gt; 1214         disable_sched_clock_irqtime();
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
kernel/jump_label.c
   245  void static_key_disable(struct static_key *key)
   246  {
   247          cpus_read_lock();
                ^^^^^^^^^^^^^^^^
This lock has a might_sleep() in it which triggers the static checker
warning.

   248          static_key_disable_cpuslocked(key);
   249          cpus_read_unlock();
   250  }

Let revert this change for now as {disable,enable}_sched_clock_irqtime
are used in many places, as pointed out by Sean, including the following:

The code path in clocksource_watchdog():

  clocksource_watchdog()
  |
  -&gt; spin_lock(&amp;watchdog_lock);
     |
     -&gt; __clocksource_unstable()
        |
        -&gt; clocksource.mark_unstable() == tsc_cs_mark_unstable()
           |
           -&gt; disable_sched_clock_irqtime()

And the code path in sched_clock_register():

	/* Cannot register a sched_clock with interrupts on */
	local_irq_save(flags);

	...

	/* Enable IRQ time accounting if we have a fast enough sched_clock() */
	if (irqtime &gt; 0 || (irqtime == -1 &amp;&amp; rate &gt;= 1000000))
		enable_sched_clock_irqtime();

	local_irq_restore(flags);

[ lkp@intel.com: reported a build error in the prev version ]
[ mingo: cherry-picked it over into sched/urgent ]

Closes: https://lore.kernel.org/kvm/37a79ba3-9ce0-479c-a5b0-2bd75d573ed3@stanley.mountain/
Fixes: 8722903cbb8f ("sched: Define sched_clock_irqtime as static key")
Reported-by: Dan Carpenter &lt;dan.carpenter@linaro.org&gt;
Debugged-by: Dan Carpenter &lt;dan.carpenter@linaro.org&gt;
Debugged-by: Sean Christopherson &lt;seanjc@google.com&gt;
Debugged-by: Michal Koutný &lt;mkoutny@suse.com&gt;
Signed-off-by: Yafang Shao &lt;laoar.shao@gmail.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Reviewed-by: Vincent Guittot &lt;vincent.guittot@linaro.org&gt;
Link: https://lkml.kernel.org/r/20250205032438.14668-1-laoar.shao@gmail.com
</content>
</entry>
<entry>
<title>sched/deadline: Use online cpus for validating runtime</title>
<updated>2025-03-06T09:21:31Z</updated>
<author>
<name>Shrikanth Hegde</name>
<email>sshegde@linux.ibm.com</email>
</author>
<published>2025-03-06T05:29:53Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=14672f059d83f591afb2ee1fff56858efe055e5a'/>
<id>urn:sha1:14672f059d83f591afb2ee1fff56858efe055e5a</id>
<content type='text'>
The ftrace selftest reported a failure because writing -1 to
sched_rt_runtime_us returns -EBUSY. This happens when the possible
CPUs are different from active CPUs.

Active CPUs are part of one root domain, while remaining CPUs are part
of def_root_domain. Since active cpumask is being used, this results in
cpus=0 when a non active CPUs is used in the loop.

Fix it by looping over the online CPUs instead for validating the
bandwidth calculations.

Signed-off-by: Shrikanth Hegde &lt;sshegde@linux.ibm.com&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Reviewed-by: Juri Lelli &lt;juri.lelli@redhat.com&gt;
Link: https://lore.kernel.org/r/20250306052954.452005-2-sshegde@linux.ibm.com
</content>
</entry>
<entry>
<title>sched/fair: Fix potential memory corruption in child_cfs_rq_on_list</title>
<updated>2025-03-05T16:30:54Z</updated>
<author>
<name>Zecheng Li</name>
<email>zecheng@google.com</email>
</author>
<published>2025-03-04T21:40:31Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=3b4035ddbfc8e4521f85569998a7569668cccf51'/>
<id>urn:sha1:3b4035ddbfc8e4521f85569998a7569668cccf51</id>
<content type='text'>
child_cfs_rq_on_list attempts to convert a 'prev' pointer to a cfs_rq.
This 'prev' pointer can originate from struct rq's leaf_cfs_rq_list,
making the conversion invalid and potentially leading to memory
corruption. Depending on the relative positions of leaf_cfs_rq_list and
the task group (tg) pointer within the struct, this can cause a memory
fault or access garbage data.

The issue arises in list_add_leaf_cfs_rq, where both
cfs_rq-&gt;leaf_cfs_rq_list and rq-&gt;leaf_cfs_rq_list are added to the same
leaf list. Also, rq-&gt;tmp_alone_branch can be set to rq-&gt;leaf_cfs_rq_list.

This adds a check `if (prev == &amp;rq-&gt;leaf_cfs_rq_list)` after the main
conditional in child_cfs_rq_on_list. This ensures that the container_of
operation will convert a correct cfs_rq struct.

This check is sufficient because only cfs_rqs on the same CPU are added
to the list, so verifying the 'prev' pointer against the current rq's list
head is enough.

Fixes a potential memory corruption issue that due to current struct
layout might not be manifesting as a crash but could lead to unpredictable
behavior when the layout changes.

Fixes: fdaba61ef8a2 ("sched/fair: Ensure that the CFS parent is added after unthrottling")
Signed-off-by: Zecheng Li &lt;zecheng@google.com&gt;
Reviewed-and-tested-by: K Prateek Nayak &lt;kprateek.nayak@amd.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Vincent Guittot &lt;vincent.guittot@linaro.org&gt;
Link: https://lore.kernel.org/r/20250304214031.2882646-1-zecheng@google.com
</content>
</entry>
<entry>
<title>sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()</title>
<updated>2025-03-03T17:55:48Z</updated>
<author>
<name>Andrea Righi</name>
<email>arighi@nvidia.com</email>
</author>
<published>2025-03-03T17:51:59Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=9360dfe4cbd62ff1eb8217b815964931523b75b3'/>
<id>urn:sha1:9360dfe4cbd62ff1eb8217b815964931523b75b3</id>
<content type='text'>
If a BPF scheduler provides an invalid CPU (outside the nr_cpu_ids
range) as prev_cpu to scx_bpf_select_cpu_dfl() it can cause a kernel
crash.

To prevent this, validate prev_cpu in scx_bpf_select_cpu_dfl() and
trigger an scx error if an invalid CPU is specified.

Fixes: f0e1a0643a59b ("sched_ext: Implement BPF extensible scheduler class")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'sched-urgent-2025-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2025-03-01T01:00:16Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2025-03-01T01:00:16Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=d203484f2556f47a435cda36ceb9dd83adc9056e'/>
<id>urn:sha1:d203484f2556f47a435cda36ceb9dd83adc9056e</id>
<content type='text'>
Pull scheduler fix from Ingo Molnar:
 "Prevent cond_resched() based preemption when interrupts are disabled,
  on PREEMPT_NONE and PREEMPT_VOLUNTARY kernels"

* tag 'sched-urgent-2025-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Prevent rescheduling when interrupts are disabled
</content>
</entry>
<entry>
<title>sched/core: Prevent rescheduling when interrupts are disabled</title>
<updated>2025-02-27T20:13:57Z</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2024-12-16T13:20:56Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=82c387ef7568c0d96a918a5a78d9cad6256cfa15'/>
<id>urn:sha1:82c387ef7568c0d96a918a5a78d9cad6256cfa15</id>
<content type='text'>
David reported a warning observed while loop testing kexec jump:

  Interrupts enabled after irqrouter_resume+0x0/0x50
  WARNING: CPU: 0 PID: 560 at drivers/base/syscore.c:103 syscore_resume+0x18a/0x220
   kernel_kexec+0xf6/0x180
   __do_sys_reboot+0x206/0x250
   do_syscall_64+0x95/0x180

The corresponding interrupt flag trace:

  hardirqs last  enabled at (15573): [&lt;ffffffffa8281b8e&gt;] __up_console_sem+0x7e/0x90
  hardirqs last disabled at (15580): [&lt;ffffffffa8281b73&gt;] __up_console_sem+0x63/0x90

That means __up_console_sem() was invoked with interrupts enabled. Further
instrumentation revealed that in the interrupt disabled section of kexec
jump one of the syscore_suspend() callbacks woke up a task, which set the
NEED_RESCHED flag. A later callback in the resume path invoked
cond_resched() which in turn led to the invocation of the scheduler:

  __cond_resched+0x21/0x60
  down_timeout+0x18/0x60
  acpi_os_wait_semaphore+0x4c/0x80
  acpi_ut_acquire_mutex+0x3d/0x100
  acpi_ns_get_node+0x27/0x60
  acpi_ns_evaluate+0x1cb/0x2d0
  acpi_rs_set_srs_method_data+0x156/0x190
  acpi_pci_link_set+0x11c/0x290
  irqrouter_resume+0x54/0x60
  syscore_resume+0x6a/0x200
  kernel_kexec+0x145/0x1c0
  __do_sys_reboot+0xeb/0x240
  do_syscall_64+0x95/0x180

This is a long standing problem, which probably got more visible with
the recent printk changes. Something does a task wakeup and the
scheduler sets the NEED_RESCHED flag. cond_resched() sees it set and
invokes schedule() from a completely bogus context. The scheduler
enables interrupts after context switching, which causes the above
warning at the end.

Quite some of the code paths in syscore_suspend()/resume() can result in
triggering a wakeup with the exactly same consequences. They might not
have done so yet, but as they share a lot of code with normal operations
it's just a question of time.

The problem only affects the PREEMPT_NONE and PREEMPT_VOLUNTARY scheduling
models. Full preemption is not affected as cond_resched() is disabled and
the preemption check preemptible() takes the interrupt disabled flag into
account.

Cure the problem by adding a corresponding check into cond_resched().

Reported-by: David Woodhouse &lt;dwmw@amazon.co.uk&gt;
Suggested-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Tested-by: David Woodhouse &lt;dwmw@amazon.co.uk&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: stable@vger.kernel.org
Closes: https://lore.kernel.org/all/7717fe2ac0ce5f0a2c43fdab8b11f4483d54a2a4.camel@infradead.org
</content>
</entry>
</feed>
