<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/kernel/sched/core.c, branch v5.8</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v5.8</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v5.8'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2020-07-24T12:40:25Z</updated>
<entry>
<title>sched: Warn if garbage is passed to default_wake_function()</title>
<updated>2020-07-24T12:40:25Z</updated>
<author>
<name>Chris Wilson</name>
<email>chris@chris-wilson.co.uk</email>
</author>
<published>2020-07-23T20:10:42Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=062d3f95b630113e1156a31f376ad36e25da29a7'/>
<id>urn:sha1:062d3f95b630113e1156a31f376ad36e25da29a7</id>
<content type='text'>
Since the default_wake_function() passes its flags onto
try_to_wake_up(), warn if those flags collide with internal values.

Given that the supplied flags are garbage, no repair can be done, but
we can at least alert the user to the damage they are causing.

In the belief that these errors should be picked up during testing, the
warning is only compiled in under CONFIG_SCHED_DEBUG.
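
As an illustrative sketch (not the literal patch; which internal flag
bits get masked is an assumption here), the guard could look like:

	static int default_wake_function(wait_queue_entry_t *curr, unsigned mode,
					 int wake_flags, void *key)
	{
		/* internal wake flags must not leak in from callers */
		WARN_ON_ONCE(IS_ENABLED(CONFIG_SCHED_DEBUG) &amp;&amp; wake_flags &amp; ~WF_SYNC);
		return try_to_wake_up(curr-&gt;private, mode, wake_flags);
	}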

Signed-off-by: Chris Wilson &lt;chris@chris-wilson.co.uk&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Acked-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Link: https://lore.kernel.org/r/20200723201042.18861-1-chris@chris-wilson.co.uk
</content>
</entry>
<entry>
<title>sched: Fix race against ptrace_freeze_trace()</title>
<updated>2020-07-22T08:22:00Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2020-07-20T15:20:21Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=d136122f58458479fd8926020ba2937de61d7f65'/>
<id>urn:sha1:d136122f58458479fd8926020ba2937de61d7f65</id>
<content type='text'>
There is apparently one site that violates the rule that only current
and ttwu() will modify task-&gt;state: ptrace_{,un}freeze_traced() will
change task-&gt;state for a remote task.

Oleg explains:

  "TASK_TRACED/TASK_STOPPED was always protected by siglock. In
  particular, ttwu(__TASK_TRACED) must be always called with siglock
  held. That is why ptrace_freeze_traced() assumes it can safely do
  s/TASK_TRACED/__TASK_TRACED/ under spin_lock(siglock)."

This breaks the ordering scheme introduced by commit:

  dbfb089d360b ("sched: Fix loadavg accounting race")

Specifically, the prev-&gt;state reload not matching no longer implies
that we don't have to block.

Simplify things by noting that what we need is a LOAD-&gt;STORE ordering
and this can be provided by a control dependency.

So replace:

	prev_state = prev-&gt;state;
	raw_spin_lock(&amp;rq-&gt;lock);
	smp_mb__after_spinlock(); /* SMP-MB */
	if (... &amp;&amp; prev_state &amp;&amp; prev_state == prev-&gt;state)
		deactivate_task();

with:

	prev_state = prev-&gt;state;
	if (... &amp;&amp; prev_state) /* CTRL-DEP */
		deactivate_task();

Since that already implies the 'prev-&gt;state' load must be complete
before allowing the 'prev-&gt;on_rq = 0' store to become visible.

Fixes: dbfb089d360b ("sched: Fix loadavg accounting race")
Reported-by: Jiri Slaby &lt;jirislaby@kernel.org&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Tested-by: Paul Gortmaker &lt;paul.gortmaker@windriver.com&gt;
Tested-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
</content>
</entry>
<entry>
<title>sched: Fix unreliable rseq cpu_id for new tasks</title>
<updated>2020-07-08T09:38:50Z</updated>
<author>
<name>Mathieu Desnoyers</name>
<email>mathieu.desnoyers@efficios.com</email>
</author>
<published>2020-07-06T20:49:10Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=ce3614daabea8a2d01c1dd17ae41d1ec5e5ae7db'/>
<id>urn:sha1:ce3614daabea8a2d01c1dd17ae41d1ec5e5ae7db</id>
<content type='text'>
While integrating rseq into glibc and replacing glibc's sched_getcpu
implementation with rseq, glibc's tests discovered an issue: an
incorrect __rseq_abi.cpu_id field value right after the first time
a newly created process issues sched_setaffinity.

For the record, it triggers after building glibc and running tests, and
then issuing:

  for x in {1..2000} ; do posix/tst-affinity-static  &amp; done

and shows up as:

error: Unexpected CPU 2, expected 0
error: Unexpected CPU 2, expected 0
error: Unexpected CPU 2, expected 0
error: Unexpected CPU 2, expected 0
error: Unexpected CPU 138, expected 0
error: Unexpected CPU 138, expected 0
error: Unexpected CPU 138, expected 0
error: Unexpected CPU 138, expected 0

This is caused by the scheduler invoking __set_task_cpu() directly from
sched_fork() and wake_up_new_task(), thus bypassing rseq_migrate() which
is done by set_task_cpu().

Add the missing rseq_migrate() to both functions. The only other direct
use of __set_task_cpu() is done by init_idle(), which does not involve a
user-space task.

Based on my testing with the glibc test-case, just adding rseq_migrate()
to wake_up_new_task() is sufficient to fix the observed issue. Also add
it to sched_fork() to keep things consistent.
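
A sketch of the shape of the fix (paraphrased, not the verbatim diff):

	/* in sched_fork() and wake_up_new_task() */
	__set_task_cpu(p, cpu);
	rseq_migrate(p);	/* previously only done via set_task_cpu() */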

The reason why this never triggered so far with the rseq/basic_test
selftest is unclear.

The current use of sched_getcpu(3) does not typically require it to be
always accurate. However, use of the __rseq_abi.cpu_id field within rseq
critical sections requires it to be accurate. If it is not accurate, it
can cause corruption in the per-cpu data targeted by rseq critical
sections in user-space.

Reported-by: Florian Weimer &lt;fweimer@redhat.com&gt;
Signed-off-by: Mathieu Desnoyers &lt;mathieu.desnoyers@efficios.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Tested-by: Florian Weimer &lt;fweimer@redhat.com&gt;
Cc: stable@vger.kernel.org # v4.18+
Link: https://lkml.kernel.org/r/20200707201505.2632-1-mathieu.desnoyers@efficios.com
</content>
</entry>
<entry>
<title>sched: Fix loadavg accounting race</title>
<updated>2020-07-08T09:38:49Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2020-07-03T10:40:33Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=dbfb089d360b1cc623c51a2c7cf9b99eff78e0e7'/>
<id>urn:sha1:dbfb089d360b1cc623c51a2c7cf9b99eff78e0e7</id>
<content type='text'>
The recent commit:

  c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p-&gt;on_cpu")

moved these lines in ttwu():

	p-&gt;sched_contributes_to_load = !!task_contributes_to_load(p);
	p-&gt;state = TASK_WAKING;

up before:

	smp_cond_load_acquire(&amp;p-&gt;on_cpu, !VAL);

into the 'p-&gt;on_rq == 0' block, with the thinking that once we hit
schedule() the current task cannot change its -&gt;state anymore. And
while this is true, it is both incorrect and flawed.

It is incorrect in that we need at least an ACQUIRE on 'p-&gt;on_rq == 0'
to keep weakly ordered hardware from re-ordering things on us. This can
fairly easily be achieved by relying on the control-dependency already
in place.

The second problem, which is the flaw in the original argument, is
that while schedule() will not change prev-&gt;state, it will read it a
number of times (arguably too many times since it's marked volatile).
The previous condition 'p-&gt;on_cpu == 0' was sufficient because that
indicates schedule() has completed, and will no longer read
prev-&gt;state. So now the trick is to make this same guarantee hold for
the (much) earlier 'prev-&gt;on_rq == 0' case.

Furthermore, in order to make the ordering stick, the 'prev-&gt;on_rq = 0'
assignment needs to be a RELEASE, but adding additional ordering to
schedule() is an unwelcome proposition at the best of times, doubly so
for mere accounting.

Luckily we can push the prev-&gt;state load up before rq-&gt;lock, with the
only caveat that we then have to re-read the state after. However, we
know that if it changed, we no longer have to worry about the blocking
path. This gives us the required ordering: if we block, we did the
prev-&gt;state load before an (effective) smp_mb(), and the p-&gt;on_rq store
need not change.

With this we end up with the effective ordering:

	LOAD p-&gt;state           LOAD-ACQUIRE p-&gt;on_rq == 0
	MB
	STORE p-&gt;on_rq, 0       STORE p-&gt;state, TASK_WAKING

which ensures the TASK_WAKING store happens after the prev-&gt;state
load, and all is well again.
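
In code, the blocking path then takes roughly this shape (a sketch, not
the verbatim diff):

	prev_state = prev-&gt;state;
	raw_spin_lock(&amp;rq-&gt;lock);
	smp_mb__after_spinlock();	/* the (effective) smp_mb() */
	if (... &amp;&amp; prev_state &amp;&amp; prev_state == prev-&gt;state)
		deactivate_task();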

Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p-&gt;on_cpu")
Reported-by: Dave Jones &lt;davej@codemonkey.org.uk&gt;
Reported-by: Paul Gortmaker &lt;paul.gortmaker@windriver.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Tested-by: Dave Jones &lt;davej@codemonkey.org.uk&gt;
Tested-by: Paul Gortmaker &lt;paul.gortmaker@windriver.com&gt;
Link: https://lkml.kernel.org/r/20200707102957.GN117543@hirez.programming.kicks-ass.net
</content>
</entry>
<entry>
<title>smp, irq_work: Continue smp_call_function*() and irq_work*() integration</title>
<updated>2020-06-28T15:01:20Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2020-06-22T10:01:25Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=8c4890d1c3358fb8023d46e1e554c41d54f02878'/>
<id>urn:sha1:8c4890d1c3358fb8023d46e1e554c41d54f02878</id>
<content type='text'>
Instead of relying on BUG_ON() to ensure the various data structures
line up, use a bunch of horrible unions to make it all automatic.

Much of the union magic is to ensure irq_work and smp_call_function do
not (yet) see the members of their respective data structures change
name.

Suggested-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Reviewed-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Link: https://lkml.kernel.org/r/20200622100825.844455025@infradead.org
</content>
</entry>
<entry>
<title>sched/core: s/WF_ON_RQ/WF_ON_CPU/</title>
<updated>2020-06-28T15:01:20Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2020-06-22T10:01:24Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=739f70b476cf05c5a424b42a8b5728914345610c'/>
<id>urn:sha1:739f70b476cf05c5a424b42a8b5728914345610c</id>
<content type='text'>
Use a better name for this poorly named flag, to avoid confusion...

Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Acked-by: Mel Gorman &lt;mgorman@suse.de&gt;
Link: https://lkml.kernel.org/r/20200622100825.785115830@infradead.org
</content>
</entry>
<entry>
<title>sched/core: Fix ttwu() race</title>
<updated>2020-06-28T15:01:20Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2020-06-22T10:01:23Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=b6e13e85829f032411b896bd2f0d6cbe4b0a3c4a'/>
<id>urn:sha1:b6e13e85829f032411b896bd2f0d6cbe4b0a3c4a</id>
<content type='text'>
Paul reported rcutorture occasionally hitting a NULL deref:

  sched_ttwu_pending()
    ttwu_do_wakeup()
      check_preempt_curr() := check_preempt_wakeup()
        find_matching_se()
          is_same_group()
            if (se-&gt;cfs_rq == pse-&gt;cfs_rq) &lt;-- *BOOM*

Debugging showed that this only appears to happen when we take the new
code-path from commit:

  2ebb17717550 ("sched/core: Offload wakee task activation if it the wakee is descheduling")

and only when @cpu == smp_processor_id(), something which should not
be possible, because p-&gt;on_cpu can only be true for remote tasks.
Similarly, without the new code-path from commit:

  c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p-&gt;on_cpu")

this would've unconditionally hit:

  smp_cond_load_acquire(&amp;p-&gt;on_cpu, !VAL);

and if: 'cpu == smp_processor_id() &amp;&amp; p-&gt;on_cpu' is possible, this
would result in an instant live-lock (with IRQs disabled), something
that hasn't been reported.

The NULL deref can be explained however if the task_cpu(p) load at the
beginning of try_to_wake_up() returns an old value, and this old value
happens to be smp_processor_id(). Further assume that the p-&gt;on_cpu
load accurately returns 1, it really is still running, just not here.

Then, when we enqueue the task locally, we can crash in exactly the
observed manner because p-&gt;se.cfs_rq != rq-&gt;cfs_rq, because p's cfs_rq
is from the wrong CPU, therefore we'll iterate into the non-existent
parents and NULL deref.

The closest semi-plausible scenario I've managed to contrive is
somewhat elaborate (then again, actual reproduction takes many CPU
hours of rcutorture, so it can't be anything obvious):

					X-&gt;cpu = 1
					rq(1)-&gt;curr = X

	CPU0				CPU1				CPU2

					// switch away from X
					LOCK rq(1)-&gt;lock
					smp_mb__after_spinlock
					dequeue_task(X)
					  X-&gt;on_rq = 0
					switch_to(Z)
					  X-&gt;on_cpu = 0
					UNLOCK rq(1)-&gt;lock

									// migrate X to cpu 0
									LOCK rq(1)-&gt;lock
									dequeue_task(X)
									set_task_cpu(X, 0)
									  X-&gt;cpu = 0
									UNLOCK rq(1)-&gt;lock

									LOCK rq(0)-&gt;lock
									enqueue_task(X)
									  X-&gt;on_rq = 1
									UNLOCK rq(0)-&gt;lock

	// switch to X
	LOCK rq(0)-&gt;lock
	smp_mb__after_spinlock
	switch_to(X)
	  X-&gt;on_cpu = 1
	UNLOCK rq(0)-&gt;lock

	// X goes sleep
	X-&gt;state = TASK_UNINTERRUPTIBLE
	smp_mb();			// wake X
					ttwu()
					  LOCK X-&gt;pi_lock
					  smp_mb__after_spinlock

					  if (p-&gt;state)

					  cpu = X-&gt;cpu; // =? 1

					  smp_rmb()

	// X calls schedule()
	LOCK rq(0)-&gt;lock
	smp_mb__after_spinlock
	dequeue_task(X)
	  X-&gt;on_rq = 0

					  if (p-&gt;on_rq)

					  smp_rmb();

					  if (p-&gt;on_cpu &amp;&amp; ttwu_queue_wakelist(..)) [*]

					  smp_cond_load_acquire(&amp;p-&gt;on_cpu, !VAL)

					  cpu = select_task_rq(X, X-&gt;wake_cpu, ...)
					  if (X-&gt;cpu != cpu)
	switch_to(Y)
	  X-&gt;on_cpu = 0
	UNLOCK rq(0)-&gt;lock

However I'm having trouble convincing myself that's actually possible
on x86_64 -- after all, every LOCK implies an smp_mb() there, so if ttwu
observes -&gt;state != RUNNING, it must also observe -&gt;cpu != 1.

(Most of the previous ttwu() races were found on very large PowerPC
systems.)

Nevertheless, this fully explains the observed failure case.

Fix it by ordering the task_cpu(p) load after the p-&gt;on_cpu load,
which is easy since nothing actually uses @cpu before this.
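
A sketch of the resulting shape (paraphrased; the exact flags argument
is an assumption):

	/* ensure task_cpu(p) is read only after the p-&gt;on_cpu load */
	smp_rmb();
	if (READ_ONCE(p-&gt;on_cpu) &amp;&amp; ttwu_queue_wakelist(p, task_cpu(p), wake_flags))
		goto unlock;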

Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p-&gt;on_cpu")
Reported-by: Paul E. McKenney &lt;paulmck@kernel.org&gt;
Tested-by: Paul E. McKenney &lt;paulmck@kernel.org&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Link: https://lkml.kernel.org/r/20200622125649.GC576871@hirez.programming.kicks-ass.net
</content>
</entry>
<entry>
<title>sched/core: Fix PI boosting between RT and DEADLINE tasks</title>
<updated>2020-06-28T15:01:20Z</updated>
<author>
<name>Juri Lelli</name>
<email>juri.lelli@redhat.com</email>
</author>
<published>2018-11-19T15:32:01Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=740797ce3a124b7dd22b7fb832d87bc8fba1cf6f'/>
<id>urn:sha1:740797ce3a124b7dd22b7fb832d87bc8fba1cf6f</id>
<content type='text'>
syzbot reported the following warning:

 WARNING: CPU: 1 PID: 6351 at kernel/sched/deadline.c:628
 enqueue_task_dl+0x22da/0x38a0 kernel/sched/deadline.c:1504

At deadline.c:628 we have:

 623 static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
 624 {
 625 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 626 	struct rq *rq = rq_of_dl_rq(dl_rq);
 627
 628 	WARN_ON(dl_se-&gt;dl_boosted);
 629 	WARN_ON(dl_time_before(rq_clock(rq), dl_se-&gt;deadline));
        [...]
     }

Which means that setup_new_dl_entity() has been called on a task that is
currently boosted. This shouldn't happen though, as setup_new_dl_entity()
is only called when the 'dynamic' deadline of the new entity
is in the past w.r.t. rq_clock, and boosted tasks shouldn't satisfy this
condition.

Digging through the PI code I noticed that the above might in fact happen
if an RT task blocks on an rt_mutex held by a DEADLINE task. In the
first branch of the boosting conditions we check only whether a pi_task's
'dynamic' deadline is earlier than the mutex holder's, and in this case
we set the mutex holder to be dl_boosted. However, since RT 'dynamic'
deadlines are only initialized if such tasks get boosted at some point
(or if they become DEADLINE, of course), in general RT 'dynamic'
deadlines are usually equal to 0, and this satisfies the aforementioned
condition.

Fix it by checking that the potential donor task is actually (even if
only temporarily, because boosted in turn) running at DEADLINE priority
before using its 'dynamic' deadline value.
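
Roughly, the boosting condition then becomes (paraphrased, not the
verbatim diff):

	/* only trust pi_task's 'dynamic' deadline if it runs at DEADLINE prio */
	if (!dl_prio(p-&gt;normal_prio) ||
	    (pi_task &amp;&amp; dl_prio(pi_task-&gt;prio) &amp;&amp;
	     dl_entity_preempt(&amp;pi_task-&gt;dl, &amp;p-&gt;dl)))
		p-&gt;dl.dl_boosted = 1;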

Fixes: 2d3d891d3344 ("sched/deadline: Add SCHED_DEADLINE inheritance logic")
Reported-by: syzbot+119ba87189432ead09b4@syzkaller.appspotmail.com
Signed-off-by: Juri Lelli &lt;juri.lelli@redhat.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Reviewed-by: Daniel Bristot de Oliveira &lt;bristot@redhat.com&gt;
Tested-by: Daniel Wagner &lt;dwagner@suse.de&gt;
Link: https://lkml.kernel.org/r/20181119153201.GB2119@localhost.localdomain
</content>
</entry>
<entry>
<title>sched/core: Check cpus_mask, not cpus_ptr in __set_cpus_allowed_ptr(), to fix mask corruption</title>
<updated>2020-06-28T15:01:20Z</updated>
<author>
<name>Scott Wood</name>
<email>swood@redhat.com</email>
</author>
<published>2020-06-17T12:17:42Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=fd844ba9ae59b51e34e77105d79f8eca780b3bd6'/>
<id>urn:sha1:fd844ba9ae59b51e34e77105d79f8eca780b3bd6</id>
<content type='text'>
This function is concerned with the long-term CPU mask, not the
transitory mask the task might have while migrate disabled.  Before
this patch, if a task was migrate-disabled at the time
__set_cpus_allowed_ptr() was called, and the new mask happened to be
equal to the CPU that the task was running on, then the mask update
would be lost.
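
The shape of the fix is a one-word change in the early-exit check
(paraphrased):

	/* compare against the long-term mask, not the migrate-disable view */
	if (cpumask_equal(&amp;p-&gt;cpus_mask, new_mask))
		goto out;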

Signed-off-by: Scott Wood &lt;swood@redhat.com&gt;
Signed-off-by: Sebastian Andrzej Siewior &lt;bigeasy@linutronix.de&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Link: https://lkml.kernel.org/r/20200617121742.cpxppyi7twxmpin7@linutronix.de
</content>
</entry>
<entry>
<title>kernel: rename show_stack_loglvl() =&gt; show_stack()</title>
<updated>2020-06-09T16:39:13Z</updated>
<author>
<name>Dmitry Safonov</name>
<email>dima@arista.com</email>
</author>
<published>2020-06-09T04:32:29Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=9cb8f069deeed708bf19486d5893e297dc467ae0'/>
<id>urn:sha1:9cb8f069deeed708bf19486d5893e297dc467ae0</id>
<content type='text'>
Now that the last users of show_stack() have been converted to use an
explicit log level, show_stack_loglvl() can drop its redundant suffix
and once again become the well-known show_stack().

Signed-off-by: Dmitry Safonov &lt;dima@arista.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/20200418201944.482088-51-dima@arista.com
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
