<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/kernel/sched, branch v5.10</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/</subtitle>
<id>https://git.shady.money/linux/atom?h=v5.10</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v5.10'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2020-12-09T08:37:43Z</updated>
<entry>
<title>membarrier: Execute SYNC_CORE on the calling thread</title>
<updated>2020-12-09T08:37:43Z</updated>
<author>
<name>Andy Lutomirski</name>
<email>luto@kernel.org</email>
</author>
<published>2020-12-04T05:07:06Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=e45cdc71d1fa5ac3a57b23acc31eb959e4f60135'/>
<id>urn:sha1:e45cdc71d1fa5ac3a57b23acc31eb959e4f60135</id>
<content type='text'>
membarrier()'s MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE is documented as
syncing the core on all sibling threads but not necessarily the calling
thread.  This behavior is fundamentally buggy and cannot be used safely.

Suppose a user program has two threads.  Thread A is on CPU 0 and thread B
is on CPU 1.  Thread A modifies some text and calls
membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE).

Then thread B executes the modified code.  At any point after
membarrier() decides which CPUs to target, thread A could be preempted and
replaced by thread B on CPU 0.  This could even happen on exit from the
membarrier() syscall.  If this happens, thread B will end up running on CPU
0 without having synced.

In principle, this could be fixed by arranging for the scheduler to issue
sync_core_before_usermode() whenever switching between two threads in the
same mm if there is any possibility of a concurrent membarrier() call, but
this would have considerable overhead.  Instead, make membarrier() sync the
calling CPU as well.

As an optimization, this avoids an extra smp_mb() in the default
barrier-only mode and an extra rseq preempt on the caller.
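
For illustration, a minimal user-space sketch of the cross-modifying-code
pattern described above (patch_text() and publish_new_code() are
hypothetical names; only the membarrier() commands are part of the
kernel ABI):

 #include &lt;linux/membarrier.h&gt;   /* MEMBARRIER_CMD_* */
 #include &lt;sys/syscall.h&gt;
 #include &lt;unistd.h&gt;
 #include &lt;stddef.h&gt;

 static long membarrier(int cmd, unsigned int flags)
 {
         return syscall(__NR_membarrier, cmd, flags);
 }

 /* Hypothetical helper that rewrites executable memory in place. */
 extern void patch_text(void *dst, const void *src, size_t len);

 /* Thread A: patch, then core-sync every thread of the process --
  * including, after this change, the calling thread itself -- before
  * anyone is allowed to jump into the new code.  The process must have
  * called MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE once at
  * startup. */
 void publish_new_code(void *dst, const void *src, size_t len)
 {
         patch_text(dst, src, len);
         membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0);
 }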

Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
Signed-off-by: Andy Lutomirski &lt;luto@kernel.org&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Mathieu Desnoyers &lt;mathieu.desnoyers@efficios.com&gt;
Link: https://lore.kernel.org/r/250ded637696d490c69bef1877148db86066881c.1607058304.git.luto@kernel.org

</content>
</entry>
<entry>
<title>membarrier: Explicitly sync remote cores when SYNC_CORE is requested</title>
<updated>2020-12-09T08:37:43Z</updated>
<author>
<name>Andy Lutomirski</name>
<email>luto@kernel.org</email>
</author>
<published>2020-12-04T05:07:05Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=758c9373d84168dc7d039cf85a0e920046b17b41'/>
<id>urn:sha1:758c9373d84168dc7d039cf85a0e920046b17b41</id>
<content type='text'>
membarrier() does not explicitly sync_core() remote CPUs; instead, it
relies on the assumption that an IPI will result in a core sync.  On x86,
this may be true in practice, but it's not architecturally reliable.  In
particular, the SDM and APM do not appear to guarantee that interrupt
delivery is serializing.  While IRET does serialize, IPI return can
schedule, thereby switching to another task in the same mm that was
sleeping in a syscall.  The new task could then SYSRET back to usermode
without ever executing IRET.

Make this more robust by explicitly calling sync_core_before_usermode()
on remote cores.  (This also helps people who search the kernel tree for
instances of sync_core() and sync_core_before_usermode() -- one might be
surprised that the core membarrier code doesn't currently show up in
such a search.)
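
A rough sketch of the approach, not the literal patch: perform the core
serialization explicitly in the membarrier IPI handler instead of
relying on interrupt delivery or IRET to serialize:

 static void ipi_sync_core(void *info)
 {
         /* IPI delivery is probably serializing in practice, but that
          * is not architecturally guaranteed, so serialize explicitly
          * before the remote task can return to usermode. */
         smp_mb();
         sync_core_before_usermode();
 }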

Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
Signed-off-by: Andy Lutomirski &lt;luto@kernel.org&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Mathieu Desnoyers &lt;mathieu.desnoyers@efficios.com&gt;
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/776b448d5f7bd6b12690707f5ed67bcda7f1d427.1607058304.git.luto@kernel.org

</content>
</entry>
<entry>
<title>membarrier: Add an actual barrier before rseq_preempt()</title>
<updated>2020-12-09T08:37:43Z</updated>
<author>
<name>Andy Lutomirski</name>
<email>luto@kernel.org</email>
</author>
<published>2020-12-04T05:07:04Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=2ecedd7569080fd05c1a457e8af2165afecfa29f'/>
<id>urn:sha1:2ecedd7569080fd05c1a457e8af2165afecfa29f</id>
<content type='text'>
It seems that most RSEQ membarrier users will expect any stores done before
the membarrier() syscall to be visible to the target task(s).  While this
is extremely likely to be true in practice, nothing actually guarantees it
by a strict reading of the x86 manuals.  Rather than providing this
guarantee by accident and potentially causing a problem down the road, just
add an explicit barrier.
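
A rough sketch of the idea, not the literal patch: issue an explicit
full barrier on the target CPU before the rseq preemption, so that
stores done by the caller before the syscall are visible to the target
task:

 static void ipi_rseq(void *info)
 {
         /* Order the membarrier() caller's prior stores before the
          * target task observes the rseq preemption. */
         smp_mb();
         rseq_preempt(current);
 }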

Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
Signed-off-by: Andy Lutomirski &lt;luto@kernel.org&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Mathieu Desnoyers &lt;mathieu.desnoyers@efficios.com&gt;
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/d3e7197e034fa4852afcf370ca49c30496e58e40.1607058304.git.luto@kernel.org

</content>
</entry>
<entry>
<title>Merge tag 'locking-urgent-2020-11-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2020-11-29T19:19:26Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2020-11-29T19:19:26Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=f91a3aa6bce480fe6e08df540129f4a923222419'/>
<id>urn:sha1:f91a3aa6bce480fe6e08df540129f4a923222419</id>
<content type='text'>
Pull locking fixes from Thomas Gleixner:
 "Two more places which invoke tracing from RCU disabled regions in the
  idle path.

  Similar to the entry path, the low-level idle functions have to be
  non-instrumentable"

* tag 'locking-urgent-2020-11-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  intel_idle: Fix intel_idle() vs tracing
  sched/idle: Fix arch_cpu_idle() vs tracing
</content>
</entry>
<entry>
<title>sched/idle: Fix arch_cpu_idle() vs tracing</title>
<updated>2020-11-24T15:47:35Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2020-11-20T10:50:35Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=58c644ba512cfbc2e39b758dd979edd1d6d00e27'/>
<id>urn:sha1:58c644ba512cfbc2e39b758dd979edd1d6d00e27</id>
<content type='text'>
We call arch_cpu_idle() with RCU disabled, but then use
local_irq_{en,dis}able(), which invokes tracing, which relies on RCU.

Switch all arch_cpu_idle() implementations to use
raw_local_irq_{en,dis}able() and carefully manage the lockdep, RCU and
tracing state like we do in the entry code.

(XXX: we really should change arch_cpu_idle() to not return with
interrupts enabled)
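
A rough sketch of the conversion, modelled on the arm64-style
implementation (other architectures differ): use the raw IRQ helper so
no tracing is invoked while RCU is still disabled:

 void arch_cpu_idle(void)
 {
         /* Do the clock switching / wait-for-interrupt dance ... */
         cpu_do_idle();
         /* ... then re-enable IRQs without going through the traced
          * local_irq_enable() path, since RCU is not watching here. */
         raw_local_irq_enable();
 }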

Reported-by: Sven Schnelle &lt;svens@linux.ibm.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Mark Rutland &lt;mark.rutland@arm.com&gt;
Tested-by: Mark Rutland &lt;mark.rutland@arm.com&gt;
Link: https://lkml.kernel.org/r/20201120114925.594122626@infradead.org
</content>
</entry>
<entry>
<title>Merge tag 'sched-urgent-2020-11-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2020-11-22T21:26:07Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2020-11-22T21:26:07Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=f4b936f5d6fd0625a78a7b4b92e98739a2bdb6f7'/>
<id>urn:sha1:f4b936f5d6fd0625a78a7b4b92e98739a2bdb6f7</id>
<content type='text'>
Pull scheduler fixes from Thomas Gleixner:
 "A couple of scheduler fixes:

   - Make the conditional update of the overutilized state work
     correctly by caching the relevant flags state before overwriting
     them and checking them afterwards.

   - Fix a data race in the wakeup path which caused loadavg on ARM64
     platforms to become a random number generator.

   - Fix the ordering of the iowaiter accounting operations so it can't
     be decremented before it is incremented.

   - Fix a bug in the deadline scheduler vs. priority inheritance when a
     non-deadline task A has inherited the parameters of a deadline task
     B and then blocks on a non-deadline task C.

     The second inheritance step used the static deadline parameters of
     task A, which are usually 0, instead of further propagating task
     B's parameters. The zero initialized parameters trigger a bug in
     the deadline scheduler"

* tag 'sched-urgent-2020-11-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/deadline: Fix priority inheritance with multiple scheduling classes
  sched: Fix rq-&gt;nr_iowait ordering
  sched: Fix data-race in wakeup
  sched/fair: Fix overutilized update in enqueue_task_fair()
</content>
</entry>
<entry>
<title>sched/deadline: Fix priority inheritance with multiple scheduling classes</title>
<updated>2020-11-17T12:15:28Z</updated>
<author>
<name>Juri Lelli</name>
<email>juri.lelli@redhat.com</email>
</author>
<published>2020-11-17T06:14:32Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=2279f540ea7d05f22d2f0c4224319330228586bc'/>
<id>urn:sha1:2279f540ea7d05f22d2f0c4224319330228586bc</id>
<content type='text'>
Glenn reported that "an application [he developed produces] a BUG in
deadline.c when a SCHED_DEADLINE task contends with CFS tasks on nested
PTHREAD_PRIO_INHERIT mutexes.  I believe the bug is triggered when a CFS
task that was boosted by a SCHED_DEADLINE task boosts another CFS task
(nested priority inheritance)."

 ------------[ cut here ]------------
 kernel BUG at kernel/sched/deadline.c:1462!
 invalid opcode: 0000 [#1] PREEMPT SMP
 CPU: 12 PID: 19171 Comm: dl_boost_bug Tainted: ...
 Hardware name: ...
 RIP: 0010:enqueue_task_dl+0x335/0x910
 Code: ...
 RSP: 0018:ffffc9000c2bbc68 EFLAGS: 00010002
 RAX: 0000000000000009 RBX: ffff888c0af94c00 RCX: ffffffff81e12500
 RDX: 000000000000002e RSI: ffff888c0af94c00 RDI: ffff888c10b22600
 RBP: ffffc9000c2bbd08 R08: 0000000000000009 R09: 0000000000000078
 R10: ffffffff81e12440 R11: ffffffff81e1236c R12: ffff888bc8932600
 R13: ffff888c0af94eb8 R14: ffff888c10b22600 R15: ffff888bc8932600
 FS:  00007fa58ac55700(0000) GS:ffff888c10b00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007fa58b523230 CR3: 0000000bf44ab003 CR4: 00000000007606e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 PKRU: 55555554
 Call Trace:
  ? intel_pstate_update_util_hwp+0x13/0x170
  rt_mutex_setprio+0x1cc/0x4b0
  task_blocks_on_rt_mutex+0x225/0x260
  rt_spin_lock_slowlock_locked+0xab/0x2d0
  rt_spin_lock_slowlock+0x50/0x80
  hrtimer_grab_expiry_lock+0x20/0x30
  hrtimer_cancel+0x13/0x30
  do_nanosleep+0xa0/0x150
  hrtimer_nanosleep+0xe1/0x230
  ? __hrtimer_init_sleeper+0x60/0x60
  __x64_sys_nanosleep+0x8d/0xa0
  do_syscall_64+0x4a/0x100
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
 RIP: 0033:0x7fa58b52330d
 ...
 ---[ end trace 0000000000000002 ]---

He also provided a simple reproducer creating the situation below:

 So the execution order of the locking steps is the following
 (N1 and N2 are non-deadline tasks. D1 is a deadline task. M1 and M2
 are mutexes that are enabled with priority inheritance.)

 Time moves forward as this timeline goes down:

 N1              N2               D1
 |               |                |
 |               |                |
 Lock(M1)        |                |
 |               |                |
 |             Lock(M2)           |
 |               |                |
 |               |              Lock(M2)
 |               |                |
 |             Lock(M1)           |
 |             (!!bug triggered!) |

Daniel reported a similar situation as well, by just letting ksoftirqd
run with DEADLINE (and eventually block on a mutex).

The problem is that boosted entities (priority inheritance) use the static
DEADLINE parameters of the top-priority waiter. However, there might be
cases where the top waiter is a non-DEADLINE entity that is currently
boosted by a DEADLINE entity from a different lock chain (i.e., nested
priority chains involving entities of non-DEADLINE classes). In this
case, the top waiter's static DEADLINE parameters could be null (initialized
to 0 at fork()) and replenish_dl_entity() would hit a BUG().

Fix this by keeping track of the original donor and using its parameters
when a task is boosted.
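
A hedged conceptual sketch, not the actual hunk (get_dl_donor() is a
hypothetical helper standing in for the donor tracking the patch adds):
when enqueueing a boosted entity, use the donor's deadline parameters
rather than the boosted task's own, possibly zero, static parameters:

 struct sched_dl_entity *pi_se = &amp;p-&gt;dl;

 if (is_dl_boosted(&amp;p-&gt;dl))
         pi_se = get_dl_donor(p);  /* hypothetical: original donor's parameters */

 enqueue_dl_entity(&amp;p-&gt;dl, pi_se, flags);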

Reported-by: Glenn Elliott &lt;glenn@aurora.tech&gt;
Reported-by: Daniel Bristot de Oliveira &lt;bristot@redhat.com&gt;
Signed-off-by: Juri Lelli &lt;juri.lelli@redhat.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Tested-by: Daniel Bristot de Oliveira &lt;bristot@redhat.com&gt;
Link: https://lkml.kernel.org/r/20201117061432.517340-1-juri.lelli@redhat.com
</content>
</entry>
<entry>
<title>sched: Fix rq-&gt;nr_iowait ordering</title>
<updated>2020-11-17T12:15:28Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2020-09-24T11:50:42Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=ec618b84f6e15281cc3660664d34cd0dd2f2579e'/>
<id>urn:sha1:ec618b84f6e15281cc3660664d34cd0dd2f2579e</id>
<content type='text'>
  schedule()				ttwu()
    deactivate_task();			  if (p-&gt;on_rq &amp;&amp; ...) // false
					    atomic_dec(&amp;task_rq(p)-&gt;nr_iowait);
    if (prev-&gt;in_iowait)
      atomic_inc(&amp;rq-&gt;nr_iowait);

This allows nr_iowait to be decremented before it gets incremented,
resulting in more dodgy IO-wait numbers than usual.

Note that because we can now do ttwu_queue_wakelist() before
p-&gt;on_cpu==0, we lose the natural ordering and have to further delay
the decrement.
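
A hedged sketch of the intent of the fix, not the literal hunk
(simplified signature): perform the nr_iowait decrement only once the
wakeup is past the point where schedule()'s increment is guaranteed to
be ordered before it, e.g. together with the activation under the
runqueue lock:

 static void ttwu_do_activate(struct rq *rq, struct task_struct *p, int en_flags)
 {
         if (p-&gt;in_iowait) {
                 delayacct_blkio_end(p);
                 atomic_dec(&amp;task_rq(p)-&gt;nr_iowait);
         }
         activate_task(rq, p, en_flags);
 }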

Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p-&gt;on_cpu")
Reported-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Link: https://lkml.kernel.org/r/20201117093829.GD3121429@hirez.programming.kicks-ass.net
</content>
</entry>
<entry>
<title>sched/fair: Fix overutilized update in enqueue_task_fair()</title>
<updated>2020-11-17T12:15:27Z</updated>
<author>
<name>Quentin Perret</name>
<email>qperret@google.com</email>
</author>
<published>2020-11-12T11:12:01Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=8e1ac4299a6e8726de42310d9c1379f188140c71'/>
<id>urn:sha1:8e1ac4299a6e8726de42310d9c1379f188140c71</id>
<content type='text'>
enqueue_task_fair() attempts to skip the overutilized update for new
tasks as their util_avg is not accurate yet. However, the flag we check
to do so is overwritten earlier on in the function, which makes the
condition pretty much a nop.

Fix this by saving the flag early on.
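
A hedged sketch of the fix, close to but not necessarily identical to
the actual hunk: snapshot the "new task" information before the enqueue
loops rewrite 'flags', then test the snapshot afterwards:

 static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
         int task_new = !(flags &amp; ENQUEUE_WAKEUP);  /* saved before 'flags' is reused */

         /* ... per-cfs_rq enqueue loops that overwrite 'flags' ... */

         if (!task_new)
                 update_overutilized_status(rq);
 }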

Fixes: 2802bf3cd936 ("sched/fair: Add over-utilization/tipping point indicator")
Reported-by: Rick Yiu &lt;rickyiu@google.com&gt;
Signed-off-by: Quentin Perret &lt;qperret@google.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Vincent Guittot &lt;vincent.guittot@linaro.org&gt;
Reviewed-by: Valentin Schneider &lt;valentin.schneider@arm.com&gt;
Link: https://lkml.kernel.org/r/20201112111201.2081902-1-qperret@google.com
</content>
</entry>
<entry>
<title>Merge tag 'sched-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2020-11-15T17:39:35Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2020-11-15T17:39:35Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=d0a37fd57fbae32adffb56ae9852d551376b7c9b'/>
<id>urn:sha1:d0a37fd57fbae32adffb56ae9852d551376b7c9b</id>
<content type='text'>
Pull scheduler fixes from Thomas Gleixner:
 "A set of scheduler fixes:

   - Address a load balancer regression by making the load balancer use
     the same logic as the wakeup path to spread tasks in the LLC domain

   - Prefer the CPU on which a task ran last over the local CPU in the
     fast wakeup path for asymmetric CPU capacity systems to align with
     the symmetric case. This ensures more locality and prevents massive
     migration overhead on those asymmetric systems

   - Fix a memory corruption bug in the scheduler debug code caused by
     handing a modified buffer pointer to kfree()"

* tag 'sched-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/debug: Fix memory corruption caused by multiple small reads of flags
  sched/fair: Prefer prev cpu in asymmetric wakeup path
  sched/fair: Ensure tasks spreading in LLC during LB
</content>
</entry>
</feed>
