linux/kernel/rcu, branch v5.4

Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2019-09-17T00:25:49Z

Pull scheduler updates from Ingo Molnar:

 - MAINTAINERS: Add Mark Rutland as perf submaintainer, Juri Lelli and Vincent Guittot as scheduler submaintainers. Add Dietmar Eggemann, Steven Rostedt, Ben Segall and Mel Gorman as scheduler reviewers. As perf and the scheduler are getting bigger and more complex, document the status quo of current responsibilities and interests, and spread the review pain^H^H^H^H fun via an increase in the Cc: linecount generated by scripts/get_maintainer.pl. :-)
 - Add another series of patches that brings the -rt (PREEMPT_RT) tree closer to mainline: split the monolithic CONFIG_PREEMPT dependencies into a new CONFIG_PREEMPTION category that will allow the eventual introduction of CONFIG_PREEMPT_RT. Still a few more hundred patches to go though.
 - Extend the CPU cgroup controller with uclamp.min and uclamp.max to allow the finer shaping of CPU bandwidth usage.
 - Micro-optimize energy-aware wake-ups from O(CPUS^2) to O(CPUS).
 - Improve the behavior of high CPU count, high thread count applications running under cpu.cfs_quota_us constraints.
 - Improve balancing with SCHED_IDLE (SCHED_BATCH) tasks present.
 - Improve CPU isolation housekeeping CPU allocation NUMA locality.
 - Fix deadline scheduler bandwidth calculations and logic when cpusets rebuilds the topology, or when it gets deadline-throttled while it's being offlined.
 - Convert the cpuset_mutex to percpu_rwsem, to allow it to be used from setscheduler() system calls without creating global serialization. Add new synchronization between cpuset topology-changing events and the deadline acceptance tests in setscheduler(), which were broken before.
 - Rework the active_mm state machine to be less confusing and more optimal.
 - Rework (simplify) the pick_next_task() slowpath.
 - Improve load-balancing on AMD EPYC systems.
 - ... and misc cleanups, smaller fixes and improvements - please see the Git log for more details.

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
  sched/psi: Correct overly pessimistic size calculation
  sched/fair: Speed-up energy-aware wake-ups
  sched/uclamp: Always use 'enum uclamp_id' for clamp_id values
  sched/uclamp: Update CPU's refcount on TG's clamp changes
  sched/uclamp: Use TG's clamps to restrict TASK's clamps
  sched/uclamp: Propagate system defaults to the root group
  sched/uclamp: Propagate parent clamps
  sched/uclamp: Extend CPU's cgroup controller
  sched/topology: Improve load balancing on AMD EPYC systems
  arch, ia64: Make NUMA select SMP
  sched, perf: MAINTAINERS update, add submaintainers and reviewers
  sched/fair: Use rq_lock/unlock in online_fair_sched_group
  cpufreq: schedutil: fix equation in comment
  sched: Rework pick_next_task() slow-path
  sched: Allow put_prev_task() to drop rq->lock
  sched/fair: Expose newidle_balance()
  sched: Add task_struct pointer to sched_class::set_curr_task
  sched: Rework CPU hotplug task selection
  sched/{rt,deadline}: Fix set_next_task vs pick_next_task
  sched: Fix kerneldoc comment for ia64_set_curr_task
  ...

Merge branch 'sched/rt' into sched/core, to pick up -rt changes

2019-09-16T12:05:04Z

Pick up the first couple of patches working towards PREEMPT_RT.

Signed-off-by: Ingo Molnar

rcu: Allow rcu_do_batch() to dynamically adjust batch sizes

2019-08-13T21:38:24Z

Bimodal behavior of rcu_do_batch() is not really suited to Google applications like gfe servers.

When a process with millions of sockets exits, closing all files queues two RCU callbacks per socket. This eventually reaches the point where RCU enters an emergency mode, where rcu_do_batch() does not return until the whole queue is flushed.

Each RCU callback lasts at least 70 nsec, so with millions of elements, we easily spend more than 100 msec without rescheduling.

The goal of this patch is to avoid the infamous message like the following: "need_resched set for > 51999388 ns (52 ticks) without schedule"

We dynamically adjust the number of elements we process: instead of the 10 / INFINITE choices, we use a floor of ~1% of current entries. If the number is above 1000, we switch to a time-based limit of 3 msec per batch, adjustable with /sys/module/rcutree/parameters/rcu_resched_ns

Signed-off-by: Eric Dumazet
[ paulmck: Forward-port and remove debug statements. ]
Signed-off-by: Paul E. McKenney
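
The batch-size policy described in this message can be sketched in user-space C as follows. This is a minimal illustration built only from the numbers quoted above (a floor of about 1% of pending entries, a switch to a time limit above 1000, and a 3 msec budget); it is not the kernel's rcu_do_batch(). All names (plan_batch, struct batch_plan, the SKETCH_* constants) are hypothetical, and reading "the number" as the computed batch size is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, taken from the commit message rather than from kernel source. */
#define SKETCH_TIME_THRESHOLD 1000L      /* above this, bound the batch by time */
#define SKETCH_RESCHED_NS     3000000LL  /* 3 msec per batch */

/* Hypothetical result type: how the next batch should be bounded. */
struct batch_plan {
    long limit;           /* number of callbacks to process in this batch */
    bool use_time_limit;  /* true if the batch is also bounded by a deadline */
    int64_t deadline_ns;  /* only meaningful when use_time_limit is true */
};

/*
 * Pick a batch size that is the larger of the configured minimum and
 * about 1% of the pending callbacks. For large batches, also arm a deadline.
 */
static struct batch_plan plan_batch(long pending, long min_batch, int64_t now_ns)
{
    struct batch_plan plan;
    long one_percent;

    assert(pending >= 0 && min_batch > 0);

    one_percent = pending / 100;
    plan.limit = one_percent > min_batch ? one_percent : min_batch;
    plan.use_time_limit = plan.limit > SKETCH_TIME_THRESHOLD;
    plan.deadline_ns = plan.use_time_limit ? now_ns + SKETCH_RESCHED_NS : 0;
    return plan;
}
```

Under these assumptions, a queue of 500 callbacks with a minimum batch of 10 keeps the small batch, while a queue of 2,000,000 callbacks yields a batch of 20,000 with the 3 msec deadline armed, so the batch size scales with the backlog instead of jumping from 10 to unbounded.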

rcu/nocb: Don't wake no-CBs GP kthread if timer posted under overload

2019-08-13T21:38:24Z

When under overload conditions, __call_rcu_nocb_wake() will wake the no-CBs GP kthread any time the no-CBs CB kthread is asleep or there are no ready-to-invoke callbacks, but only after a timer delay. If the no-CBs GP kthread has a ->nocb_bypass_timer pending, the deferred wakeup from __call_rcu_nocb_wake() is redundant. This commit therefore makes __call_rcu_nocb_wake() avoid posting the redundant deferred wakeup if ->nocb_bypass_timer is pending. This requires adding a bit of ordering of timer actions.

Signed-off-by: Paul E. McKenney
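
The decision described in this message can be written down as a small pure function. This is only a sketch of the stated rule, with the kernel state reduced to three booleans; the enum, the function name, and the parameters are hypothetical and do not appear in the kernel, and the timer ordering the message mentions is not modeled.

```c
#include <stdbool.h>

/* Hypothetical outcome of the overload-path wakeup decision. */
enum sketch_wake {
    SKETCH_WAKE_NONE,      /* do nothing */
    SKETCH_WAKE_DEFERRED   /* post a timer-based deferred wakeup of the GP kthread */
};

/*
 * Under overload: a wakeup is wanted when the CB kthread is asleep or no
 * callbacks are ready to invoke. It is skipped when a bypass timer is
 * already pending, because that timer will wake the GP kthread anyway.
 */
static enum sketch_wake sketch_overload_wake(bool cb_kthread_asleep,
                                             bool have_ready_cbs,
                                             bool bypass_timer_pending)
{
    bool wakeup_wanted = cb_kthread_asleep || !have_ready_cbs;

    if (!wakeup_wanted)
        return SKETCH_WAKE_NONE;
    if (bypass_timer_pending)
        return SKETCH_WAKE_NONE;  /* the redundant case this commit removes */
    return SKETCH_WAKE_DEFERRED;
}
```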

rcu/nocb: Reduce __call_rcu_nocb_wake() leaf rcu_node ->lock contention

2019-08-13T21:38:24Z

Currently, __call_rcu_nocb_wake() advances callbacks each time that it detects excessive numbers of callbacks, though only if it succeeds in conditionally acquiring its leaf rcu_node structure's ->lock. Despite the conditional acquisition of ->lock, this does increase contention. This commit therefore avoids advancing callbacks unless there are callbacks in ->cblist whose grace period has completed and advancing has not yet been done during this jiffy.

Note that this decision does not take the presence of new callbacks into account. That is because on this code path, there will always be at least one new callback, namely the one we just enqueued.

Signed-off-by: Paul E. McKenney
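
The two-part condition above (a completed grace period for some queued callback, and no advance yet during this jiffy) can be sketched as a guard that runs before any lock is touched. This is a simplified user-space model, not the kernel code: struct sketch_nocb_state, its fields, and both functions are hypothetical names, and the grace-period comparison is a generic wrap-tolerant check assumed for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU state, reduced to what the guard needs. */
struct sketch_nocb_state {
    bool have_gp_waiters;          /* some callbacks wait on a numbered grace period */
    unsigned long waiting_gp;      /* earliest grace period those callbacks wait for */
    unsigned long last_adv_jiffy;  /* jiffy at which callbacks were last advanced */
};

/* Wrap-tolerant "completed >= wanted" for unsigned sequence numbers. */
static bool sketch_gp_done(unsigned long completed, unsigned long wanted)
{
    return completed - wanted <= ULONG_MAX / 2;
}

/*
 * Return true only if advancing is worthwhile: not already done in this
 * jiffy, and at least one queued callback's grace period has completed.
 * On success, record the jiffy so later calls in the same jiffy are skipped.
 */
static bool sketch_should_advance(struct sketch_nocb_state *s,
                                  unsigned long completed_gp,
                                  unsigned long now_jiffy)
{
    assert(s != NULL);

    if (now_jiffy == s->last_adv_jiffy)
        return false;
    if (!s->have_gp_waiters || !sketch_gp_done(completed_gp, s->waiting_gp))
        return false;
    s->last_adv_jiffy = now_jiffy;
    return true;
}
```

Because the guard consults only local state, the caller attempts the conditional ->lock acquisition at most once per jiffy and only when advancing can make callbacks ready, which is the source of the reduced contention the message describes.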

rcu/nocb: Reduce nocb_cb_wait() leaf rcu_node ->lock contention

2019-08-13T21:38:24Z

Currently, nocb_cb_wait() advances callbacks on each pass through its loop, though only if it succeeds in conditionally acquiring its leaf rcu_node structure's ->lock. Despite the conditional acquisition of ->lock, this does increase contention. This commit therefore avoids advancing callbacks unless there are callbacks in ->cblist whose grace period has completed.

Note that nocb_cb_wait() doesn't worry about callbacks that have not yet been assigned a grace period. The idea is that the only reason for nocb_cb_wait() to advance callbacks is to allow it to continue invoking callbacks. Time will tell whether this is the correct choice.

Signed-off-by: Paul E. McKenney

rcu/nocb: Advance CBs after merge in rcutree_migrate_callbacks()

2019-08-13T21:38:24Z

The rcutree_migrate_callbacks() function invokes rcu_advance_cbs() on both the offlined CPU's ->cblist and that of the surviving CPU, then merges them. However, after the merge, any of the offlined CPU's callbacks that were not ready to be invoked will no longer be associated with a grace-period number. This commit therefore invokes rcu_advance_cbs() one more time on the merged ->cblist in order to assign a grace-period number to these callbacks.

Signed-off-by: Paul E. McKenney
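
The sequence in this message (advance both lists, merge, then advance once more) can be illustrated with a toy model. This sketch flattens a callback list into an array of grace-period numbers, where 0 means "no number assigned"; the real segmented callback list is more involved, sequence-number wraparound is ignored, and every name here (struct sketch_cblist, sketch_advance, sketch_merge, sketch_migrate) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_CBS 16
#define SKETCH_NO_GP   0UL  /* callback not associated with a grace period */

/* Toy callback list: one grace-period number per queued callback. */
struct sketch_cblist {
    unsigned long gp[SKETCH_MAX_CBS];
    size_t len;
};

/* Assign next_gp to every callback that has no grace-period number yet. */
static void sketch_advance(struct sketch_cblist *l, unsigned long next_gp)
{
    for (size_t i = 0; i < l->len; i++)
        if (l->gp[i] == SKETCH_NO_GP)
            l->gp[i] = next_gp;
}

/*
 * Move every callback from src to dst. As the message describes, callbacks
 * that were not yet ready to invoke lose their grace-period number.
 */
static void sketch_merge(struct sketch_cblist *dst, struct sketch_cblist *src,
                         unsigned long completed_gp)
{
    assert(dst->len + src->len <= SKETCH_MAX_CBS);
    for (size_t i = 0; i < src->len; i++) {
        unsigned long gp = src->gp[i];
        bool ready = gp != SKETCH_NO_GP && gp <= completed_gp;

        dst->gp[dst->len++] = ready ? gp : SKETCH_NO_GP;
    }
    src->len = 0;
}

/* Count callbacks that currently have no grace-period number. */
static size_t sketch_count_unassigned(const struct sketch_cblist *l)
{
    size_t n = 0;

    for (size_t i = 0; i < l->len; i++)
        if (l->gp[i] == SKETCH_NO_GP)
            n++;
    return n;
}

/* Full sequence, including the final advance that this commit adds. */
static void sketch_migrate(struct sketch_cblist *dst, struct sketch_cblist *src,
                           unsigned long completed_gp, unsigned long next_gp)
{
    sketch_advance(src, next_gp);
    sketch_advance(dst, next_gp);
    sketch_merge(dst, src, completed_gp);
    sketch_advance(dst, next_gp);  /* re-number callbacks stripped by the merge */
}
```

In this model, stopping after sketch_merge() leaves the migrated not-ready callbacks with no number, whereas sketch_migrate() ends with every callback numbered.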

rcu/nocb: Avoid synchronous wakeup in __call_rcu_nocb_wake()

2019-08-13T21:38:24Z

When callbacks are in full flow, the common case is waiting for a grace period, and this grace period will normally take a few jiffies to complete. It therefore isn't all that helpful for __call_rcu_nocb_wake() to do a synchronous wakeup in this case. This commit therefore turns this into a timer-based deferred wakeup of the no-CBs grace-period kthread.

Signed-off-by: Paul E. McKenney

rcu/nocb: Print no-CBs diagnostics when rcutorture writer unduly delayed

2019-08-13T21:38:24Z

This commit causes locking, sleeping, and callback state to be printed for no-CBs CPUs when the rcutorture writer is delayed sufficiently for rcutorture to complain.

Signed-off-by: Paul E. McKenney

rcu/nocb: EXP Check use and usefulness of ->nocb_lock_contended

2019-08-13T21:38:24Z

Signed-off-by: Paul E. McKenney