From 536bd00cdbb7b908573e5a93bae67b64cbae60d8 Mon Sep 17 00:00:00 2001
From: "Rafael J. Wysocki"
Date: Fri, 6 May 2016 14:58:43 +0200
Subject: sched/fair: Fix !CONFIG_SMP kernel cpufreq governor breakage

The following commit:

  34e2c555f3e1 ("cpufreq: Add mechanism for registering utilization update callbacks")

overlooked the fact that update_load_avg(), where CFS invokes cpufreq
utilization update callbacks, becomes an empty stub on UP kernels.

In consequence, if !CONFIG_SMP, cpufreq governors are never invoked from
CFS and they do not have a chance to evaluate CPU performance levels and
update them often enough.

Needless to say, things don't work as expected then.

Fix the problem by making the !CONFIG_SMP stub of update_load_avg()
invoke cpufreq update callbacks too.

Reported-by: Steve Muckle
Tested-by: Steve Muckle
Signed-off-by: Rafael J. Wysocki
Acked-by: Steve Muckle
Cc: Linus Torvalds
Cc: Linux PM list
Cc: Peter Zijlstra
Cc: Srinivas Pandruvada
Cc: Thomas Gleixner
Cc: Viresh Kumar
Fixes: 34e2c555f3e1 (cpufreq: Add mechanism for registering utilization update callbacks)
Link: http://lkml.kernel.org/r/6282396.VVEdgVYxO3@vostro.rjw.lan
Signed-off-by: Ingo Molnar
---
 kernel/sched/fair.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)
(limited to 'kernel')

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fe30e66aff1..40748dc8ea3e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3030,7 +3030,14 @@ static int idle_balance(struct rq *this_rq);
 
 #else /* CONFIG_SMP */
 
-static inline void update_load_avg(struct sched_entity *se, int update_tg) {}
+static inline void update_load_avg(struct sched_entity *se, int not_used)
+{
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	struct rq *rq = rq_of(cfs_rq);
+
+	cpufreq_trigger_update(rq_clock(rq));
+}
+
 static inline void
 enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
 static inline void
--
cgit v1.2.3
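For background on the mechanism this fix plugs into, the utilization-update
hook added by 34e2c555f3e1 works roughly as sketched below. This is a
simplified, illustrative reconstruction, not the exact upstream code; the
structure layout and helper bodies are approximations.

    /*
     * Simplified sketch of the cpufreq utilization update hook.  A governor
     * registers a per-CPU callback; the scheduler invokes it from paths such
     * as update_load_avg(), which is why the !CONFIG_SMP stub above has to
     * call cpufreq_trigger_update() as well.
     */
    struct update_util_data {
            void (*func)(struct update_util_data *data, u64 time,
                         unsigned long util, unsigned long max);
    };

    static DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);

    static inline void cpufreq_update_util(u64 time, unsigned long util,
                                           unsigned long max)
    {
            /* The real code dereferences this pointer under RCU. */
            struct update_util_data *data = per_cpu(cpufreq_update_util_data,
                                                    smp_processor_id());

            if (data && data->func)
                    data->func(data, time, util, max);
    }

    static inline void cpufreq_trigger_update(u64 time)
    {
            /* No utilization numbers available; governors treat this as "go to max". */
            cpufreq_update_util(time, ULONG_MAX, 0);
    }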
From 4f41fc59620fcedaa97cbdf3d7d2956d80fcd922 Mon Sep 17 00:00:00 2001
From: "Serge E. Hallyn"
Date: Mon, 9 May 2016 09:59:55 -0500
Subject: cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces

Patch summary: When showing a cgroupfs entry in mountinfo, show the path
of the mount root dentry relative to the reader's cgroup namespace root.

Short explanation (courtesy of mkerrisk): If we create a new cgroup
namespace, then we want both /proc/self/cgroup and /proc/self/mountinfo
to show cgroup paths that are correctly virtualized with respect to the
cgroup mount point. Previous to this patch, /proc/self/cgroup shows the
right info, but /proc/self/mountinfo does not.

Long version: When a uid 0 task which is in freezer cgroup /a/b unshares
a new cgroup namespace and then mounts a new instance of the freezer
cgroup, the new mount will be rooted at /a/b. The root dentry field of
the mountinfo entry will show '/a/b'.

  cat > /tmp/do1 << EOF
  mount -t cgroup -o freezer freezer /mnt
  grep freezer /proc/self/mountinfo
  EOF

  unshare -Gm bash /tmp/do1
  > 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
  > 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer

The task's freezer cgroup entry in /proc/self/cgroup will simply show '/':

  grep freezer /proc/self/cgroup
  9:freezer:/

If instead the same task simply bind mounts the /a/b cgroup directory,
the resulting mountinfo entry will again show /a/b for the dentry root.
However in this case the task will find its own cgroup at /mnt/a/b,
not at /mnt:

  mount --bind /sys/fs/cgroup/freezer/a/b /mnt
  130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer

In other words, there is no way for the task to know, based on what is
in mountinfo, which cgroup directory is its own.

Example (by mkerrisk):

First, a little script to save some typing and verbiage:

  echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)"
  cat /proc/self/mountinfo | grep freezer |
          awk '{print "\tmountinfo:\t\t" $4 "\t" $5}'

Create cgroup, place this shell into the cgroup, and look at the state
of the /proc files:

  2653 2653                     # Our shell
  14254                         # cat(1)
        /proc/self/cgroup:      10:freezer:/a/b
        mountinfo:              /       /sys/fs/cgroup/freezer

Create a shell in new cgroup and mount namespaces. The act of creating
a new cgroup namespace causes the process's current cgroup directories
to become its cgroup root directories. (Here, I'm using my own version
of the "unshare" utility, which takes the same options as the util-linux
version.)

Look at the state of the /proc files:

        /proc/self/cgroup:      10:freezer:/
        mountinfo:              /       /sys/fs/cgroup/freezer

The third entry in /proc/self/cgroup (the pathname of the cgroup inside
the hierarchy) is correctly virtualized w.r.t. the cgroup namespace,
which is rooted at /a/b in the outer namespace.

However, the info in /proc/self/mountinfo is not for this cgroup
namespace, since we are seeing a duplicate of the mount from the old
mount namespace, and the info there does not correspond to the new
cgroup namespace.

However, trying to create a new mount still doesn't show us the right
information in mountinfo:

  # propagating to other mountns
        /proc/self/cgroup:      7:freezer:/
        mountinfo:              /a/b    /mnt/freezer

The act of creating a new cgroup namespace caused the process's current
freezer directory, "/a/b", to become its cgroup freezer root directory.
In other words, the pathname of the directory within the newly mounted
cgroup filesystem should be "/", but mountinfo wrongly shows us "/a/b".
The consequence of this is that the process in the cgroup namespace
cannot correctly construct the pathname of its cgroup root directory
from the information in /proc/PID/mountinfo.

With this patch, the dentry root field in mountinfo is shown relative
to the reader's cgroup namespace. So the same steps as above:

        /proc/self/cgroup:      10:freezer:/a/b
        mountinfo:              /       /sys/fs/cgroup/freezer

        /proc/self/cgroup:      10:freezer:/
        mountinfo:              /../..  /sys/fs/cgroup/freezer

        /proc/self/cgroup:      10:freezer:/
        mountinfo:              /       /mnt/freezer

  cgroup.clone_children  freezer.parent_freezing  freezer.state  tasks
  cgroup.procs           freezer.self_freezing    notify_on_release

  3164 2653                     # First shell that placed in this cgroup
  3164                          # Shell started by 'unshare'
  14197                         # cat(1)

Signed-off-by: Serge Hallyn
Tested-by: Michael Kerrisk
Acked-by: Michael Kerrisk
Signed-off-by: Tejun Heo
---
 fs/kernfs/mount.c      | 14 +++++++++++
 include/linux/kernfs.h |  2 ++
 kernel/cgroup.c        | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 79 insertions(+)
(limited to 'kernel')

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f73541fbe7af..3b78724c1979 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "kernfs-internal.h"
 
@@ -40,6 +41,18 @@ static int kernfs_sop_show_options(struct seq_file *sf, struct dentry *dentry)
 	return 0;
 }
 
+static int kernfs_sop_show_path(struct seq_file *sf, struct dentry *dentry)
+{
+	struct kernfs_node *node = dentry->d_fsdata;
+	struct kernfs_root *root = kernfs_root(node);
+	struct kernfs_syscall_ops *scops = root->syscall_ops;
+
+	if (scops && scops->show_path)
+		return scops->show_path(sf, node, root);
+
+	return seq_dentry(sf, dentry, " \t\n\\");
+}
+
 const struct super_operations kernfs_sops = {
 	.statfs		= simple_statfs,
 	.drop_inode	= generic_delete_inode,
@@ -47,6 +60,7 @@ const struct super_operations kernfs_sops = {
 	.remount_fs	= kernfs_sop_remount_fs,
 	.show_options	= kernfs_sop_show_options,
+	.show_path	= kernfs_sop_show_path,
 };
 
 /**

diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index c06c44242f39..30f089ebe0a4 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -152,6 +152,8 @@ struct kernfs_syscall_ops {
 	int (*rmdir)(struct kernfs_node *kn);
 	int (*rename)(struct kernfs_node *kn, struct kernfs_node *new_parent,
 		      const char *new_name);
+	int (*show_path)(struct seq_file *sf, struct kernfs_node *kn,
+			 struct kernfs_root *root);
 };
 
 struct kernfs_root {

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 909a7d31ffd3..afea39eb7649 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1215,6 +1215,41 @@ static void cgroup_destroy_root(struct cgroup_root *root)
 	cgroup_free_root(root);
 }
 
+/*
+ * look up cgroup associated with current task's cgroup namespace on the
+ * specified hierarchy
+ */
+static struct cgroup *
+current_cgns_cgroup_from_root(struct cgroup_root *root)
+{
+	struct cgroup *res = NULL;
+	struct css_set *cset;
+
+	lockdep_assert_held(&css_set_lock);
+
+	rcu_read_lock();
+
+	cset = current->nsproxy->cgroup_ns->root_cset;
+	if (cset == &init_css_set) {
+		res = &root->cgrp;
+	} else {
+		struct cgrp_cset_link *link;
+
+		list_for_each_entry(link, &cset->cgrp_links, cgrp_link) {
+			struct cgroup *c = link->cgrp;
+
+			if (c->root == root) {
+				res = c;
+				break;
+			}
+		}
+	}
+	rcu_read_unlock();
+
+	BUG_ON(!res);
+	return res;
+}
+
 /* look up cgroup associated with given css_set on the specified hierarchy */
 static struct cgroup *cset_cgroup_from_root(struct css_set *cset,
 					    struct cgroup_root *root)
@@ -1593,6 +1628,33 @@ static int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
 	return 0;
 }
 
+static int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,
+			    struct kernfs_root *kf_root)
+{
+	int len = 0, ret = 0;
+	char *buf = NULL;
+	struct cgroup_root *kf_cgroot = cgroup_root_from_kf(kf_root);
+	struct cgroup *ns_cgroup;
+
+	buf = kmalloc(PATH_MAX, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	spin_lock_bh(&css_set_lock);
+	ns_cgroup = current_cgns_cgroup_from_root(kf_cgroot);
+	len = kernfs_path_from_node(kf_node, ns_cgroup->kn, buf, PATH_MAX);
+	spin_unlock_bh(&css_set_lock);
+
+	if (len >= PATH_MAX)
+		len = -ERANGE;
+	else if (len > 0) {
+		seq_escape(sf, buf, " \t\n\\");
+		len = 0;
+	}
+	kfree(buf);
+	return len;
+}
+
 static int cgroup_show_options(struct seq_file *seq,
 			       struct kernfs_root *kf_root)
 {
@@ -5433,6 +5495,7 @@ static struct kernfs_syscall_ops cgroup_kf_syscall_ops = {
 	.mkdir			= cgroup_mkdir,
 	.rmdir			= cgroup_rmdir,
 	.rename			= cgroup_rename,
+	.show_path		= cgroup_show_path,
 };
 
 static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
--
cgit v1.2.3

From 0161028b7c8aebef64194d3d73e43bc3b53b5c66 Mon Sep 17 00:00:00 2001
From: Andy Lutomirski
Date: Mon, 9 May 2016 15:48:51 -0700
Subject: perf/core: Change the default paranoia level to 2

Allowing unprivileged kernel profiling lets any user follow kernel
control flow and dump kernel registers. This most likely allows trivial
kASLR bypassing, and it may allow other mischief as well. (Off the top
of my head, the PERF_SAMPLE_REGS_INTR output during /dev/urandom reads
could be quite interesting.)

Signed-off-by: Andy Lutomirski
Acked-by: Kees Cook
Signed-off-by: Linus Torvalds
---
 Documentation/sysctl/kernel.txt | 2 +-
 kernel/events/core.c            | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
(limited to 'kernel')

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 57653a44b128..fcddfd5ded99 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -645,7 +645,7 @@ allowed to execute.
 perf_event_paranoid:
 
 Controls use of the performance events system by unprivileged
-users (without CAP_SYS_ADMIN).  The default value is 1.
+users (without CAP_SYS_ADMIN).  The default value is 2.
 
  -1: Allow use of (almost) all events by all users
 >=0: Disallow raw tracepoint access by users without CAP_IOC_LOCK

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4e2ebf6f2f1f..c0ded2416615 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -351,7 +351,7 @@ static struct srcu_struct pmus_srcu;
  * 1 - disallow cpu events for unpriv
  * 2 - disallow kernel profiling for unpriv
  */
-int sysctl_perf_event_paranoid __read_mostly = 1;
+int sysctl_perf_event_paranoid __read_mostly = 2;
 
 /* Minimum for 512 kiB + 1 user control page */
 int sysctl_perf_event_mlock __read_mostly = 512 + (PAGE_SIZE / 1024); /* 'free' kiB per user */
--
cgit v1.2.3
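A minimal user-space check of the new default (illustrative only, not part
of the patch): with perf_event_paranoid >= 2, an unprivileged caller asking
for kernel-level profiling (exclude_kernel = 0) is refused with EACCES,
whereas the old default of 1 would have allowed it.

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdio.h>

    int main(void)
    {
            struct perf_event_attr attr;
            int fd;

            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);
            attr.type = PERF_TYPE_HARDWARE;
            attr.config = PERF_COUNT_HW_CPU_CYCLES;
            attr.exclude_kernel = 0;        /* ask to profile the kernel too */

            /* Count for the current task on any CPU. */
            fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
            if (fd < 0)
                    perror("perf_event_open");  /* EACCES expected unprivileged at paranoid >= 2 */
            else
                    printf("kernel profiling allowed (fd=%d)\n", fd);
            return 0;
    }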
From 13b5ab02ae118fc8dfdc2b8597688ec4a11d5b53 Mon Sep 17 00:00:00 2001
From: Xunlei Pang
Date: Mon, 9 May 2016 12:11:31 +0800
Subject: sched/rt, sched/dl: Don't push if task's scheduling class was changed

We got this warning:

    WARNING: CPU: 1 PID: 2468 at kernel/sched/core.c:1161 set_task_cpu+0x1af/0x1c0
    [...]
    Call Trace:
     dump_stack+0x63/0x87
     __warn+0xd1/0xf0
     warn_slowpath_null+0x1d/0x20
     set_task_cpu+0x1af/0x1c0
     push_dl_task.part.34+0xea/0x180
     push_dl_tasks+0x17/0x30
     __balance_callback+0x45/0x5c
     __sched_setscheduler+0x906/0xb90
     SyS_sched_setattr+0x150/0x190
     do_syscall_64+0x62/0x110
     entry_SYSCALL64_slow_path+0x25/0x25

This corresponds to:

    WARN_ON_ONCE(p->state == TASK_RUNNING &&
                 p->sched_class == &fair_sched_class &&
                 (p->on_rq && !task_on_rq_migrating(p)))

It happens because in find_lock_later_rq(), the task whose scheduling
class was changed to fair class is still pushed away as if it were
a deadline task ...

So, check in find_lock_later_rq() after double_lock_balance(), if the
scheduling class of the deadline task was changed, break and retry.

Apply the same logic to RT tasks.

Signed-off-by: Xunlei Pang
Reviewed-by: Steven Rostedt
Acked-by: Peter Zijlstra
Cc: Peter Zijlstra
Cc: Steven Rostedt
Cc: Juri Lelli
Link: http://lkml.kernel.org/r/1462767091-1215-1-git-send-email-xlpang@redhat.com
Signed-off-by: Ingo Molnar
---
 kernel/sched/deadline.c | 1 +
 kernel/sched/rt.c       | 1 +
 2 files changed, 2 insertions(+)
(limited to 'kernel')

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index affd97ec9f65..686ec8adf952 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1394,6 +1394,7 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
 		     !cpumask_test_cpu(later_rq->cpu, &task->cpus_allowed) ||
 		     task_running(rq, task) ||
+		     !dl_task(task) ||
 		     !task_on_rq_queued(task))) {
 			double_unlock_balance(rq, later_rq);
 			later_rq = NULL;

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index c41ea7ac1764..ec4f538d4396 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1729,6 +1729,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
 		     !cpumask_test_cpu(lowest_rq->cpu, tsk_cpus_allowed(task)) ||
 		     task_running(rq, task) ||
+		     !rt_task(task) ||
 		     !task_on_rq_queued(task))) {
 			double_unlock_balance(rq, lowest_rq);
--
cgit v1.2.3

From 53d3bc773eaa7ab1cf63585e76af7ee869d5e709 Mon Sep 17 00:00:00 2001
From: Ingo Molnar
Date: Wed, 11 May 2016 08:25:53 +0200
Subject: Revert "sched/fair: Fix fairness issue on migration"

Mike reported that this recent commit:

  3a47d5124a95 ("sched/fair: Fix fairness issue on migration")

... broke interactivity and the signal starvation test.

We have a proper fix series in the works but ran out of time for v4.6,
so revert the commit.

Reported-by: Mike Galbraith
Acked-by: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar
---
 kernel/sched/fair.c | 20 ++++++--------------
 1 file changed, 6 insertions(+), 14 deletions(-)
(limited to 'kernel')

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 40748dc8ea3e..e7dd0ec169be 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3188,25 +3188,17 @@ static inline void check_schedstat_required(void)
 static void
 enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
-	bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING);
-	bool curr = cfs_rq->curr == se;
-
 	/*
-	 * If we're the current task, we must renormalise before calling
-	 * update_curr().
+	 * Update the normalized vruntime before updating min_vruntime
+	 * through calling update_curr().
 	 */
-	if (renorm && curr)
+	if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING))
 		se->vruntime += cfs_rq->min_vruntime;
 
-	update_curr(cfs_rq);
-
 	/*
-	 * Otherwise, renormalise after, such that we're placed at the current
-	 * moment in time, instead of some random moment in the past.
+	 * Update run-time statistics of the 'current'.
 	 */
-	if (renorm && !curr)
-		se->vruntime += cfs_rq->min_vruntime;
-
+	update_curr(cfs_rq);
 	enqueue_entity_load_avg(cfs_rq, se);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
@@ -3222,7 +3214,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		update_stats_enqueue(cfs_rq, se);
 		check_spread(cfs_rq, se);
 	}
-	if (!curr)
+	if (se != cfs_rq->curr)
 		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
--
cgit v1.2.3

From 9f448cd3cbcec8995935e60b27802ae56aac8cc0 Mon Sep 17 00:00:00 2001
From: Alexander Shishkin
Date: Tue, 10 May 2016 16:18:33 +0300
Subject: perf/core: Disable the event on a truncated AUX record

When the PMU driver reports a truncated AUX record, it effectively means
that there is no more usable room in the event's AUX buffer (even though
there may still be some room, so that perf_aux_output_begin() doesn't take
action). At this point the consumer still has to be woken up and the event
has to be disabled, otherwise the event will just keep spinning between
perf_aux_output_begin() and perf_aux_output_end() until its context gets
unscheduled.

Again, for cpu-wide events this means never, so once in this condition,
they will be forever losing data.

Fix this by disabling the event and waking up the consumer in case of a
truncated AUX record.

Reported-by: Markus Metzger
Signed-off-by: Alexander Shishkin
Signed-off-by: Peter Zijlstra (Intel)
Cc:
Cc: Arnaldo Carvalho de Melo
Cc: Arnaldo Carvalho de Melo
Cc: Borislav Petkov
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Vince Weaver
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/1462886313-13660-3-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar
---
 kernel/events/ring_buffer.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)
(limited to 'kernel')

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index c61f0cbd308b..7611d0f66cf8 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -347,6 +347,7 @@ void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
 			 bool truncated)
 {
 	struct ring_buffer *rb = handle->rb;
+	bool wakeup = truncated;
 	unsigned long aux_head;
 	u64 flags = 0;
@@ -375,9 +376,16 @@ void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
 	aux_head = rb->user_page->aux_head = local_read(&rb->aux_head);
 
 	if (aux_head - local_read(&rb->aux_wakeup) >= rb->aux_watermark) {
-		perf_output_wakeup(handle);
+		wakeup = true;
 		local_add(rb->aux_watermark, &rb->aux_wakeup);
 	}
+
+	if (wakeup) {
+		if (truncated)
+			handle->event->pending_disable = 1;
+		perf_output_wakeup(handle);
+	}
+
 	handle->event = NULL;
 
 	local_set(&rb->aux_nest, 0);
--
cgit v1.2.3
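For context, the calling pattern a PMU driver follows around these
interfaces looks roughly like the sketch below. The driver function, its
parameters, and the copy step are hypothetical; only perf_aux_output_begin()
and perf_aux_output_end() are the real interfaces whose behaviour the patch
changes.

    /* Hypothetical PMU-driver flush path -- an illustrative sketch only. */
    static void example_pmu_flush(struct perf_output_handle *handle,
                                  struct perf_event *event,
                                  unsigned long bytes_ready)
    {
            void *base;
            unsigned long copied;

            base = perf_aux_output_begin(handle, event);
            if (!base)
                    return;                 /* no usable AUX space at all */

            /* Copy at most handle->size bytes of hardware trace data to 'base'
             * (the actual copy from the hardware buffer is omitted here). */
            copied = min(bytes_ready, handle->size);

            /*
             * Report truncation when the hardware produced more than fits.
             * With the fix above this wakes the consumer and disables the
             * event, instead of letting the driver spin between
             * perf_aux_output_begin() and perf_aux_output_end().
             */
            perf_aux_output_end(handle, copied, bytes_ready > copied);
    }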
From 09be4c824ebdbf3c043a07d2d9537a0164a1ecfe Mon Sep 17 00:00:00 2001
From: Felipe Balbi
Date: Thu, 12 May 2016 12:34:38 +0300
Subject: cgroup: fix compile warning
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

commit 4f41fc59620f ("cgroup, kernfs: make mountinfo show properly
scoped path for cgroup namespaces") added the following compile warning:

  kernel/cgroup.c: In function ‘cgroup_show_path’:
  kernel/cgroup.c:1634:15: warning: unused variable ‘ret’ [-Wunused-variable]
    int len = 0, ret = 0;
                 ^

fix it.

Fixes: 4f41fc59620f ("cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces")
Signed-off-by: Felipe Balbi
Signed-off-by: Tejun Heo
---
 kernel/cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
(limited to 'kernel')

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index afea39eb7649..86cb5c6e8932 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1631,7 +1631,7 @@ static int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
 static int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,
 			    struct kernfs_root *kf_root)
 {
-	int len = 0, ret = 0;
+	int len = 0;
 	char *buf = NULL;
 	struct cgroup_root *kf_cgroot = cgroup_root_from_kf(kf_root);
 	struct cgroup *ns_cgroup;
--
cgit v1.2.3

From f7c17d26f43d5cc1b7a6b896cd2fa24a079739b9 Mon Sep 17 00:00:00 2001
From: Wanpeng Li
Date: Wed, 11 May 2016 17:55:18 +0800
Subject: workqueue: fix rebind bound workers warning
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

------------[ cut here ]------------
WARNING: CPU: 0 PID: 16 at kernel/workqueue.c:4559 rebind_workers+0x1c0/0x1d0
Modules linked in:
CPU: 0 PID: 16 Comm: cpuhp/0 Not tainted 4.6.0-rc4+ #31
Hardware name: IBM IBM System x3550 M4 Server -[7914IUW]-/00Y8603, BIOS -[D7E128FUS-1.40]- 07/23/2013
 0000000000000000 ffff881037babb58 ffffffff8139d885 0000000000000010
 0000000000000000 0000000000000000 0000000000000000 ffff881037babba8
 ffffffff8108505d ffff881037ba0000 000011cf3e7d6e60 0000000000000046
Call Trace:
 dump_stack+0x89/0xd4
 __warn+0xfd/0x120
 warn_slowpath_null+0x1d/0x20
 rebind_workers+0x1c0/0x1d0
 workqueue_cpu_up_callback+0xf5/0x1d0
 notifier_call_chain+0x64/0x90
 ? trace_hardirqs_on_caller+0xf2/0x220
 ? notify_prepare+0x80/0x80
 __raw_notifier_call_chain+0xe/0x10
 __cpu_notify+0x35/0x50
 notify_down_prepare+0x5e/0x80
 ? notify_prepare+0x80/0x80
 cpuhp_invoke_callback+0x73/0x330
 ? __schedule+0x33e/0x8a0
 cpuhp_down_callbacks+0x51/0xc0
 cpuhp_thread_fun+0xc1/0xf0
 smpboot_thread_fn+0x159/0x2a0
 ? smpboot_create_threads+0x80/0x80
 kthread+0xef/0x110
 ? wait_for_completion+0xf0/0x120
 ? schedule_tail+0x35/0xf0
 ret_from_fork+0x22/0x50
 ? __init_kthread_worker+0x70/0x70
---[ end trace eb12ae47d2382d8f ]---
notify_down_prepare: attempt to take down CPU 0 failed

This bug can be reproduced with the config below, with nohz_full=
covering all cpus:

CONFIG_BOOTPARAM_HOTPLUG_CPU0=y
CONFIG_DEBUG_HOTPLUG_CPU0=y
CONFIG_NO_HZ_FULL=y

As Thomas pointed out:

| If a down prepare callback fails, then DOWN_FAILED is invoked for all
| callbacks which have successfully executed DOWN_PREPARE.
|
| But, workqueue has actually two notifiers. One which handles
| UP/DOWN_FAILED/ONLINE and one which handles DOWN_PREPARE.
|
| Now look at the priorities of those callbacks:
|
| CPU_PRI_WORKQUEUE_UP        = 5
| CPU_PRI_WORKQUEUE_DOWN      = -5
|
| So the call order on DOWN_PREPARE is:
|
| CB 1
| CB ...
| CB workqueue_up() -> Ignores DOWN_PREPARE
| CB ...
| CB X ---> Fails
|
| So we call up to CB X with DOWN_FAILED
|
| CB 1
| CB ...
| CB workqueue_up() -> Handles DOWN_FAILED
| CB ...
| CB X-1
|
| So the problem is that the workqueue stuff handles DOWN_FAILED in the up
| callback, while it should do it in the down callback. Which is not a good idea
| either because it wants to be called early on rollback...
|
| Brilliant stuff, isn't it? The hotplug rework will solve this problem because
| the callbacks become symmetric, but for the existing mess, we need some
| workaround in the workqueue code.
The boot CPU handles housekeeping duty (unbound timers, workqueues,
timekeeping, ...) on behalf of full dynticks CPUs. It must remain online
when nohz full is enabled. There is a priority assigned to every
notifier_block:

  workqueue_cpu_up > tick_nohz_cpu_down > workqueue_cpu_down

So the tick_nohz_cpu_down callback fails while CPU 0 is being
down-prepared, the notifier_blocks behind tick_nohz_cpu_down are not
called any more, and the workers are therefore never actually unbound.
The hotplug state machine then falls back to undo and onlines CPU 0
again. Workers are rebound unconditionally even though they were never
unbound, which triggers the warning in the process.

This patch fixes it by catching !DISASSOCIATED to avoid rebinding
workers that are still bound.

Cc: Tejun Heo
Cc: Lai Jiangshan
Cc: Thomas Gleixner
Cc: Peter Zijlstra
Cc: Frédéric Weisbecker
Cc: stable@vger.kernel.org
Suggested-by: Lai Jiangshan
Signed-off-by: Wanpeng Li
---
 kernel/workqueue.c | 11 +++++++++++
 1 file changed, 11 insertions(+)
(limited to 'kernel')

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 801a698564d4..1b2e36b2f31f 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4555,6 +4555,17 @@ static void rebind_workers(struct worker_pool *pool)
 						  pool->attrs->cpumask) < 0);
 
 	spin_lock_irq(&pool->lock);
+
+	/*
+	 * XXX: CPU hotplug notifiers are weird and can call DOWN_FAILED
+	 * w/o preceding DOWN_PREPARE.  Work around it.  CPU hotplug is
+	 * being reworked and this can go away in time.
+	 */
+	if (!(pool->flags & POOL_DISASSOCIATED)) {
+		spin_unlock_irq(&pool->lock);
+		return;
+	}
+
 	pool->flags &= ~POOL_DISASSOCIATED;
 
 	for_each_pool_worker(worker, pool) {
--
cgit v1.2.3
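The defensive pattern behind the POOL_DISASSOCIATED check can be
illustrated with a hypothetical subsystem (names invented for the sketch;
this is not the workqueue code). When DOWN_PREPARE and the
DOWN_FAILED/ONLINE rollback are handled by notifier blocks of different
priorities, the rollback side cannot assume its prepare-side work ever
ran, so it has to test a state flag first:

    #include <linux/cpu.h>
    #include <linux/notifier.h>
    #include <linux/init.h>

    static bool example_disassociated;  /* hypothetical per-subsystem state */

    /* Low priority: on DOWN_PREPARE this runs after higher-priority blocks,
     * so an earlier failure means it never runs at all. */
    static int example_cpu_down_callback(struct notifier_block *nb,
                                         unsigned long action, void *hcpu)
    {
            if ((action & ~CPU_TASKS_FROZEN) == CPU_DOWN_PREPARE)
                    example_disassociated = true;
            return NOTIFY_OK;
    }

    /* High priority: DOWN_FAILED is still delivered here even when the
     * block above never ran, so guard the undo step with a state check
     * (this mirrors the POOL_DISASSOCIATED test in the patch). */
    static int example_cpu_up_callback(struct notifier_block *nb,
                                       unsigned long action, void *hcpu)
    {
            switch (action & ~CPU_TASKS_FROZEN) {
            case CPU_DOWN_FAILED:
            case CPU_ONLINE:
                    if (!example_disassociated)
                            break;
                    example_disassociated = false;
                    break;
            }
            return NOTIFY_OK;
    }

    static struct notifier_block example_cpu_up_nb = {
            .notifier_call  = example_cpu_up_callback,
            .priority       = 5,    /* cf. CPU_PRI_WORKQUEUE_UP */
    };

    static struct notifier_block example_cpu_down_nb = {
            .notifier_call  = example_cpu_down_callback,
            .priority       = -5,   /* cf. CPU_PRI_WORKQUEUE_DOWN */
    };

    static int __init example_cpu_init(void)
    {
            register_cpu_notifier(&example_cpu_up_nb);
            register_cpu_notifier(&example_cpu_down_nb);
            return 0;
    }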