linux/kernel/sched.c, branch v2.6.26

sched: fix cpu hotplug, cleanup

2008-07-10T18:39:58Z

Clean up __migrate_task(): to just have separate "done" and "fail" cases, instead of that "out" case with random error behavior. Signed-off-by: Ingo Molnar

sched: fix cpu hotplug

2008-07-10T07:35:34Z

I think we may have a race between try_to_wake_up() and migrate_live_tasks() -> move_task_off_dead_cpu() when the later one may end up looping endlessly. Interrupts are enabled on other CPUs when migration_call(CPU_DEAD, ...) is called so we may get a race between try_to_wake_up() and migrate_live_tasks() -> move_task_off_dead_cpu(). The former one may push a task out of a dead CPU causing the later one to loop endlessly. Heiko Carstens observed: | That's exactly what explains a dump I got yesterday. Thanks for fixing! :) Signed-off-by: Dmitry Adamushko Cc: miaox@cn.fujitsu.com Cc: Lai Jiangshan Cc: Heiko Carstens Cc: Peter Zijlstra Cc: Avi Kivity Cc: Andrew Morton Signed-off-by: Ingo Molnar

sched: fix divide error when trying to configure rt_period to zero

2008-07-01T06:23:24Z

Here it is another little Oops we found while configuring invalid values via cgroups: echo 0 > /dev/cgroups/0/cpu.rt_period_us or echo 4294967296 > /dev/cgroups/0/cpu.rt_period_us [ 205.509825] divide error: 0000 [#1] [ 205.510151] Modules linked in: [ 205.510151] [ 205.510151] Pid: 2339, comm: bash Not tainted (2.6.26-rc8 #33) [ 205.510151] EIP: 0060:[] EFLAGS: 00000293 CPU: 0 [ 205.510151] EIP is at div64_u64+0x5f/0x70 [ 205.510151] EAX: 0000389f EBX: 00000000 ECX: 00000000 EDX: 00000000 [ 205.510151] ESI: d9800000 EDI: 00000000 EBP: c6cede60 ESP: c6cede50 [ 205.510151] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068 [ 205.510151] Process bash (pid: 2339, ti=c6cec000 task=c79be370 task.ti=c6cec000) [ 205.510151] Stack: d9800000 0000389f c05971a0 d9800000 c6cedeb4 c0214dbd 00000000 00000000 [ 205.510151] c6cede88 c0242bd8 c05377c0 c7a41b40 00000000 00000000 00000000 c05971a0 [ 205.510151] c780ed20 c7508494 c7a41b40 00000000 00000002 c6cedebc c05971a0 ffffffea [ 205.510151] Call Trace: [ 205.510151] [] ? __rt_schedulable+0x1cd/0x240 [ 205.510151] [] ? cgroup_file_open+0x18/0xe0 [ 205.510151] [] ? tg_set_bandwidth+0xa4/0xf0 [ 205.510151] [] ? sched_group_set_rt_period+0x36/0x50 [ 205.510151] [] ? cpu_rt_period_write_uint+0xe/0x10 [ 205.510151] [] ? cgroup_file_write+0x125/0x160 [ 205.510151] [] ? hrtimer_interrupt+0x155/0x190 [ 205.510151] [] ? security_file_permission+0xf/0x20 [ 205.510151] [] ? rw_verify_area+0x48/0xc0 [ 205.510151] [] ? dupfd+0x104/0x130 [ 205.510151] [] ? vfs_write+0x9c/0x160 [ 205.510151] [] ? cgroup_file_write+0x0/0x160 [ 205.510151] [] ? sys_write+0x3d/0x70 [ 205.510151] [] ? sysenter_past_esp+0x6a/0x91 [ 205.510151] ======================= [ 205.510151] Code: 0f 45 de 31 f6 0f ad d0 d3 ea f6 c1 20 0f 45 c2 0f 45 d6 89 45 f0 89 55 f4 8b 55 f4 31 c9 8b 45 f0 39 d3 89 c6 77 08 89 d0 31 d2 f3 89 c1 83 c4 08 89 f0 f7 f3 89 ca 5b 5e 5d c3 55 89 e5 56 [ 205.510151] EIP: [] div64_u64+0x5f/0x70 SS:ESP 0068:c6cede50 The attached patch solves the issue for me. I'm checking as soon as possible for the period not being zero since, if it is, going ahead is useless. This way we also save a mutex_lock() and a read_lock() wrt doing it inside tg_set_bandwidth() or __rt_schedulable(). Signed-off-by: Dario Faggioli Signed-off-by: Michael Trimarchi Signed-off-by: Ingo Molnar

sched: fix cpu hotplug

2008-06-29T06:50:21Z

the CPU hotplug problems (crashes under high-volume unplug+replug tests) seem to be related to migrate_dead_tasks(). Firstly I added traces to see all tasks being migrated with migrate_live_tasks() and migrate_dead_tasks(). On my setup the problem pops up (the one with "se == NULL" in the loop of pick_next_task_fair()) shortly after the traces indicate that some has been migrated with migrate_dead_tasks()). btw., I can reproduce it much faster now with just a plain cpu down/up loop. [disclaimer] Well, unless I'm really missing something important in this late hour [/desclaimer] pick_next_task() is not something appropriate for migrate_dead_tasks() :-) the following change seems to eliminate the problem on my setup (although, I kept it running only for a few minutes to get a few messages indicating migrate_dead_tasks() does move tasks and the system is still ok) Signed-off-by: Ingo Molnar

Merge branch 'linus' into sched/urgent

2008-06-23T09:00:26Z

Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

2008-06-20T19:37:13Z

* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: softlockup: fix NMI hangs due to lock race - 2.6.26-rc regression rcupreempt: remove export of rcu_batches_completed_bh cpuset: limit the input of cpuset.sched_relax_domain_level

sched: refactor wait_for_completion_timeout()

2008-06-20T15:15:49Z

Simplify the code and fix the boundary condition of wait_for_completion_timeout(,0). We can kill the first __remove_wait_queue() as well. Signed-off-by: Ingo Molnar Acked-by: Peter Zijlstra

sched: fix wait_for_completion_timeout() spurious failure under heavy load

2008-06-20T11:19:32Z

It seems that the current implementaton of wait_for_completion_timeout() has a small problem under very high load for the common pattern: if (!wait_for_completion_timeout(&done, timeout)) /* handle failure */ because the implementation very roughly does (lots of code deleted to show the basic flow): static inline long __sched do_wait_for_common(struct completion *x, long timeout, int state) { if (x->done) return timeout; do { timeout = schedule_timeout(timeout); if (!timeout) return timeout; } while (!x->done); return timeout; } so if the system is very busy and x->done is not set when do_wait_for_common() is entered, it is possible that the first call to schedule_timeout() returns 0 because the task doing wait_for_completion doesn't get rescheduled for a long time, even if it is woken up early enough. In this case, wait_for_completion_timeout() returns 0 without even checking x->done again, and the code above falls into its failure case purely for scheduler reasons, even if the hardware event or whatever was being waited for happened early enough. It would make sense to add an extra test to do_wait_for() in the timeout case and return 1 if x->done is actually set. A quick audit (not exhaustive) of wait_for_completion_timeout() callers seems to indicate that no one actually cares about the return value in the success case -- they just test for 0 (timed out) versus non-zero (wait succeeded). Signed-off-by: Ingo Molnar

cpuset: limit the input of cpuset.sched_relax_domain_level

2008-06-19T07:45:36Z

We allow the inputs to be [-1 ... SD_LV_MAX), and return -EINVAL for inputs outside this range. Signed-off-by: Li Zefan Acked-by: Paul Menage Acked-by: Paul Jackson Acked-by: Hidetoshi Seto Signed-off-by: Ingo Molnar Signed-off-by: Thomas Gleixner

sched: CPU hotplug events must not destroy scheduler domains created by the cpusets

2008-06-19T07:14:51Z

First issue is not related to the cpusets. We're simply leaking doms_cur. It's allocated in arch_init_sched_domains() which is called for every hotplug event. So we just keep reallocation doms_cur without freeing it. I introduced free_sched_domains() function that cleans things up. Second issue is that sched domains created by the cpusets are completely destroyed by the CPU hotplug events. For all CPU hotplug events scheduler attaches all CPUs to the NULL domain and then puts them all into the single domain thereby destroying domains created by the cpusets (partition_sched_domains). The solution is simple, when cpusets are enabled scheduler should not create default domain and instead let cpusets do that. Which is exactly what the patch does. Signed-off-by: Max Krasnyansky Cc: pj@sgi.com Cc: menage@google.com Cc: rostedt@goodmis.org Acked-by: Peter Zijlstra Signed-off-by: Thomas Gleixner