<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/kernel/sched, branch v3.18</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/</subtitle>
<id>https://git.shady.money/linux/atom?h=v3.18</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v3.18'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2014-12-04T04:55:58Z</updated>
<entry>
<title>context_tracking: Restore previous state in schedule_user</title>
<updated>2014-12-04T04:55:58Z</updated>
<author>
<name>Andy Lutomirski</name>
<email>luto@amacapital.net</email>
</author>
<published>2014-12-03T23:37:08Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=7cc78f8fa02c2485104b86434acbc1538a3bd807'/>
<id>urn:sha1:7cc78f8fa02c2485104b86434acbc1538a3bd807</id>
<content type='text'>
It appears that some SCHEDULE_USER (asm for schedule_user) call
sites in arch/x86/kernel/entry_64.S are reached in RCU kernel
context, and schedule_user will return in RCU user context.  This
causes RCU warnings and possible failures.

This is intended to be a minimal fix suitable for 3.18.
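
A sketch of the shape of that fix, assuming the 3.18 context
tracking API (exception_enter()/exception_exit()); not necessarily
the verbatim commit:

	/* kernel/sched/core.c */
	asmlinkage __visible void __sched schedule_user(void)
	{
		/*
		 * Remember which RCU context (user or kernel) we were
		 * called in, and restore exactly that state on return,
		 * instead of unconditionally switching to user context.
		 */
		enum ctx_state prev_state = exception_enter();

		schedule();

		exception_exit(prev_state);
	}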

Reported-and-tested-by: Dave Jones &lt;davej@redhat.com&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Frédéric Weisbecker &lt;fweisbec@gmail.com&gt;
Acked-by: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Signed-off-by: Andy Lutomirski &lt;luto@amacapital.net&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>sched: Provide update_curr callbacks for stop/idle scheduling classes</title>
<updated>2014-11-23T22:14:40Z</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2014-11-23T22:04:52Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=90e362f4a75d0911ca75e5cd95591a6cf1f169dc'/>
<id>urn:sha1:90e362f4a75d0911ca75e5cd95591a6cf1f169dc</id>
<content type='text'>
Chris bisected a NULL pointer dereference in task_sched_runtime() to
commit 6e998916dfe3 'sched/cputime: Fix clock_nanosleep()/clock_gettime()
inconsistency'.

Chris observed crashes in atop or other /proc walking programs when he
started fork bombs on his machine.  He assumed this was a new exit
race, but that does not make any sense when looking at that commit.

What's interesting is that the commit provides update_curr callbacks
for all scheduling classes except stop_task and idle_task.

While nothing can ever hit that via the clock_nanosleep() and
clock_gettime() interfaces, which were the target of the commit in
question, the author obviously forgot that there are other code paths
which invoke task_sched_runtime():

do_task_stat()
 thread_group_cputime_adjusted()
   thread_group_cputime()
     task_cputime()
       task_sched_runtime()
        if (task_current(rq, p) &amp;&amp; task_on_rq_queued(p)) {
          update_rq_clock(rq);
          p-&gt;sched_class-&gt;update_curr(rq);
        }

If the stats are read for a stomp machine task, aka 'migration/N',
and that task is current on its CPU, this will happily call through
the NULL stop_task update_curr pointer.  Oops.

Chris's observation that this happens faster when he runs the fork
bomb makes sense, as the fork bomb kicks the migration threads more
often, so the probability of hitting the issue increases.

Add the missing update_curr callbacks to the scheduling classes
stop_task and idle_task.  While idle tasks cannot be monitored via
/proc, we have other means to hit the idle case.
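
A sketch of the missing callbacks, assuming the 3.18 sched_class
layout; stop and idle tasks do not accrue class-specific runtime
stats, so empty callbacks are enough to make the indirect call safe:

	/* kernel/sched/stop_task.c */
	static void update_curr_stop(struct rq *rq)
	{
	}

	/* kernel/sched/idle_task.c */
	static void update_curr_idle(struct rq *rq)
	{
	}

	/* ... and hook them up via the .update_curr members of
	 * stop_sched_class and idle_sched_class respectively. */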

Fixes: 6e998916dfe3 'sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency'
Reported-by: Chris Mason &lt;clm@fb.com&gt;
Reported-and-tested-by: Borislav Petkov &lt;bp@alien8.de&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: Stanislaw Gruszka &lt;sgruszka@redhat.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency</title>
<updated>2014-11-16T09:04:20Z</updated>
<author>
<name>Stanislaw Gruszka</name>
<email>sgruszka@redhat.com</email>
</author>
<published>2014-11-12T15:58:44Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=6e998916dfe327e785e7c2447959b2c1a3ea4930'/>
<id>urn:sha1:6e998916dfe327e785e7c2447959b2c1a3ea4930</id>
<content type='text'>
Commit d670ec13178d0 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc
test case at the cost of breaking another one. After that commit, calling
clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&amp;Y) can result
in time Y being smaller than time X.

A reproducer/tester can be found further below; it can be compiled and run with:

	gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread
	while ./tst-cpuclock2 ; do : ; done

This reproducer, when running on a buggy kernel, will complain
about "clock_gettime difference too small".

The issue happens because, on start, thread_group_cputimer() initializes
the cputimer's sum_exec_runtime with runtime the threads have not yet
accounted, and then adds the threads' runtime to the running cputimer
again on the scheduler tick, making its sum_exec_runtime bigger than the
actual threads' runtime.

KOSAKI Motohiro posted a fix for this problem, but that patch was never
applied: https://lkml.org/lkml/2013/5/26/191 .

This patch takes a different approach to cure the problem. It calls
update_curr() when the cputimer starts; that assures we have updated
stats for the running threads, so on the next scheduler tick we will
account only the runtime that elapsed since the cputimer started. That
also assures a consistent state between the CPU times of the individual
threads and the CPU time of the process consisting of those threads.
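
A sketch of that approach in task_sched_runtime(), assuming the 3.18
code; this is the hunk the changelog describes (minus the 64-bit
fast path), not necessarily the verbatim commit:

	u64 task_sched_runtime(struct task_struct *p)
	{
		unsigned long flags;
		struct rq *rq;
		u64 ns;

		rq = task_rq_lock(p, &amp;flags);
		/*
		 * Fold the delta of the currently running task into
		 * its stats before sampling, so a later tick cannot
		 * account that delta a second time.
		 */
		if (task_current(rq, p) &amp;&amp; task_on_rq_queued(p)) {
			update_rq_clock(rq);
			p-&gt;sched_class-&gt;update_curr(rq);
		}
		ns = p-&gt;se.sum_exec_runtime;
		task_rq_unlock(rq, p, &amp;flags);

		return ns;
	}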

Full reproducer (tst-cpuclock2.c):

	#define _GNU_SOURCE
	#include &lt;unistd.h&gt;
	#include &lt;sys/syscall.h&gt;
	#include &lt;stdio.h&gt;
	#include &lt;time.h&gt;
	#include &lt;pthread.h&gt;
	#include &lt;stdint.h&gt;
	#include &lt;inttypes.h&gt;

	/* Parameters for the Linux kernel ABI for CPU clocks.  */
	#define CPUCLOCK_SCHED          2
	#define MAKE_PROCESS_CPUCLOCK(pid, clock) \
		((~(clockid_t) (pid) &lt;&lt; 3) | (clockid_t) (clock))

	static pthread_barrier_t barrier;

	/* Help advance the clock.  */
	static void *chew_cpu(void *arg)
	{
		pthread_barrier_wait(&amp;barrier);
		while (1) ;

		return NULL;
	}

	/* Don't use the glibc wrapper.  */
	static int do_nanosleep(int flags, const struct timespec *req)
	{
		clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);

		return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
	}

	static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
	{
		int64_t before_i = before-&gt;tv_sec * 1000000000ULL + before-&gt;tv_nsec;
		int64_t after_i = after-&gt;tv_sec * 1000000000ULL + after-&gt;tv_nsec;

		return after_i - before_i;
	}

	int main(void)
	{
		int result = 0;
		pthread_t th;

		pthread_barrier_init(&amp;barrier, NULL, 2);

		if (pthread_create(&amp;th, NULL, chew_cpu, NULL) != 0) {
			perror("pthread_create");
			return 1;
		}

		pthread_barrier_wait(&amp;barrier);

		/* The test.  */
		struct timespec before, after, sleeptimeabs;
		int64_t sleepdiff, diffabs;
		const struct timespec sleeptime = { .tv_sec = 0, .tv_nsec = 100000000 };

		/* The relative nanosleep.  Not sure why this is needed, but its presence
		   seems to make it easier to reproduce the problem.  */
		if (do_nanosleep(0, &amp;sleeptime) != 0) {
			perror("clock_nanosleep");
			return 1;
		}

		/* Get the current time.  */
		if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &amp;before) &lt; 0) {
			perror("clock_gettime[2]");
			return 1;
		}

		/* Compute the absolute sleep time based on the current time.  */
		uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
		sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
		sleeptimeabs.tv_nsec = nsec % 1000000000;

		/* Sleep for the computed time.  */
		if (do_nanosleep(TIMER_ABSTIME, &amp;sleeptimeabs) != 0) {
			perror("absolute clock_nanosleep");
			return 1;
		}

		/* Get the time after the sleep.  */
		if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &amp;after) &lt; 0) {
			perror("clock_gettime[3]");
			return 1;
		}

		/* The time after sleep should always be equal to or after the absolute sleep
		   time passed to clock_nanosleep.  */
		sleepdiff = tsdiff(&amp;sleeptimeabs, &amp;after);
		if (sleepdiff &lt; 0) {
			printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
			result = 1;

			printf("Before %llu.%09llu\n", (unsigned long long)before.tv_sec,
			       (unsigned long long)before.tv_nsec);
			printf("After  %llu.%09llu\n", (unsigned long long)after.tv_sec,
			       (unsigned long long)after.tv_nsec);
			printf("Sleep  %llu.%09llu\n", (unsigned long long)sleeptimeabs.tv_sec,
			       (unsigned long long)sleeptimeabs.tv_nsec);
		}

		/* The difference between the timestamps taken before and after the
		   clock_nanosleep call should be equal to or more than the duration of the
		   sleep.  */
		diffabs = tsdiff(&amp;before, &amp;after);
		if (diffabs &lt; sleeptime.tv_nsec) {
			printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
			result = 1;
		}

		pthread_cancel(th);

		return result;
	}

Signed-off-by: Stanislaw Gruszka &lt;sgruszka@redhat.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/cputime: Fix cpu_timer_sample_group() double accounting</title>
<updated>2014-11-16T09:04:18Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2014-11-12T11:37:37Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=23cfa361f3e54a3e184a5e126bbbdd95f984881a'/>
<id>urn:sha1:23cfa361f3e54a3e184a5e126bbbdd95f984881a</id>
<content type='text'>
While looking over the cpu-timer code I found that we appear to add
the delta for the calling task twice, through:

  cpu_timer_sample_group()
    thread_group_cputimer()
      thread_group_cputime()
        times-&gt;sum_exec_runtime += task_sched_runtime();

    *sample = cputime.sum_exec_runtime + task_delta_exec();

This would make the sample run ahead, making the sleep too short.
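
A sketch of the corresponding fix, assuming the 3.18
cpu_timer_sample_group(); since thread_group_cputime() already folds
the running task's delta in through task_sched_runtime(), the extra
task_delta_exec() term can simply be dropped:

	/* in cpu_timer_sample_group(), CPUCLOCK_SCHED case */
	*sample = cputime.sum_exec_runtime;
	/* not: cputime.sum_exec_runtime + task_delta_exec(p),
	 * which added the current task's delta a second time */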

Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Stanislaw Gruszka &lt;sgruszka@redhat.com&gt;
Cc: Christoph Lameter &lt;cl@linux.com&gt;
Cc: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Link: http://lkml.kernel.org/r/20141112113737.GI10476@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/numa: Avoid selecting oneself as swap target</title>
<updated>2014-11-16T09:04:17Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2014-11-10T09:54:35Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=7af683350cb0ddd0e9d3819b4eb7abe9e2d3e709'/>
<id>urn:sha1:7af683350cb0ddd0e9d3819b4eb7abe9e2d3e709</id>
<content type='text'>
Because the whole NUMA task selection stuff runs with preemption
enabled (it's long and expensive), we can end up migrating and
selecting ourselves as a swap target. This doesn't really work out
well -- we end up trying to acquire the same lock twice for the
swap migrate -- so avoid this.
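
A sketch of the guard, assuming the 3.18 task_numa_compare() in
kernel/sched/fair.c, where env-&gt;p is the task looking for a better
node and cur is the candidate picked off the destination runqueue:

	/* in task_numa_compare() */
	/*
	 * We ran with preemption enabled and may have been migrated,
	 * so the candidate can be ourselves; swapping with ourselves
	 * means taking the same task_rq lock twice, so bail out.
	 */
	if (cur == env-&gt;p)
		goto unlock;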

Reported-and-Tested-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/20141110100328.GF29390@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/numa: Fix out of bounds read in sched_init_numa()</title>
<updated>2014-11-10T09:33:22Z</updated>
<author>
<name>Andrey Ryabinin</name>
<email>a.ryabinin@samsung.com</email>
</author>
<published>2014-11-07T14:53:40Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=c123588b3b193d06588dfb51f475407f835ebfb2'/>
<id>urn:sha1:c123588b3b193d06588dfb51f475407f835ebfb2</id>
<content type='text'>
On the latest mm + KASan patchset I've got this:

    ==================================================================
    BUG: AddressSanitizer: out of bounds access in sched_init_smp+0x3ba/0x62c at addr ffff88006d4bee6c
    =============================================================================
    BUG kmalloc-8 (Not tainted): kasan error
    -----------------------------------------------------------------------------

    Disabling lock debugging due to kernel taint
    INFO: Allocated in alloc_vfsmnt+0xb0/0x2c0 age=75 cpu=0 pid=0
     __slab_alloc+0x4b4/0x4f0
     __kmalloc_track_caller+0x15f/0x1e0
     kstrdup+0x44/0x90
     alloc_vfsmnt+0xb0/0x2c0
     vfs_kern_mount+0x35/0x190
     kern_mount_data+0x25/0x50
     pid_ns_prepare_proc+0x19/0x50
     alloc_pid+0x5e2/0x630
     copy_process.part.41+0xdf5/0x2aa0
     do_fork+0xf5/0x460
     kernel_thread+0x21/0x30
     rest_init+0x1e/0x90
     start_kernel+0x522/0x531
     x86_64_start_reservations+0x2a/0x2c
     x86_64_start_kernel+0x15b/0x16a
    INFO: Slab 0xffffea0001b52f80 objects=24 used=22 fp=0xffff88006d4befc0 flags=0x100000000004080
    INFO: Object 0xffff88006d4bed20 @offset=3360 fp=0xffff88006d4bee70

    Bytes b4 ffff88006d4bed10: 00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a  ........ZZZZZZZZ
    Object ffff88006d4bed20: 70 72 6f 63 00 6b 6b a5                          proc.kk.
    Redzone ffff88006d4bed28: cc cc cc cc cc cc cc cc                          ........
    Padding ffff88006d4bee68: 5a 5a 5a 5a 5a 5a 5a 5a                          ZZZZZZZZ
    CPU: 0 PID: 1 Comm: swapper/0 Tainted: G    B          3.18.0-rc3-mm1+ #108
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
     ffff88006d4be000 0000000000000000 ffff88006d4bed20 ffff88006c86fd18
     ffffffff81cd0a59 0000000000000058 ffff88006d404240 ffff88006c86fd48
     ffffffff811fa3a8 ffff88006d404240 ffffea0001b52f80 ffff88006d4bed20
    Call Trace:
    dump_stack (lib/dump_stack.c:52)
    print_trailer (mm/slub.c:645)
    object_err (mm/slub.c:652)
    ? sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
    kasan_report_error (mm/kasan/report.c:102 mm/kasan/report.c:178)
    ? kasan_poison_shadow (mm/kasan/kasan.c:48)
    ? kasan_unpoison_shadow (mm/kasan/kasan.c:54)
    ? kasan_poison_shadow (mm/kasan/kasan.c:48)
    ? kasan_kmalloc (mm/kasan/kasan.c:311)
    __asan_load4 (mm/kasan/kasan.c:371)
    ? sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
    sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
    kernel_init_freeable (init/main.c:869 init/main.c:997)
    ? finish_task_switch (kernel/sched/sched.h:1036 kernel/sched/core.c:2248)
    ? rest_init (init/main.c:924)
    kernel_init (init/main.c:929)
    ? rest_init (init/main.c:924)
    ret_from_fork (arch/x86/kernel/entry_64.S:348)
    ? rest_init (init/main.c:924)
    Read of size 4 by task swapper/0:
    Memory state around the buggy address:
     ffff88006d4beb80: fc fc fc fc fc fc fc fc fc fc 00 fc fc fc fc fc
     ffff88006d4bec00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
     ffff88006d4bec80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
     ffff88006d4bed00: fc fc fc fc 00 fc fc fc fc fc fc fc fc fc fc fc
     ffff88006d4bed80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    &gt;ffff88006d4bee00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc 04 fc
                                                              ^
     ffff88006d4bee80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
     ffff88006d4bef00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
     ffff88006d4bef80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
     ffff88006d4bf000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
     ffff88006d4bf080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ==================================================================

A zero 'level' (e.g. on a non-NUMA system) causes an out-of-bounds
access in this line:

     sched_max_numa_distance = sched_domains_numa_distance[level - 1];

Fix this by exiting from sched_init_numa() earlier.
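
A sketch of such an early exit, assuming the 3.18 sched_init_numa();
the exact placement of the guard is an assumption:

	/* kernel/sched/core.c, sched_init_numa() */
	/*
	 * On a system with no additional NUMA distance levels
	 * (e.g. non-NUMA), 'level' stays 0, and the
	 * sched_domains_numa_distance[level - 1] read would be out
	 * of bounds -- so return before reaching it.
	 */
	if (!level)
		return;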

Signed-off-by: Andrey Ryabinin &lt;a.ryabinin@samsung.com&gt;
Reviewed-by: Rik van Riel &lt;riel@redhat.com&gt;
Fixes: 9942f79ba ("sched/numa: Export info needed for NUMA balancing on complex topologies")
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1415372020-1871-1-git-send-email-a.ryabinin@samsung.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched: Remove lockdep check in sched_move_task()</title>
<updated>2014-11-04T06:07:30Z</updated>
<author>
<name>Kirill Tkhai</name>
<email>ktkhai@parallels.com</email>
</author>
<published>2014-10-28T05:24:34Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=f7b8a47da17c9ee4998f2ca2018fcc424e953c0e'/>
<id>urn:sha1:f7b8a47da17c9ee4998f2ca2018fcc424e953c0e</id>
<content type='text'>
sched_move_task() is the only interface to change sched_task_group:
cpu_cgrp_subsys methods and autogroup_move_group() use it.

Everything is synchronized by task_rq_lock(), so cpu_cgroup_attach()
is ordered with other users of sched_move_task(). This means we do not
need RCU here: if we've dereferenced a tg here, the .attach method
hasn't been called for it yet.

Thus, we should pass "true" to task_css_check() to silence lockdep
warnings.
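
A sketch of the resulting call, assuming the 3.18 sched_move_task();
task_css_check()'s last argument is the lockdep condition that
justifies the non-RCU dereference:

	/* kernel/sched/core.c, sched_move_task() */
	/*
	 * Serialized by task_rq_lock() above, so no RCU read lock
	 * is required here; pass "true" to satisfy lockdep.
	 */
	tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
			  struct task_group, css);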

Fixes: eeb61e53ea19 ("sched: Fix race between task_group and sched_task_group")
Reported-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Reported-by: Fengguang Wu &lt;fengguang.wu@intel.com&gt;
Signed-off-by: Kirill Tkhai &lt;ktkhai@parallels.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/1414473874.8574.2.camel@tkhai
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/dl: Fix preemption checks</title>
<updated>2014-10-28T09:46:10Z</updated>
<author>
<name>Kirill Tkhai</name>
<email>ktkhai@parallels.com</email>
</author>
<published>2014-10-21T16:35:56Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=f3a7e1a9c464a32ee186ab91388313c82e7ce018'/>
<id>urn:sha1:f3a7e1a9c464a32ee186ab91388313c82e7ce018</id>
<content type='text'>
1) The switched_to_dl() check is wrong. We reschedule only
   if rq-&gt;curr is a deadline task, and we do not reschedule
   if it's a lower priority task. But we must always
   preempt a task of another class.

2) dl_task_timer():
   The policy does not change in case of priority inheritance;
   rt_mutex_setprio() changes the prio while the policy remains
   the old one.

So we lose some balancing logic in dl_task_timer() and
switched_to_dl() when we check the policy instead of the priority.
A boosted task may be rq-&gt;curr.

(I didn't change switched_from_dl() because no check is necessary
there at all).

I've looked at this place (switched_to_dl()) several times and even
fixed this function, but only found this now...  I suppose some
performance tests may work better after this.
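
A sketch of the resulting checks, assuming the 3.18
kernel/sched/deadline.c helpers (dl_task() tests the effective
priority, task_has_dl_policy() only the policy, so the former also
covers boosted tasks); the exact hunks are an assumption:

	/* in switched_to_dl() and dl_task_timer() */
	if (dl_task(rq-&gt;curr))
		/* current is (possibly boosted to) deadline:
		 * let the dl preemption logic decide */
		check_preempt_curr_dl(rq, p, 0);
	else
		/* current is of a lower class: always preempt it */
		resched_curr(rq);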

Signed-off-by: Kirill Tkhai &lt;ktkhai@parallels.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Juri Lelli &lt;juri.lelli@gmail.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/1413909356.19914.128.camel@tkhai
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched: stop the unbound recursion in preempt_schedule_context()</title>
<updated>2014-10-28T09:46:05Z</updated>
<author>
<name>Oleg Nesterov</name>
<email>oleg@redhat.com</email>
</author>
<published>2014-10-05T20:23:22Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=009f60e2763568cdcd75bd1cf360c7c7165e2e60'/>
<id>urn:sha1:009f60e2763568cdcd75bd1cf360c7c7165e2e60</id>
<content type='text'>
preempt_schedule_context() does preempt_enable_notrace() at the end
and this can call the same function again; exception_exit() is heavy
and it is quite possible that need-resched is true again.

1. Change this code to decrement preempt_count() and check
   need_resched() by hand (see the sketch after this list).

2. As Linus suggested, we can use the PREEMPT_ACTIVE bit and avoid
   the enable/disable dance around __schedule(). But in this case
   we need to move it into sched/core.c.

3. Cosmetic, but x86 forgets to declare this function. This doesn't
   really matter because it is only called by asm helpers; still, it
   makes sense to add the declaration into asm/preempt.h to match
   preempt_schedule().
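
A sketch of the result, assuming the 3.18 preempt and context
tracking APIs; close to, but not necessarily, the verbatim commit:

	/* kernel/sched/core.c */
	asmlinkage __visible void __sched notrace preempt_schedule_context(void)
	{
		enum ctx_state prev_ctx;

		if (likely(!preemptible()))
			return;

		do {
			__preempt_count_add(PREEMPT_ACTIVE);
			/*
			 * Track the previous context by hand instead
			 * of recursing through the heavyweight
			 * exception_exit() in preempt_enable_notrace().
			 */
			prev_ctx = exception_enter();
			__schedule();
			exception_exit(prev_ctx);
			__preempt_count_sub(PREEMPT_ACTIVE);
			/* re-check need_resched() by hand */
			barrier();
		} while (need_resched());
	}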

Reported-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Alexander Graf &lt;agraf@suse.de&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Christoph Lameter &lt;cl@linux.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Masami Hiramatsu &lt;masami.hiramatsu.pt@hitachi.com&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Peter Anvin &lt;hpa@zytor.com&gt;
Cc: Andy Lutomirski &lt;luto@amacapital.net&gt;
Cc: Denys Vlasenko &lt;dvlasenk@redhat.com&gt;
Cc: Chuck Ebbert &lt;cebbert.lkml@gmail.com&gt;
Cc: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Link: http://lkml.kernel.org/r/20141005202322.GB27962@redhat.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/fair: Fix division by zero sysctl_numa_balancing_scan_size</title>
<updated>2014-10-28T09:46:04Z</updated>
<author>
<name>Kirill Tkhai</name>
<email>ktkhai@parallels.com</email>
</author>
<published>2014-10-16T10:39:37Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=6419265899d9bd27e5ff9f8b43db3715407fc2ba'/>
<id>urn:sha1:6419265899d9bd27e5ff9f8b43db3715407fc2ba</id>
<content type='text'>
The file /proc/sys/kernel/numa_balancing_scan_size_mb allows writing a
zero.

This bash command reproduces the problem:

$ while :; do echo 0 &gt; /proc/sys/kernel/numa_balancing_scan_size_mb; \
	   echo 256 &gt; /proc/sys/kernel/numa_balancing_scan_size_mb; done

	divide error: 0000 [#1] SMP
	Modules linked in:
	CPU: 0 PID: 24112 Comm: bash Not tainted 3.17.0+ #8
	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
	task: ffff88013c852600 ti: ffff880037a68000 task.ti: ffff880037a68000
	RIP: 0010:[&lt;ffffffff81074191&gt;]  [&lt;ffffffff81074191&gt;] task_scan_min+0x21/0x50
	RSP: 0000:ffff880037a6bce0  EFLAGS: 00010246
	RAX: 0000000000000a00 RBX: 00000000000003e8 RCX: 0000000000000000
	RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013c852600
	RBP: ffff880037a6bcf0 R08: 0000000000000001 R09: 0000000000015c90
	R10: ffff880239bf6c00 R11: 0000000000000016 R12: 0000000000003fff
	R13: ffff88013c852600 R14: ffffea0008d1b000 R15: 0000000000000003
	FS:  00007f12bb048700(0000) GS:ffff88007da00000(0000) knlGS:0000000000000000
	CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
	CR2: 0000000001505678 CR3: 0000000234770000 CR4: 00000000000006f0
	Stack:
	 ffff88013c852600 0000000000003fff ffff880037a6bd18 ffffffff810741d1
	 ffff88013c852600 0000000000003fff 000000000002bfff ffff880037a6bda8
	 ffffffff81077ef7 ffffea0008a56d40 0000000000000001 0000000000000001
	Call Trace:
	 [&lt;ffffffff810741d1&gt;] task_scan_max+0x11/0x40
	 [&lt;ffffffff81077ef7&gt;] task_numa_fault+0x1f7/0xae0
	 [&lt;ffffffff8115a896&gt;] ? migrate_misplaced_page+0x276/0x300
	 [&lt;ffffffff81134a4d&gt;] handle_mm_fault+0x62d/0xba0
	 [&lt;ffffffff8103e2f1&gt;] __do_page_fault+0x191/0x510
	 [&lt;ffffffff81030122&gt;] ? native_smp_send_reschedule+0x42/0x60
	 [&lt;ffffffff8106dc00&gt;] ? check_preempt_curr+0x80/0xa0
	 [&lt;ffffffff8107092c&gt;] ? wake_up_new_task+0x11c/0x1a0
	 [&lt;ffffffff8104887d&gt;] ? do_fork+0x14d/0x340
	 [&lt;ffffffff811799bb&gt;] ? get_unused_fd_flags+0x2b/0x30
	 [&lt;ffffffff811799df&gt;] ? __fd_install+0x1f/0x60
	 [&lt;ffffffff8103e67c&gt;] do_page_fault+0xc/0x10
	 [&lt;ffffffff8150d322&gt;] page_fault+0x22/0x30
	RIP  [&lt;ffffffff81074191&gt;] task_scan_min+0x21/0x50
	RSP &lt;ffff880037a6bce0&gt;
	---[ end trace 9a826d16936c04de ]---

Also fix a race in task_scan_min() (it depends on compiler behaviour).
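
A sketch of both parts of the fix, assuming the 3.18 task_scan_min()
in kernel/sched/fair.c (ACCESS_ONCE() being the pre-READ_ONCE()
idiom of that era); the exact hunks are an assumption:

	/* kernel/sched/fair.c, task_scan_min() */
	/*
	 * Read the sysctl once, so a concurrent write cannot hand us
	 * a zero divisor between the check and the division.
	 */
	unsigned int scan_size = ACCESS_ONCE(sysctl_numa_balancing_scan_size);
	unsigned int windows = 1;

	if (scan_size &lt; MAX_SCAN_WINDOW)
		windows = MAX_SCAN_WINDOW / scan_size;

	/* And in kernel/sysctl.c, reject the zero in the first place:
	 * switch numa_balancing_scan_size_mb to proc_dointvec_minmax()
	 * with .extra1 = &amp;one. */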

Signed-off-by: Kirill Tkhai &lt;ktkhai@parallels.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Aaron Tomlin &lt;atomlin@redhat.com&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Dario Faggioli &lt;raistlin@linux.it&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Jens Axboe &lt;axboe@fb.com&gt;
Cc: Kees Cook &lt;keescook@chromium.org&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Link: http://lkml.kernel.org/r/1413455977.24793.78.camel@tkhai
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
</feed>
