<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/kernel/sched.c, branch v3.1</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v3.1</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v3.1'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2011-09-30T12:07:06Z</updated>
<entry>
<title>posix-cpu-timers: Cure SMP wobbles</title>
<updated>2011-09-30T12:07:06Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>a.p.zijlstra@chello.nl</email>
</author>
<published>2011-09-01T10:42:04Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=d670ec13178d0fd8680e6742a2bc6e04f28f87d8'/>
<id>urn:sha1:d670ec13178d0fd8680e6742a2bc6e04f28f87d8</id>
<content type='text'>
David reported:

  Attached below is a watered-down version of rt/tst-cpuclock2.c from
  GLIBC.  Just build it with "gcc -o test test.c -lpthread -lrt" or
  similar.

  Run it several times, and you will see cases where the main thread
  will measure a process clock difference before and after the nanosleep
  which is smaller than the cpu-burner thread's individual thread clock
  difference.  This doesn't make any sense since the cpu-burner thread
  is part of the top-level process's thread group.

  I've reproduced this on both x86-64 and sparc64 (using both 32-bit and
  64-bit binaries).

  For example:

  [davem@boricha build-x86_64-linux]$ ./test
  process: before(0.001221967) after(0.498624371) diff(497402404)
  thread:  before(0.000081692) after(0.498316431) diff(498234739)
  self:    before(0.001223521) after(0.001240219) diff(16698)
  [davem@boricha build-x86_64-linux]$ 

  The diff of 'process' should always be &gt;= the diff of 'thread'.

  I make sure to wrap the 'thread' clock measurements the most tightly
  around the nanosleep() call, and that the 'process' clock measurements
  are the outer-most ones.

  ---
  #include &lt;unistd.h&gt;
  #include &lt;stdio.h&gt;
  #include &lt;stdlib.h&gt;
  #include &lt;time.h&gt;
  #include &lt;fcntl.h&gt;
  #include &lt;string.h&gt;
  #include &lt;errno.h&gt;
  #include &lt;pthread.h&gt;

  static pthread_barrier_t barrier;

  static void *chew_cpu(void *arg)
  {
	  pthread_barrier_wait(&amp;barrier);
	  while (1)
		  __asm__ __volatile__("" : : : "memory");
	  return NULL;
  }

  int main(void)
  {
	  clockid_t process_clock, my_thread_clock, th_clock;
	  struct timespec process_before, process_after;
	  struct timespec me_before, me_after;
	  struct timespec th_before, th_after;
	  struct timespec sleeptime;
	  unsigned long diff;
	  pthread_t th;
	  int err;

	  err = clock_getcpuclockid(0, &amp;process_clock);
	  if (err)
		  return 1;

	  err = pthread_getcpuclockid(pthread_self(), &amp;my_thread_clock);
	  if (err)
		  return 1;

	  pthread_barrier_init(&amp;barrier, NULL, 2);
	  err = pthread_create(&amp;th, NULL, chew_cpu, NULL);
	  if (err)
		  return 1;

	  err = pthread_getcpuclockid(th, &amp;th_clock);
	  if (err)
		  return 1;

	  pthread_barrier_wait(&amp;barrier);

	  err = clock_gettime(process_clock, &amp;process_before);
	  if (err)
		  return 1;

	  err = clock_gettime(my_thread_clock, &amp;me_before);
	  if (err)
		  return 1;

	  err = clock_gettime(th_clock, &amp;th_before);
	  if (err)
		  return 1;

	  sleeptime.tv_sec = 0;
	  sleeptime.tv_nsec = 500000000;
	  nanosleep(&amp;sleeptime, NULL);

	  err = clock_gettime(th_clock, &amp;th_after);
	  if (err)
		  return 1;

	  err = clock_gettime(my_thread_clock, &amp;me_after);
	  if (err)
		  return 1;

	  err = clock_gettime(process_clock, &amp;process_after);
	  if (err)
		  return 1;

	  diff = process_after.tv_nsec - process_before.tv_nsec;
	  printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
		 process_before.tv_sec, process_before.tv_nsec,
		 process_after.tv_sec, process_after.tv_nsec, diff);
	  diff = th_after.tv_nsec - th_before.tv_nsec;
	  printf("thread:  before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
		 th_before.tv_sec, th_before.tv_nsec,
		 th_after.tv_sec, th_after.tv_nsec, diff);
	  diff = me_after.tv_nsec - me_before.tv_nsec;
	  printf("self:    before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
		 me_before.tv_sec, me_before.tv_nsec,
		 me_after.tv_sec, me_after.tv_nsec, diff);

	  return 0;
  }

This is due to us using p-&gt;se.sum_exec_runtime in
thread_group_cputime() where we iterate the thread group and sum all
data. This does not take time since the last schedule operation (tick
or otherwise) into account. We can cure this by using
task_sched_runtime() at the cost of having to take locks.

This also means we can (and must) do away with
thread_group_sched_runtime() since the modified thread_group_cputime()
is now more accurate and would deadlock when called from
thread_group_sched_runtime().

Aside of that it makes the function safe on 32 bit systems. The old
code added t-&gt;se.sum_exec_runtime unprotected. sum_exec_runtime is a
64bit value and could be changed on another cpu at the same time.

Reported-by: David Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Cc: stable@kernel.org
Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins
Tested-by: David Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
</content>
</entry>
<entry>
<title>sched: Fix up wchan borkage</title>
<updated>2011-09-26T10:51:08Z</updated>
<author>
<name>Simon Kirby</name>
<email>sim@hostway.ca</email>
</author>
<published>2011-09-23T00:03:46Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=6ebbe7a07b3bc40b168d2afc569a6543c020d2e3'/>
<id>urn:sha1:6ebbe7a07b3bc40b168d2afc569a6543c020d2e3</id>
<content type='text'>
Commit c259e01a1ec ("sched: Separate the scheduler entry for
preemption") contained a boo-boo wrecking wchan output. It forgot to
put the new schedule() function in the __sched section and thereby
doesn't get properly ignored for things like wchan.

Tested-by: Simon Kirby &lt;sim@hostway.ca&gt;
Cc: stable@kernel.org # 2.6.39+
Signed-off-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Link: http://lkml.kernel.org/r/20110923000346.GA25425@hostway.ca
Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
</content>
</entry>
<entry>
<title>Merge branch 'sched-fixes-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip</title>
<updated>2011-09-07T20:01:34Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2011-09-07T20:01:34Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=e81b693c0104d6a767f998ee5a2e00b5acbbcd18'/>
<id>urn:sha1:e81b693c0104d6a767f998ee5a2e00b5acbbcd18</id>
<content type='text'>
* 'sched-fixes-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
  sched: Fix a memory leak in __sdt_free()
  sched: Move blk_schedule_flush_plug() out of __schedule()
  sched: Separate the scheduler entry for preemption
</content>
</entry>
<entry>
<title>perf events: Fix slow and broken cgroup context switch code</title>
<updated>2011-08-29T10:28:33Z</updated>
<author>
<name>Stephane Eranian</name>
<email>eranian@google.com</email>
</author>
<published>2011-08-25T13:58:03Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=a8d757ef076f0f95f13a918808824058de25b3eb'/>
<id>urn:sha1:a8d757ef076f0f95f13a918808824058de25b3eb</id>
<content type='text'>
The current cgroup context switch code was incorrect leading
to bogus counts. Furthermore, as soon as there was an active
cgroup event on a CPU, the context switch cost on that CPU
would increase by a significant amount as demonstrated by a
simple ping/pong example:

 $ ./pong
 Both processes pinned to CPU1, running for 10s
 10684.51 ctxsw/s

Now start a cgroup perf stat:
 $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 100

$ ./pong
 Both processes pinned to CPU1, running for 10s
 6674.61 ctxsw/s

That's a 37% penalty.

Note that pong is not even in the monitored cgroup.

The results shown by perf stat are bogus:
 $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 100

 Performance counter stats for 'sleep 100':

 CPU1 &lt;not counted&gt; cycles   test
 CPU1 16,984,189,138 cycles  #    0.000 GHz

The second 'cycles' event should report a count @ CPU clock
(here 2.4GHz) as it is counting across all cgroups.

The patch below fixes the bogus accounting and bypasses any
cgroup switches in case the outgoing and incoming tasks are
in the same cgroup.

With this patch the same test now yields:
 $ ./pong
 Both processes pinned to CPU1, running for 10s
 10775.30 ctxsw/s

Start perf stat with cgroup:

 $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 10

Run pong outside the cgroup:
 $ /pong
 Both processes pinned to CPU1, running for 10s
 10687.80 ctxsw/s

The penalty is now less than 2%.

And the results for perf stat are correct:

$ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 10

 Performance counter stats for 'sleep 10':

 CPU1 &lt;not counted&gt; cycles test #    0.000 GHz
 CPU1 23,933,981,448 cycles      #    0.000 GHz

Now perf stat reports the correct counts for
for the non cgroup event.

If we run pong inside the cgroup, then we also get the
correct counts:

$ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 10

 Performance counter stats for 'sleep 10':

 CPU1 22,297,726,205 cycles test #    0.000 GHz
 CPU1 23,933,981,448 cycles      #    0.000 GHz

      10.001457237 seconds time elapsed

Signed-off-by: Stephane Eranian &lt;eranian@google.com&gt;
Signed-off-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Link: http://lkml.kernel.org/r/20110825135803.GA4697@quad
Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
</content>
</entry>
<entry>
<title>sched: Fix a memory leak in __sdt_free()</title>
<updated>2011-08-29T10:27:01Z</updated>
<author>
<name>WANG Cong</name>
<email>amwang@redhat.com</email>
</author>
<published>2011-08-18T12:36:57Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=feff8fa0075bdfd43c841e9d689ed81adda988d6'/>
<id>urn:sha1:feff8fa0075bdfd43c841e9d689ed81adda988d6</id>
<content type='text'>
This patch fixes the following memory leak:

unreferenced object 0xffff880107266800 (size 512):
  comm "sched-powersave", pid 3718, jiffies 4323097853 (age 27495.450s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [&lt;ffffffff81133940&gt;] create_object+0x187/0x28b
    [&lt;ffffffff814ac103&gt;] kmemleak_alloc+0x73/0x98
    [&lt;ffffffff811232ba&gt;] __kmalloc_node+0x104/0x159
    [&lt;ffffffff81044b98&gt;] kzalloc_node.clone.97+0x15/0x17
    [&lt;ffffffff8104cb90&gt;] build_sched_domains+0xb7/0x7f3
    [&lt;ffffffff8104d4df&gt;] partition_sched_domains+0x1db/0x24a
    [&lt;ffffffff8109ee4a&gt;] do_rebuild_sched_domains+0x3b/0x47
    [&lt;ffffffff810a00c7&gt;] rebuild_sched_domains+0x10/0x12
    [&lt;ffffffff8104d5ba&gt;] sched_power_savings_store+0x6c/0x7b
    [&lt;ffffffff8104d5df&gt;] sched_mc_power_savings_store+0x16/0x18
    [&lt;ffffffff8131322c&gt;] sysdev_class_store+0x20/0x22
    [&lt;ffffffff81193876&gt;] sysfs_write_file+0x108/0x144
    [&lt;ffffffff81135b10&gt;] vfs_write+0xaf/0x102
    [&lt;ffffffff81135d23&gt;] sys_write+0x4d/0x74
    [&lt;ffffffff814c8a42&gt;] system_call_fastpath+0x16/0x1b
    [&lt;ffffffffffffffff&gt;] 0xffffffffffffffff

Signed-off-by: WANG Cong &lt;amwang@redhat.com&gt;
Signed-off-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Cc: stable@kernel.org # 3.0
Link: http://lkml.kernel.org/r/1313671017-4112-1-git-send-email-amwang@redhat.com
Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
</content>
</entry>
<entry>
<title>sched: Move blk_schedule_flush_plug() out of __schedule()</title>
<updated>2011-08-29T10:26:59Z</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2011-06-22T17:47:01Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=9c40cef2b799f9b5e7fa5de4d2ad3a0168ba118c'/>
<id>urn:sha1:9c40cef2b799f9b5e7fa5de4d2ad3a0168ba118c</id>
<content type='text'>
There is no real reason to run blk_schedule_flush_plug() with
interrupts and preemption disabled.

Move it into schedule() and call it when the task is going voluntarily
to sleep. There might be false positives when the task is woken
between that call and actually scheduling, but that's not really
different from being woken immediately after switching away.

This fixes a deadlock in the scheduler where the
blk_schedule_flush_plug() callchain enables interrupts and thereby
allows a wakeup to happen of the task that's going to sleep.

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: stable@kernel.org # 2.6.39+
Link: http://lkml.kernel.org/n/tip-dwfxtra7yg1b5r65m32ywtct@git.kernel.org
Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
</content>
</entry>
<entry>
<title>sched: Separate the scheduler entry for preemption</title>
<updated>2011-08-29T10:26:57Z</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2011-06-22T17:47:00Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=c259e01a1ec90063042f758e409cd26b2a0963c8'/>
<id>urn:sha1:c259e01a1ec90063042f758e409cd26b2a0963c8</id>
<content type='text'>
Block-IO and workqueues call into notifier functions from the
scheduler core code with interrupts and preemption disabled. These
calls should be made before entering the scheduler core.

To simplify this, separate the scheduler core code into
__schedule(). __schedule() is directly called from the places which
set PREEMPT_ACTIVE and from schedule(). This allows us to add the work
checks into schedule(), so they are only called when a task voluntary
goes to sleep.

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: stable@kernel.org # 2.6.39+
Link: http://lkml.kernel.org/r/20110622174918.813258321@linutronix.de
Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
</content>
</entry>
<entry>
<title>Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial</title>
<updated>2011-07-25T20:56:39Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2011-07-25T20:56:39Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=d3ec4844d449cf7af9e749f73ba2052fb7b72fc2'/>
<id>urn:sha1:d3ec4844d449cf7af9e749f73ba2052fb7b72fc2</id>
<content type='text'>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
  fs: Merge split strings
  treewide: fix potentially dangerous trailing ';' in #defined values/expressions
  uwb: Fix misspelling of neighbourhood in comment
  net, netfilter: Remove redundant goto in ebt_ulog_packet
  trivial: don't touch files that are removed in the staging tree
  lib/vsprintf: replace link to Draft by final RFC number
  doc: Kconfig: `to be' -&gt; `be'
  doc: Kconfig: Typo: square -&gt; squared
  doc: Konfig: Documentation/power/{pm =&gt; apm-acpi}.txt
  drivers/net: static should be at beginning of declaration
  drivers/media: static should be at beginning of declaration
  drivers/i2c: static should be at beginning of declaration
  XTENSA: static should be at beginning of declaration
  SH: static should be at beginning of declaration
  MIPS: static should be at beginning of declaration
  ARM: static should be at beginning of declaration
  rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_check
  Update my e-mail address
  PCIe ASPM: forcedly -&gt; forcibly
  gma500: push through device driver tree
  ...

Fix up trivial conflicts:
 - arch/arm/mach-ep93xx/dma-m2p.c (deleted)
 - drivers/gpio/gpio-ep93xx.c (renamed and context nearby)
 - drivers/net/r8169.c (just context changes)
</content>
</entry>
<entry>
<title>Merge branch 'kvm-updates/3.1' of git://git.kernel.org/pub/scm/virt/kvm/kvm</title>
<updated>2011-07-24T16:07:03Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2011-07-24T16:07:03Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=5fabc487c96819dd12ddb9414835d170fd9cd6d5'/>
<id>urn:sha1:5fabc487c96819dd12ddb9414835d170fd9cd6d5</id>
<content type='text'>
* 'kvm-updates/3.1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (143 commits)
  KVM: IOMMU: Disable device assignment without interrupt remapping
  KVM: MMU: trace mmio page fault
  KVM: MMU: mmio page fault support
  KVM: MMU: reorganize struct kvm_shadow_walk_iterator
  KVM: MMU: lockless walking shadow page table
  KVM: MMU: do not need atomicly to set/clear spte
  KVM: MMU: introduce the rules to modify shadow page table
  KVM: MMU: abstract some functions to handle fault pfn
  KVM: MMU: filter out the mmio pfn from the fault pfn
  KVM: MMU: remove bypass_guest_pf
  KVM: MMU: split kvm_mmu_free_page
  KVM: MMU: count used shadow pages on prepareing path
  KVM: MMU: rename 'pt_write' to 'emulate'
  KVM: MMU: cleanup for FNAME(fetch)
  KVM: MMU: optimize to handle dirty bit
  KVM: MMU: cache mmio info on page fault path
  KVM: x86: introduce vcpu_mmio_gva_to_gpa to cleanup the code
  KVM: MMU: do not update slot bitmap if spte is nonpresent
  KVM: MMU: fix walking shadow page table
  KVM guest: KVM Steal time registration
  ...
</content>
</entry>
<entry>
<title>Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip</title>
<updated>2011-07-22T23:45:02Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2011-07-22T23:45:02Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=bdc7ccfc0631797636837b10df7f87bc1e2e4ae3'/>
<id>urn:sha1:bdc7ccfc0631797636837b10df7f87bc1e2e4ae3</id>
<content type='text'>
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (24 commits)
  sched: Cleanup duplicate local variable in [enqueue|dequeue]_task_fair
  sched: Replace use of entity_key()
  sched: Separate group-scheduling code more clearly
  sched: Reorder root_domain to remove 64 bit alignment padding
  sched: Do not attempt to destroy uninitialized rt_bandwidth
  sched: Remove unused function cpu_cfs_rq()
  sched: Fix (harmless) typo 'CONFG_FAIR_GROUP_SCHED'
  sched, cgroup: Optimize load_balance_fair()
  sched: Don't update shares twice on on_rq parent
  sched: update correct entity's runtime in check_preempt_wakeup()
  xtensa: Use generic config PREEMPT definition
  h8300: Use generic config PREEMPT definition
  m32r: Use generic PREEMPT config
  sched: Skip autogroup when looking for all rt sched groups
  sched: Simplify mutex_spin_on_owner()
  sched: Remove rcu_read_lock() from wake_affine()
  sched: Generalize sleep inside spinlock detection
  sched: Make sleeping inside spinlock detection working in !CONFIG_PREEMPT
  sched: Isolate preempt counting in its own config option
  sched: Remove pointless in_atomic() definition check
  ...
</content>
</entry>
</feed>
