<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/kernel/cpuset.c, branch v2.6.35</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v2.6.35</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v2.6.35'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2010-05-27T16:12:44Z</updated>
<entry>
<title>cpusets: new round-robin rotor for SLAB allocations</title>
<updated>2010-05-27T16:12:44Z</updated>
<author>
<name>Jack Steiner</name>
<email>steiner@sgi.com</email>
</author>
<published>2010-05-26T21:42:49Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=6adef3ebe570bcde67fd6c16101451ddde5712b5'/>
<id>urn:sha1:6adef3ebe570bcde67fd6c16101451ddde5712b5</id>
<content type='text'>
We have observed several workloads running on multi-node systems where
memory is assigned unevenly across the nodes in the system.  There are
numerous reasons for this but one is the round-robin rotor in
cpuset_mem_spread_node().

For example, a simple test that writes a multi-page file will allocate
pages on nodes 0 2 4 6 ...  Odd nodes are skipped.  (Sometimes it
allocates on odd nodes &amp; skips even nodes).

An example is shown below.  The program "lfile" writes a file consisting
of 10 pages.  The program then mmaps the file &amp; uses get_mempolicy(...,
MPOL_F_NODE) to determine the nodes where the file pages were allocated.
The output is shown below:

	# ./lfile
	 allocated on nodes: 2 4 6 0 1 2 6 0 2

There is a single rotor that is used for allocating both file pages &amp; slab
pages.  Writing the file allocates both a data page &amp; a slab page
(buffer_head).  This advances the RR rotor 2 nodes for each page
allocated.

A quick confirmation seems to confirm this is the cause of the uneven
allocation:

	# echo 0 &gt;/dev/cpuset/memory_spread_slab
	# ./lfile
	 allocated on nodes: 6 7 8 9 0 1 2 3 4 5

This patch introduces a second rotor that is used for slab allocations.

Signed-off-by: Jack Steiner &lt;steiner@sgi.com&gt;
Acked-by: Christoph Lameter &lt;cl@linux-foundation.org&gt;
Cc: Pekka Enberg &lt;penberg@cs.helsinki.fi&gt;
Cc: Paul Menage &lt;menage@google.com&gt;
Cc: Jack Steiner &lt;steiner@sgi.com&gt;
Cc: Robin Holt &lt;holt@sgi.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>cpuset,mm: fix no node to alloc memory when changing cpuset's mems</title>
<updated>2010-05-25T15:06:57Z</updated>
<author>
<name>Miao Xie</name>
<email>miaox@cn.fujitsu.com</email>
</author>
<published>2010-05-24T21:32:08Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=c0ff7453bb5c7c98e0885fb94279f2571946f280'/>
<id>urn:sha1:c0ff7453bb5c7c98e0885fb94279f2571946f280</id>
<content type='text'>
Before applying this patch, cpuset updates task-&gt;mems_allowed and
mempolicy by setting all new bits in the nodemask first, and clearing all
old unallowed bits later.  But in the way, the allocator may find that
there is no node to alloc memory.

The reason is that cpuset rebinds the task's mempolicy, it cleans the
nodes which the allocater can alloc pages on, for example:

(mpol: mempolicy)
	task1			task1's mpol	task2
	alloc page		1
	  alloc on node0? NO	1
				1		change mems from 1 to 0
				1		rebind task1's mpol
				0-1		  set new bits
				0	  	  clear disallowed bits
	  alloc on node1? NO	0
	  ...
	can't alloc page
	  goto oom

This patch fixes this problem by expanding the nodes range first(set newly
allowed bits) and shrink it lazily(clear newly disallowed bits).  So we
use a variable to tell the write-side task that read-side task is reading
nodemask, and the write-side task clears newly disallowed nodes after
read-side task ends the current memory allocation.

[akpm@linux-foundation.org: fix spello]
Signed-off-by: Miao Xie &lt;miaox@cn.fujitsu.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Nick Piggin &lt;npiggin@suse.de&gt;
Cc: Paul Menage &lt;menage@google.com&gt;
Cc: Lee Schermerhorn &lt;lee.schermerhorn@hp.com&gt;
Cc: Hugh Dickins &lt;hugh.dickins@tiscali.co.uk&gt;
Cc: Ravikiran Thirumalai &lt;kiran@scalex86.org&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Christoph Lameter &lt;cl@linux-foundation.org&gt;
Cc: Andi Kleen &lt;andi@firstfloor.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mempolicy: restructure rebinding-mempolicy functions</title>
<updated>2010-05-25T15:06:57Z</updated>
<author>
<name>Miao Xie</name>
<email>miaox@cn.fujitsu.com</email>
</author>
<published>2010-05-24T21:32:07Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=708c1bbc9d0c3e57f40501794d9b0eed29d10fce'/>
<id>urn:sha1:708c1bbc9d0c3e57f40501794d9b0eed29d10fce</id>
<content type='text'>
Nick Piggin reported that the allocator may see an empty nodemask when
changing cpuset's mems[1].  It happens only on the kernel that do not do
atomic nodemask_t stores.  (MAX_NUMNODES &gt; BITS_PER_LONG)

But I found that there is also a problem on the kernel that can do atomic
nodemask_t stores.  The problem is that the allocator can't find a node to
alloc page when changing cpuset's mems though there is a lot of free
memory.  The reason is like this:

(mpol: mempolicy)
	task1			task1's mpol	task2
	alloc page		1
	  alloc on node0? NO	1
				1		change mems from 1 to 0
				1		rebind task1's mpol
				0-1		  set new bits
				0	  	  clear disallowed bits
	  alloc on node1? NO	0
	  ...
	can't alloc page
	  goto oom

I can use the attached program reproduce it by the following step:

# mkdir /dev/cpuset
# mount -t cpuset cpuset /dev/cpuset
# mkdir /dev/cpuset/1
# echo `cat /dev/cpuset/cpus` &gt; /dev/cpuset/1/cpus
# echo `cat /dev/cpuset/mems` &gt; /dev/cpuset/1/mems
# echo $$ &gt; /dev/cpuset/1/tasks
# numactl --membind=`cat /dev/cpuset/mems` ./cpuset_mem_hog &lt;nr_tasks&gt; &amp;
   &lt;nr_tasks&gt; = max(nr_cpus - 1, 1)
# killall -s SIGUSR1 cpuset_mem_hog
# ./change_mems.sh

several hours later, oom will happen though there is a lot of free memory.

This patchset fixes this problem by expanding the nodes range first(set
newly allowed bits) and shrink it lazily(clear newly disallowed bits).  So
we use a variable to tell the write-side task that read-side task is
reading nodemask, and the write-side task clears newly disallowed nodes
after read-side task ends the current memory allocation.

This patch:

In order to fix no node to alloc memory, when we want to update mempolicy
and mems_allowed, we expand the set of nodes first (set all the newly
nodes) and shrink the set of nodes lazily(clean disallowed nodes), But the
mempolicy's rebind functions may breaks the expanding.

So we restructure the mempolicy's rebind functions and split the rebind
work to two steps, just like the update of cpuset's mems: The 1st step:
expand the set of the mempolicy's nodes.  The 2nd step: shrink the set of
the mempolicy's nodes.  It is used when there is no real lock to protect
the mempolicy in the read-side.  Otherwise we can do rebind work at once.

In order to implement it, we define

	enum mpol_rebind_step {
		MPOL_REBIND_ONCE,
		MPOL_REBIND_STEP1,
		MPOL_REBIND_STEP2,
		MPOL_REBIND_NSTEP,
	};

If the mempolicy needn't be updated by two steps, we can pass
MPOL_REBIND_ONCE to the rebind functions.  Or we can pass
MPOL_REBIND_STEP1 to do the first step of the rebind work and pass
MPOL_REBIND_STEP2 to do the second step work.

Besides that, it maybe long time between these two step and we have to
release the lock that protects mempolicy and mems_allowed.  If we hold the
lock once again, we must check whether the current mempolicy is under the
rebinding (the first step has been done) or not, because the task may
alloc a new mempolicy when we don't hold the lock.  So we defined the
following flag to identify it:

#define MPOL_F_REBINDING (1 &lt;&lt; 2)

The new functions will be used in the next patch.

Signed-off-by: Miao Xie &lt;miaox@cn.fujitsu.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Nick Piggin &lt;npiggin@suse.de&gt;
Cc: Paul Menage &lt;menage@google.com&gt;
Cc: Lee Schermerhorn &lt;lee.schermerhorn@hp.com&gt;
Cc: Hugh Dickins &lt;hugh.dickins@tiscali.co.uk&gt;
Cc: Ravikiran Thirumalai &lt;kiran@scalex86.org&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Christoph Lameter &lt;cl@linux-foundation.org&gt;
Cc: Andi Kleen &lt;andi@firstfloor.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>sched: Make select_fallback_rq() cpuset friendly</title>
<updated>2010-04-02T18:12:03Z</updated>
<author>
<name>Oleg Nesterov</name>
<email>oleg@redhat.com</email>
</author>
<published>2010-03-15T09:10:27Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=9084bb8246ea935b98320554229e2f371f7f52fa'/>
<id>urn:sha1:9084bb8246ea935b98320554229e2f371f7f52fa</id>
<content type='text'>
Introduce cpuset_cpus_allowed_fallback() helper to fix the cpuset problems
with select_fallback_rq(). It can be called from any context and can't use
any cpuset locks including task_lock(). It is called when the task doesn't
have online cpus in -&gt;cpus_allowed but ttwu/etc must be able to find a
suitable cpu.

I am not proud of this patch. Everything which needs such a fat comment
can't be good even if correct. But I'd prefer to not change the locking
rules in the code I hardly understand, and in any case I believe this
simple change make the code much more correct compared to deadlocks we
currently have.

Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Signed-off-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
LKML-Reference: &lt;20100315091027.GA9155@redhat.com&gt;
Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
</content>
</entry>
<entry>
<title>sched: Kill the broken and deadlockable cpuset_lock/cpuset_cpus_allowed_locked code</title>
<updated>2010-04-02T18:12:01Z</updated>
<author>
<name>Oleg Nesterov</name>
<email>oleg@redhat.com</email>
</author>
<published>2010-03-15T09:10:03Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=897f0b3c3ff40b443c84e271bef19bd6ae885195'/>
<id>urn:sha1:897f0b3c3ff40b443c84e271bef19bd6ae885195</id>
<content type='text'>
This patch just states the fact the cpusets/cpuhotplug interaction is
broken and removes the deadlockable code which only pretends to work.

- cpuset_lock() doesn't really work. It is needed for
  cpuset_cpus_allowed_locked() but we can't take this lock in
  try_to_wake_up()-&gt;select_fallback_rq() path.

- cpuset_lock() is deadlockable. Suppose that a task T bound to CPU takes
  callback_mutex. If cpu_down(CPU) happens before T drops callback_mutex
  stop_machine() preempts T, then migration_call(CPU_DEAD) tries to take
  cpuset_lock() and hangs forever because CPU is already dead and thus
  T can't be scheduled.

- cpuset_cpus_allowed_locked() is deadlockable too. It takes task_lock()
  which is not irq-safe, but try_to_wake_up() can be called from irq.

Kill them, and change select_fallback_rq() to use cpu_possible_mask, like
we currently do without CONFIG_CPUSETS.

Also, with or without this patch, with or without CONFIG_CPUSETS, the
callers of select_fallback_rq() can race with each other or with
set_cpus_allowed() pathes.

The subsequent patches try to to fix these problems.

Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Signed-off-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
LKML-Reference: &lt;20100315091003.GA9123@redhat.com&gt;
Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
</content>
</entry>
<entry>
<title>cpuset: alloc nodemask_t on the heap rather than the stack</title>
<updated>2010-03-24T23:31:21Z</updated>
<author>
<name>Miao Xie</name>
<email>miaox@cn.fujitsu.com</email>
</author>
<published>2010-03-23T20:35:35Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=53feb29767c29c877f9d47dcfe14211b5b0f7ebd'/>
<id>urn:sha1:53feb29767c29c877f9d47dcfe14211b5b0f7ebd</id>
<content type='text'>
Signed-off-by: Miao Xie &lt;miaox@cn.fujitsu.com&gt;
Acked-by: David Rientjes &lt;rientjes@google.com&gt;
Cc: Nick Piggin &lt;npiggin@suse.de&gt;
Cc: Paul Menage &lt;menage@google.com&gt;
Cc: Li Zefan &lt;lizf@cn.fujitsu.com&gt;
Cc: Ingo Molnar &lt;mingo@elte.hu&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>cpuset: fix the problem that cpuset_mem_spread_node() returns an offline node</title>
<updated>2010-03-24T23:31:21Z</updated>
<author>
<name>Miao Xie</name>
<email>miaox@cn.fujitsu.com</email>
</author>
<published>2010-03-23T20:35:34Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=5ab116c9349ef52d6fbd2e2917a53f13194b048e'/>
<id>urn:sha1:5ab116c9349ef52d6fbd2e2917a53f13194b048e</id>
<content type='text'>
cpuset_mem_spread_node() returns an offline node, and causes an oops.

This patch fixes it by initializing task-&gt;mems_allowed to
node_states[N_HIGH_MEMORY], and updating task-&gt;mems_allowed when doing
memory hotplug.

Signed-off-by: Miao Xie &lt;miaox@cn.fujitsu.com&gt;
Acked-by: David Rientjes &lt;rientjes@google.com&gt;
Reported-by: Nick Piggin &lt;npiggin@suse.de&gt;
Tested-by: Nick Piggin &lt;npiggin@suse.de&gt;
Cc: Paul Menage &lt;menage@google.com&gt;
Cc: Li Zefan &lt;lizf@cn.fujitsu.com&gt;
Cc: Ingo Molnar &lt;mingo@elte.hu&gt;
Cc: &lt;stable@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>sched: Fix balance vs hotplug race</title>
<updated>2009-12-06T20:10:56Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>a.p.zijlstra@chello.nl</email>
</author>
<published>2009-11-25T12:31:39Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=6ad4c18884e864cf4c77f9074d3d1816063f99cd'/>
<id>urn:sha1:6ad4c18884e864cf4c77f9074d3d1816063f99cd</id>
<content type='text'>
Since (e761b77: cpu hotplug, sched: Introduce cpu_active_map and redo
sched domain managment) we have cpu_active_mask which is suppose to rule
scheduler migration and load-balancing, except it never (fully) did.

The particular problem being solved here is a crash in try_to_wake_up()
where select_task_rq() ends up selecting an offline cpu because
select_task_rq_fair() trusts the sched_domain tree to reflect the
current state of affairs, similarly select_task_rq_rt() trusts the
root_domain.

However, the sched_domains are updated from CPU_DEAD, which is after the
cpu is taken offline and after stop_machine is done. Therefore it can
race perfectly well with code assuming the domains are right.

Cure this by building the domains from cpu_active_mask on
CPU_DOWN_PREPARE.

Signed-off-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
LKML-Reference: &lt;new-submission&gt;
Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
</content>
</entry>
<entry>
<title>cpumask: Fix generate_sched_domains() for UP</title>
<updated>2009-12-06T20:08:41Z</updated>
<author>
<name>Geert Uytterhoeven</name>
<email>geert@linux-m68k.org</email>
</author>
<published>2009-12-06T19:41:16Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=e1b8090bdf125f8b2e192149547fead7f302a89c'/>
<id>urn:sha1:e1b8090bdf125f8b2e192149547fead7f302a89c</id>
<content type='text'>
Commit acc3f5d7cabbfd6cec71f0c1f9900621fa2d6ae7 ("cpumask:
Partition_sched_domains takes array of cpumask_var_t") changed
the function signature of generate_sched_domains() for the
CONFIG_SMP=y case, but forgot to update the corresponding
function for the CONFIG_SMP=n case, causing:

  kernel/cpuset.c:2073: warning: passing argument 1 of 'generate_sched_domains' from incompatible pointer type

Signed-off-by: Geert Uytterhoeven &lt;geert@linux-m68k.org&gt;
Cc: Rusty Russell &lt;rusty@rustcorp.com.au&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
LKML-Reference: &lt;alpine.DEB.2.00.0912062038070.5693@ayla.of.borg&gt;
Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
</content>
</entry>
<entry>
<title>cpumask: Partition_sched_domains takes array of cpumask_var_t</title>
<updated>2009-11-04T12:16:40Z</updated>
<author>
<name>Rusty Russell</name>
<email>rusty@rustcorp.com.au</email>
</author>
<published>2009-11-03T04:23:40Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=acc3f5d7cabbfd6cec71f0c1f9900621fa2d6ae7'/>
<id>urn:sha1:acc3f5d7cabbfd6cec71f0c1f9900621fa2d6ae7</id>
<content type='text'>
Currently partition_sched_domains() takes a 'struct cpumask
*doms_new' which is a kmalloc'ed array of cpumask_t.  You can't
have such an array if 'struct cpumask' is undefined, as we plan
for CONFIG_CPUMASK_OFFSTACK=y.

So, we make this an array of cpumask_var_t instead: this is the
same for the CONFIG_CPUMASK_OFFSTACK=n case, but requires
multiple allocations for the CONFIG_CPUMASK_OFFSTACK=y case.
Hence we add alloc_sched_domains() and free_sched_domains()
functions.

Signed-off-by: Rusty Russell &lt;rusty@rustcorp.com.au&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
LKML-Reference: &lt;200911031453.40668.rusty@rustcorp.com.au&gt;
Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
</content>
</entry>
</feed>
