<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/kernel/fork.c, branch v3.12</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v3.12</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v3.12'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2013-09-13T17:55:58Z</updated>
<entry>
<title>Merge git://git.kvack.org/~bcrl/aio-next</title>
<updated>2013-09-13T17:55:58Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2013-09-13T17:55:58Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=9bf12df31f282e845b3dfaac1e5d5376a041da22'/>
<id>urn:sha1:9bf12df31f282e845b3dfaac1e5d5376a041da22</id>
<content type='text'>
Pull aio changes from Ben LaHaise:
 "First off, sorry for this pull request being late in the merge window.
  Al had raised a couple of concerns about 2 items in the series below.
  I addressed the first issue (the race introduced by Gu's use of
  mm_populate()), but he has not provided any further details on how he
  wants to rework the anon_inode.c changes (which were sent out months
  ago but have yet to be commented on).

  The bulk of the changes have been sitting in the -next tree for a few
  months, with all the issues raised being addressed"

* git://git.kvack.org/~bcrl/aio-next: (22 commits)
  aio: rcu_read_lock protection for new rcu_dereference calls
  aio: fix race in ring buffer page lookup introduced by page migration support
  aio: fix rcu sparse warnings introduced by ioctx table lookup patch
  aio: remove unnecessary debugging from aio_free_ring()
  aio: table lookup: verify ctx pointer
  staging/lustre: kiocb-&gt;ki_left is removed
  aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3"
  aio: be defensive to ensure request batching is non-zero instead of BUG_ON()
  aio: convert the ioctx list to table lookup v3
  aio: double aio_max_nr in calculations
  aio: Kill ki_dtor
  aio: Kill ki_users
  aio: Kill unneeded kiocb members
  aio: Kill aio_rw_vect_retry()
  aio: Don't use ctx-&gt;tail unnecessarily
  aio: io_cancel() no longer returns the io_event
  aio: percpu ioctx refcount
  aio: percpu reqs_available
  aio: reqs_active -&gt; reqs_available
  aio: fix build when migration is disabled
  ...
</content>
</entry>
<entry>
<title>mm: mempolicy: turn vma_set_policy() into vma_dup_policy()</title>
<updated>2013-09-11T22:57:00Z</updated>
<author>
<name>Oleg Nesterov</name>
<email>oleg@redhat.com</email>
</author>
<published>2013-09-11T21:20:14Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=ef0855d334e1e4af7c3e0c42146a8479ea14a5ab'/>
<id>urn:sha1:ef0855d334e1e4af7c3e0c42146a8479ea14a5ab</id>
<content type='text'>
Simple cleanup.  Every user of vma_set_policy() does the same work; this
looks a bit annoying imho.  Add the new trivial helper, which does
mpol_dup() + vma_set_policy(), to simplify the callers.

Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Andi Kleen &lt;andi@firstfloor.org&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>fork: unify and tighten up CLONE_NEWUSER/CLONE_NEWPID checks</title>
<updated>2013-09-11T22:56:20Z</updated>
<author>
<name>Oleg Nesterov</name>
<email>oleg@redhat.com</email>
</author>
<published>2013-09-11T21:19:41Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=40a0d32d1eaffe6aac7324ca92604b6b3977eb0e'/>
<id>urn:sha1:40a0d32d1eaffe6aac7324ca92604b6b3977eb0e</id>
<content type='text'>
do_fork() denies CLONE_THREAD | CLONE_PARENT if NEWUSER | NEWPID.

Then later copy_process() denies CLONE_SIGHAND if the new process will
be in a different pid namespace (task_active_pid_ns() doesn't match
current-&gt;nsproxy-&gt;pid_ns).

This looks confusing and inconsistent.  CLONE_NEWPID is very similar to
the case when -&gt;pid_ns was already unshared; we want the same
restrictions, so copy_process() should also nack CLONE_PARENT.

And it would be better to deny CLONE_NEWUSER &amp;&amp; CLONE_SIGHAND as well
just for consistency.

Kill the "CLONE_NEWUSER | CLONE_NEWPID" check in do_fork() and change
copy_process() to do the same check along with the -&gt;pid_ns check we already
have.

Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Acked-by: Andy Lutomirski &lt;luto@amacapital.net&gt;
Cc: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Cc: Colin Walters &lt;walters@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>pidns: kill the unnecessary CLONE_NEWPID in copy_process()</title>
<updated>2013-09-11T22:56:19Z</updated>
<author>
<name>Oleg Nesterov</name>
<email>oleg@redhat.com</email>
</author>
<published>2013-09-11T21:19:40Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=5167246a8ad617df55717c2d901da5e2aedffcfa'/>
<id>urn:sha1:5167246a8ad617df55717c2d901da5e2aedffcfa</id>
<content type='text'>
Commit 8382fcac1b81 ("pidns: Outlaw thread creation after
unshare(CLONE_NEWPID)") nacks CLONE_NEWPID if the forking process
unshared pid_ns.  This is correct but unnecessary: copy_pid_ns() does
the same check.

Remove the CLONE_NEWPID check to clean up the code and prepare for the
next change.

Test-case:

	static int child(void *arg)
	{
		return 0;
	}

	static char stack[16 * 1024];

	int main(void)
	{
		pid_t pid;

		assert(unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0);

		pid = clone(child, stack + sizeof(stack) / 2,
				CLONE_NEWPID | SIGCHLD, NULL);
		assert(pid &lt; 0 &amp;&amp; errno == EINVAL);

		return 0;
	}

clone(CLONE_NEWPID) correctly fails with or without this change.

Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Acked-by: Andy Lutomirski &lt;luto@amacapital.net&gt;
Cc: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Cc: Colin Walters &lt;walters@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>pidns: fix vfork() after unshare(CLONE_NEWPID)</title>
<updated>2013-09-11T22:56:19Z</updated>
<author>
<name>Oleg Nesterov</name>
<email>oleg@redhat.com</email>
</author>
<published>2013-09-11T21:19:38Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=e79f525e99b04390ca4d2366309545a836c03bf1'/>
<id>urn:sha1:e79f525e99b04390ca4d2366309545a836c03bf1</id>
<content type='text'>
Commit 8382fcac1b81 ("pidns: Outlaw thread creation after
unshare(CLONE_NEWPID)") nacks CLONE_VM if the forking process unshared
pid_ns; this obviously breaks vfork:

	int main(void)
	{
		assert(unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0);
		assert(vfork() &gt;= 0);
		_exit(0);
		return 0;
	}

fails without this patch.

Change this check to use CLONE_SIGHAND instead.  This also forbids
CLONE_THREAD automatically, and this is what the comment implies.

We could probably even drop CLONE_SIGHAND and use CLONE_THREAD, but it
would be safer to not do this.  The current check denies CLONE_SIGHAND
implicitly and there is no reason to change this.

Eric said "CLONE_SIGHAND is fine.  CLONE_THREAD would be even better.
Having shared signal handling between two different pid namespaces is
the case that we are fundamentally guarding against."

Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Reported-by: Colin Walters &lt;walters@redhat.com&gt;
Acked-by: Andy Lutomirski &lt;luto@amacapital.net&gt;
Reviewed-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace</title>
<updated>2013-09-07T21:35:32Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2013-09-07T21:35:32Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=c7c4591db64dbd1e504bc4e2806d7ef290a3c81b'/>
<id>urn:sha1:c7c4591db64dbd1e504bc4e2806d7ef290a3c81b</id>
<content type='text'>
Pull namespace changes from Eric Biederman:
 "This is an assorted mishmash of small cleanups, enhancements and bug
  fixes.

  The major theme is user namespace mount restrictions.  nsown_capable
  is killed as it encourages not thinking about details that need to be
  considered.  A very hard to hit pid namespace exiting bug was finally
  tracked and fixed.  A couple of cleanups to the basic namespace
  infrastructure.

  Finally there is an enhancement that makes per user namespace
  capabilities usable as capabilities, and an enhancement that allows
  the per userns root to nice other processes in the user namespace"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  userns:  Kill nsown_capable it makes the wrong thing easy
  capabilities: allow nice if we are privileged
  pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD
  userns: Allow PR_CAPBSET_DROP in a user namespace.
  namespaces: Simplify copy_namespaces so it is clear what is going on.
  pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup
  sysfs: Restrict mounting sysfs
  userns: Better restrictions on when proc and sysfs can be mounted
  vfs: Don't copy mount bind mounts of /proc/&lt;pid&gt;/ns/mnt between namespaces
  kernel/nsproxy.c: Improving a snippet of code.
  proc: Restrict mounting the proc filesystem
  vfs: Lock in place mounts from more privileged users
</content>
</entry>
<entry>
<title>pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD</title>
<updated>2013-08-31T06:44:00Z</updated>
<author>
<name>Eric W. Biederman</name>
<email>ebiederm@xmission.com</email>
</author>
<published>2013-03-05T21:59:48Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=6e556ce209b09528dbf1931cbfd5d323e1345926'/>
<id>urn:sha1:6e556ce209b09528dbf1931cbfd5d323e1345926</id>
<content type='text'>
I goofed when I made unshare(CLONE_NEWPID) only work in a
single-threaded process.  There is no need for that requirement and in
fact I analyzed things right for setns.  The hard requirement
is for tasks that share a VM to all be in the same pid namespace and
we properly prevent that in do_fork.

Just to be certain I took a look through do_wait and
forget_original_parent and there are no cases that make it any harder
for children to be in multiple pid namespaces than it is for
children to be in the same pid namespace.  I also performed a check to
see if there were any uses of task-&gt;nsproxy_pid_ns I was not familiar
with, but it is only used when allocating a new pid for a new task,
and in checks to prevent craziness from happening.

Acked-by: Serge Hallyn &lt;serge.hallyn@canonical.com&gt;
Signed-off-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
</content>
</entry>
<entry>
<title>Rename nsproxy.pid_ns to nsproxy.pid_ns_for_children</title>
<updated>2013-08-27T17:52:52Z</updated>
<author>
<name>Andy Lutomirski</name>
<email>luto@amacapital.net</email>
</author>
<published>2013-08-22T18:39:16Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=c2b1df2eb42978073ec27c99cc199d20ae48b849'/>
<id>urn:sha1:c2b1df2eb42978073ec27c99cc199d20ae48b849</id>
<content type='text'>
nsproxy.pid_ns is *not* the task's pid namespace.  The name should clarify
that.

This makes it more obvious that setns on a pid namespace is weird --
it won't change the pid namespace shown in procfs.

Signed-off-by: Andy Lutomirski &lt;luto@amacapital.net&gt;
Reviewed-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>microblaze: fix clone syscall</title>
<updated>2013-08-14T00:57:48Z</updated>
<author>
<name>Michal Simek</name>
<email>michal.simek@xilinx.com</email>
</author>
<published>2013-08-13T23:00:53Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=dfa9771a7c4784bafd0673bc7abcee3813088b77'/>
<id>urn:sha1:dfa9771a7c4784bafd0673bc7abcee3813088b77</id>
<content type='text'>
Fix inadvertent breakage in the clone syscall ABI for Microblaze that
was introduced in commit f3268edbe6fe ("microblaze: switch to generic
fork/vfork/clone").

The Microblaze syscall ABI for clone takes the parent tid address in the
4th argument; the third argument slot is used for the stack size.  The
incorrectly-used CLONE_BACKWARDS type assigned parent tid to the 3rd
slot.

This commit restores the original ABI so that existing userspace libc
code will work correctly.

All kernel versions from v3.8-rc1 were affected.

Signed-off-by: Michal Simek &lt;michal.simek@xilinx.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>aio: convert the ioctx list to table lookup v3</title>
<updated>2013-07-30T16:56:36Z</updated>
<author>
<name>Benjamin LaHaise</name>
<email>bcrl@kvack.org</email>
</author>
<published>2013-07-30T16:54:40Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=db446a08c23d5475e6b08c87acca79ebb20f283c'/>
<id>urn:sha1:db446a08c23d5475e6b08c87acca79ebb20f283c</id>
<content type='text'>
On Wed, Jun 12, 2013 at 11:14:40AM -0700, Kent Overstreet wrote:
&gt; On Mon, Apr 15, 2013 at 02:40:55PM +0300, Octavian Purdila wrote:
&gt; &gt; When using a large number of threads performing AIO operations the
&gt; &gt; IOCTX list may get a significant number of entries which will cause
&gt; &gt; significant overhead. For example, when running this fio script:
&gt; &gt;
&gt; &gt; rw=randrw; size=256k ;directory=/mnt/fio; ioengine=libaio; iodepth=1
&gt; &gt; blocksize=1024; numjobs=512; thread; loops=100
&gt; &gt;
&gt; &gt; on an EXT2 filesystem mounted on top of a ramdisk we can observe up to
&gt; &gt; 30% CPU time spent by lookup_ioctx:
&gt; &gt;
&gt; &gt;  32.51%  [guest.kernel]  [g] lookup_ioctx
&gt; &gt;   9.19%  [guest.kernel]  [g] __lock_acquire.isra.28
&gt; &gt;   4.40%  [guest.kernel]  [g] lock_release
&gt; &gt;   4.19%  [guest.kernel]  [g] sched_clock_local
&gt; &gt;   3.86%  [guest.kernel]  [g] local_clock
&gt; &gt;   3.68%  [guest.kernel]  [g] native_sched_clock
&gt; &gt;   3.08%  [guest.kernel]  [g] sched_clock_cpu
&gt; &gt;   2.64%  [guest.kernel]  [g] lock_release_holdtime.part.11
&gt; &gt;   2.60%  [guest.kernel]  [g] memcpy
&gt; &gt;   2.33%  [guest.kernel]  [g] lock_acquired
&gt; &gt;   2.25%  [guest.kernel]  [g] lock_acquire
&gt; &gt;   1.84%  [guest.kernel]  [g] do_io_submit
&gt; &gt;
&gt; &gt; This patch converts the ioctx list to a radix tree. For a performance
&gt; &gt; comparison the above FIO script was run on a 2-socket, 8-core
&gt; &gt; machine. These are the results (average and %rsd of 10 runs) for the
&gt; &gt; original list based implementation and for the radix tree based
&gt; &gt; implementation:
&gt; &gt;
&gt; &gt; cores         1         2         4         8         16        32
&gt; &gt; list       109376 ms  69119 ms  35682 ms  22671 ms  19724 ms  16408 ms
&gt; &gt; %rsd         0.69%      1.15%     1.17%     1.21%     1.71%     1.43%
&gt; &gt; radix       73651 ms  41748 ms  23028 ms  16766 ms  15232 ms   13787 ms
&gt; &gt; %rsd         1.19%      0.98%     0.69%     1.13%    0.72%      0.75%
&gt; &gt; % of radix
&gt; &gt; relative    66.12%     65.59%    66.63%    72.31%   77.26%     83.66%
&gt; &gt; to list
&gt; &gt;
&gt; &gt; To consider the impact of the patch on the typical case of having
&gt; &gt; only one ctx per process the following FIO script was run:
&gt; &gt;
&gt; &gt; rw=randrw; size=100m ;directory=/mnt/fio; ioengine=libaio; iodepth=1
&gt; &gt; blocksize=1024; numjobs=1; thread; loops=100
&gt; &gt;
&gt; &gt; on the same system and the results are the following:
&gt; &gt;
&gt; &gt; list        58892 ms
&gt; &gt; %rsd         0.91%
&gt; &gt; radix       59404 ms
&gt; &gt; %rsd         0.81%
&gt; &gt; % of radix
&gt; &gt; relative    100.87%
&gt; &gt; to list
&gt;
&gt; So, I was just doing some benchmarking/profiling to get ready to send
&gt; out the aio patches I've got for 3.11 - and it looks like your patch is
&gt; causing a ~1.5% throughput regression in my testing :/
... &lt;snip&gt;

I've got an alternate approach for fixing this wart in lookup_ioctx()...
Instead of using an rbtree, just use the reserved id in the ring buffer
header to index an array pointing to the ioctx.  It's not finished yet, and
it needs to be tidied up, but is most of the way there.

		-ben
--
"Thought is the essence of where you are now."
--
kmo&gt; And, a rework of Ben's code, but this was entirely his idea
kmo&gt;		-Kent

bcrl&gt; And fix the code to use the right mm_struct in kill_ioctx(), actually
free memory.

Signed-off-by: Benjamin LaHaise &lt;bcrl@kvack.org&gt;
</content>
</entry>
</feed>
