<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/fs/exec.c, branch v2.6.34</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v2.6.34</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v2.6.34'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2010-05-12T00:33:41Z</updated>
<entry>
<title>revert "procfs: provide stack information for threads" and its fixup commits</title>
<updated>2010-05-12T00:33:41Z</updated>
<author>
<name>Robin Holt</name>
<email>holt@sgi.com</email>
</author>
<published>2010-05-11T21:06:46Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=34441427aab4bdb3069a4ffcda69a99357abcb2e'/>
<id>urn:sha1:34441427aab4bdb3069a4ffcda69a99357abcb2e</id>
<content type='text'>
Originally, commit d899bf7b ("procfs: provide stack information for
threads") attempted to introduce a new feature for showing where the
threadstack was located and how many pages are being utilized by the
stack.

Commit c44972f1 ("procfs: disable per-task stack usage on NOMMU") was
applied to fix the NO_MMU case.

Commit 89240ba0 ("x86, fs: Fix x86 procfs stack information for threads on
64-bit") was applied to fix a bug in ia32 executables being loaded.

Commit 9ebd4eba7 ("procfs: fix /proc/&lt;pid&gt;/stat stack pointer for kernel
threads") was applied to fix a bug which had kernel threads printing a
userland stack address.

Commit 1306d603f ('proc: partially revert "procfs: provide stack
information for threads"') was then applied to revert the stack pages
being used to solve a significant performance regression.

This patch nearly undoes the effect of all these patches.

The reason for reverting these is it provides an unusable value in
field 28.  For x86_64, a fork will result in the task-&gt;stack_start
value being updated to the current user top of stack and not the stack
start address.  This unpredictability of the stack_start value makes
it worthless.  That includes the intended use of showing how much stack
space a thread has.

Other architectures will get different values.  As an example, ia64
gets 0.  The do_fork() and copy_process() functions appear to treat the
stack_start and stack_size parameters as architecture specific.

I only partially reverted c44972f1 ("procfs: disable per-task stack usage
on NOMMU") .  If I had completely reverted it, I would have had to change
mm/Makefile only build pagewalk.o when CONFIG_PROC_PAGE_MONITOR is
configured.  Since I could not test the builds without significant effort,
I decided to not change mm/Makefile.

I only partially reverted 89240ba0 ("x86, fs: Fix x86 procfs stack
information for threads on 64-bit") .  I left the KSTK_ESP() change in
place as that seemed worthwhile.

Signed-off-by: Robin Holt &lt;holt@sgi.com&gt;
Cc: Stefani Seibold &lt;stefani@seibold.net&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Michal Simek &lt;monstr@monstr.eu&gt;
Cc: Ingo Molnar &lt;mingo@elte.hu&gt;
Cc: &lt;stable@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>coredump: suppress uid comparison test if core output files are pipes</title>
<updated>2010-03-06T19:26:46Z</updated>
<author>
<name>Neil Horman</name>
<email>nhorman@tuxdriver.com</email>
</author>
<published>2010-03-05T21:44:16Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=76595f79d76fbe6267a51b3a866a028d150f06d4'/>
<id>urn:sha1:76595f79d76fbe6267a51b3a866a028d150f06d4</id>
<content type='text'>
Modify uid check in do_coredump so as to not apply it in the case of
pipes.

This just got noticed in testing.  The end of do_coredump validates the
uid of the inode for the created file against the uid of the crashing
process to ensure that no one can pre-create a core file with different
ownership and grab the information contained in the core when they
shouldn' tbe able to.  This causes failures when using pipes for a core
dumps if the crashing process is not root, which is the uid of the pipe
when it is created.

The fix is simple.  Since the check for matching uid's isn't relevant for
pipes (a process can't create a pipe that the uermodehelper code will open
anyway), we can just just skip it in the event ispipe is non-zero

Reverts a pipe-affecting change which was accidentally made in

: commit c46f739dd39db3b07ab5deb4e3ec81e1c04a91af
: Author:     Ingo Molnar &lt;mingo@elte.hu&gt;
: AuthorDate: Wed Nov 28 13:59:18 2007 +0100
: Commit:     Linus Torvalds &lt;torvalds@woody.linux-foundation.org&gt;
: CommitDate: Wed Nov 28 10:58:01 2007 -0800
:
:     vfs: coredumping fix

Signed-off-by: Neil Horman &lt;nhorman@tuxdriver.com&gt;
Cc: Andi Kleen &lt;andi@firstfloor.org&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Alan Cox &lt;alan@lxorguk.ukuu.org.uk&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Ingo Molnar &lt;mingo@elte.hu&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>coredump: set -&gt;group_exit_code for other CLONE_VM tasks too</title>
<updated>2010-03-06T19:26:46Z</updated>
<author>
<name>Oleg Nesterov</name>
<email>oleg@redhat.com</email>
</author>
<published>2010-03-05T21:44:14Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=5c99cbf49a6e1a1efd25b11f4604c65c455e1612'/>
<id>urn:sha1:5c99cbf49a6e1a1efd25b11f4604c65c455e1612</id>
<content type='text'>
User visible change.

do_coredump() kills all threads which share the same -&gt;mm but only the
coredumping process gets the proper exit_code.  Other tasks which share
the same -&gt;mm die "silently" and return status == 0 to parent.

This is historical behaviour, not actually a bug.  But I think Frank
Heckenbach rightly dislikes the current behaviour.  Simple test-case:

	#include &lt;stdio.h&gt;
	#include &lt;unistd.h&gt;
	#include &lt;signal.h&gt;
	#include &lt;sys/wait.h&gt;

	int main(void)
	{
		int stat;

		if (!fork()) {
			if (!vfork())
				kill(getpid(), SIGQUIT);
		}

		wait(&amp;stat);
		printf("stat=%x\n", stat);
		return 0;
	}

Before this patch it prints "stat=0" despite the fact the child was killed
by SIGQUIT.  After this patch the output is "stat=3" which obviously makes
more sense.

Even with this patch, only the task which originates the coredumping gets
"|= 0x80" if the core was actually dumped, but at least the coredumping
signal is visible to do_wait/etc.

Reported-by: Frank Heckenbach &lt;f.heckenbach@fh-soft.de&gt;
Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Acked-by: WANG Cong &lt;xiyou.wangcong@gmail.com&gt;
Cc: Roland McGrath &lt;roland@redhat.com&gt;
Cc: Neil Horman &lt;nhorman@tuxdriver.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>coredump: pass mm-&gt;flags as a coredump parameter for consistency</title>
<updated>2010-03-06T19:26:46Z</updated>
<author>
<name>Masami Hiramatsu</name>
<email>mhiramat@redhat.com</email>
</author>
<published>2010-03-05T21:44:12Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=30736a4d43f4af7f1a7836d6a266be17082195c4'/>
<id>urn:sha1:30736a4d43f4af7f1a7836d6a266be17082195c4</id>
<content type='text'>
Pass mm-&gt;flags as a coredump parameter for consistency.

 ---
1787         if (mm-&gt;core_state || !get_dumpable(mm)) {  &lt;- (1)
1788                 up_write(&amp;mm-&gt;mmap_sem);
1789                 put_cred(cred);
1790                 goto fail;
1791         }
1792
[...]
1798         if (get_dumpable(mm) == 2) {    /* Setuid core dump mode */ &lt;-(2)
1799                 flag = O_EXCL;          /* Stop rewrite attacks */
1800                 cred-&gt;fsuid = 0;        /* Dump root private */
1801         }
 ---

Since dumpable bits are not protected by lock, there is a chance to change
these bits between (1) and (2).

To solve this issue, this patch copies mm-&gt;flags to
coredump_params.mm_flags at the beginning of do_coredump() and uses it
instead of get_dumpable() while dumping core.

This copy is also passed to binfmt-&gt;core_dump, since elf*_core_dump() uses
dump_filter bits in mm-&gt;flags.

[akpm@linux-foundation.org: fix merge]
Signed-off-by: Masami Hiramatsu &lt;mhiramat@redhat.com&gt;
Acked-by: Roland McGrath &lt;roland@redhat.com&gt;
Cc: Hidehiro Kawai &lt;hidehiro.kawai.ez@hitachi.com&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Ingo Molnar &lt;mingo@elte.hu&gt;
Reviewed-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>exec: create initial stack independent of PAGE_SIZE</title>
<updated>2010-03-06T19:26:33Z</updated>
<author>
<name>Michael Neuling</name>
<email>mikey@neuling.org</email>
</author>
<published>2010-03-05T21:42:57Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=5ef097dd7ba4eab8b4f0026d85fcef9fe23b821f'/>
<id>urn:sha1:5ef097dd7ba4eab8b4f0026d85fcef9fe23b821f</id>
<content type='text'>
Currently we create the initial stack based on the PAGE_SIZE.  This is
unnecessary.

This creates this initial stack independent of the PAGE_SIZE.

It also bumps up the number of 4k pages allocated from 20 to 32, to
align with 64K page systems.

Signed-off-by: Michael Neuling &lt;mikey@neuling.org&gt;
Cc: Helge Deller &lt;deller@gmx.de&gt;
Reviewed-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Americo Wang &lt;xiyou.wangcong@gmail.com&gt;
Cc: Anton Blanchard &lt;anton@samba.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>fs: use rlimit helpers</title>
<updated>2010-03-06T19:26:29Z</updated>
<author>
<name>Jiri Slaby</name>
<email>jslaby@suse.cz</email>
</author>
<published>2010-03-05T21:42:42Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=d554ed895dc8f293cc712c71f14b101ace82579a'/>
<id>urn:sha1:d554ed895dc8f293cc712c71f14b101ace82579a</id>
<content type='text'>
Make sure compiler won't do weird things with limits.  E.g.  fetching them
twice may return 2 different values after writable limits are implemented.

I.e.  either use rlimit helpers added in commit 3e10e716abf3 ("resource:
add helpers for fetching rlimits") or ACCESS_ONCE if not applicable.

Signed-off-by: Jiri Slaby &lt;jslaby@suse.cz&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: change anon_vma linking to fix multi-process server scalability issue</title>
<updated>2010-03-06T19:26:26Z</updated>
<author>
<name>Rik van Riel</name>
<email>riel@redhat.com</email>
</author>
<published>2010-03-05T21:42:07Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=5beb49305251e5669852ed541e8e2f2f7696c53e'/>
<id>urn:sha1:5beb49305251e5669852ed541e8e2f2f7696c53e</id>
<content type='text'>
The old anon_vma code can lead to scalability issues with heavily forking
workloads.  Specifically, each anon_vma will be shared between the parent
process and all its child processes.

In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes.  However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.

This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock.  This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach in the tens of
thousands.  Real workloads are still a factor 10 less process intensive
than AIM7, but they are catching up.

This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA.  At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated.  The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.

This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
 This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.

The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations.  This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures.  This in
turn means error handling needs to be added to the calling functions.

A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock.  To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag.  This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.

Some test results:

Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
&gt;99% in system time, with every CPU on the same anon_vma lock in the
pageout code.

With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time.  The anon_vma lock contention appears to be resolved.

[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel &lt;riel@redhat.com&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Larry Woodman &lt;lwoodman@redhat.com&gt;
Cc: Lee Schermerhorn &lt;Lee.Schermerhorn@hp.com&gt;
Cc: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Hugh Dickins &lt;hugh.dickins@tiscali.co.uk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: avoid false sharing of mm_counter</title>
<updated>2010-03-06T19:26:24Z</updated>
<author>
<name>KAMEZAWA Hiroyuki</name>
<email>kamezawa.hiroyu@jp.fujitsu.com</email>
</author>
<published>2010-03-05T21:41:40Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=34e55232e59f7b19050267a05ff1226e5cd122a5'/>
<id>urn:sha1:34e55232e59f7b19050267a05ff1226e5cd122a5</id>
<content type='text'>
Considering the nature of per mm stats, it's the shared object among
threads and can be a cache-miss point in the page fault path.

This patch adds per-thread cache for mm_counter.  RSS value will be
counted into a struct in task_struct and synchronized with mm's one at
events.

Now, in this patch, the event is the number of calls to handle_mm_fault.
Per-thread value is added to mm at each 64 calls.

 rough estimation with small benchmark on parallel thread (2threads) shows
 [before]
     4.5 cache-miss/faults
 [after]
     4.0 cache-miss/faults
 Anyway, the most contended object is mmap_sem if the number of threads grows.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Cc: Christoph Lameter &lt;cl@linux-foundation.org&gt;
Cc: Lee Schermerhorn &lt;lee.schermerhorn@hp.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>fs/exec.c: fix initial stack reservation</title>
<updated>2010-02-23T03:50:34Z</updated>
<author>
<name>Michael Neuling</name>
<email>mikey@neuling.org</email>
</author>
<published>2010-02-22T20:44:24Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=a17e18790a8c47113a73139d54a375dc9ccd8f08'/>
<id>urn:sha1:a17e18790a8c47113a73139d54a375dc9ccd8f08</id>
<content type='text'>
803bf5ec259941936262d10ecc84511b76a20921 ("fs/exec.c: restrict initial
stack space expansion to rlimit") attempts to limit the initial stack to
20*PAGE_SIZE.  Unfortunately, in attempting ensure the stack is not
reduced in size, we ended up not changing the stack at all.

This size reduction check is not necessary as the expand_stack call does
this already.

This caused a regression in UML resulting in most guest processes being
killed.

Signed-off-by: Michael Neuling &lt;mikey@neuling.org&gt;
Reviewed-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Acked-by: WANG Cong &lt;xiyou.wangcong@gmail.com&gt;
Cc: Anton Blanchard &lt;anton@samba.org&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: James Morris &lt;jmorris@namei.org&gt;
Cc: Serge Hallyn &lt;serue@us.ibm.com&gt;
Cc: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Cc: Jouni Malinen &lt;j@w1.fi&gt;
Cc: &lt;stable@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>fs/exec.c: restrict initial stack space expansion to rlimit</title>
<updated>2010-02-11T21:59:43Z</updated>
<author>
<name>Michael Neuling</name>
<email>mikey@neuling.org</email>
</author>
<published>2010-02-10T21:56:42Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=803bf5ec259941936262d10ecc84511b76a20921'/>
<id>urn:sha1:803bf5ec259941936262d10ecc84511b76a20921</id>
<content type='text'>
When reserving stack space for a new process, make sure we're not
attempting to expand the stack by more than rlimit allows.

This fixes a bug caused by b6a2fea39318e43fee84fa7b0b90d68bed92d2ba ("mm:
variable length argument support") and unmasked by
fc63cf237078c86214abcb2ee9926d8ad289da9b ("exec: setup_arg_pages() fails
to return errors").

This bug means that when limiting the stack to less the 20*PAGE_SIZE (eg.
80K on 4K pages or 'ulimit -s 79') all processes will be killed before
they start.  This is particularly bad with 64K pages, where a ulimit below
1280K will kill every process.

To test, do:

  'ulimit -s 15; ls'

before and after the patch is applied.  Before it's applied, 'ls' should
be killed.  After the patch is applied, 'ls' should no longer be killed.

A stack limit of 15KB since it's small enough to trigger 20*PAGE_SIZE.
Also 15KB not a multiple of PAGE_SIZE, which is a trickier case to handle
correctly with this code.

4K pages should be fine to test with.

[kosaki.motohiro@jp.fujitsu.com: cleanup]
[akpm@linux-foundation.org: cleanup cleanup]
Signed-off-by: Michael Neuling &lt;mikey@neuling.org&gt;
Signed-off-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Americo Wang &lt;xiyou.wangcong@gmail.com&gt;
Cc: Anton Blanchard &lt;anton@samba.org&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: James Morris &lt;jmorris@namei.org&gt;
Cc: Ingo Molnar &lt;mingo@elte.hu&gt;
Cc: Serge Hallyn &lt;serue@us.ibm.com&gt;
Cc: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Cc: &lt;stable@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
