<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/fs/exec.c, branch v4.0</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v4.0</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v4.0'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2015-01-23T05:22:20Z</updated>
<entry>
<title>fs: create proper filename objects using getname_kernel()</title>
<updated>2015-01-23T05:22:20Z</updated>
<author>
<name>Paul Moore</name>
<email>pmoore@redhat.com</email>
</author>
<published>2015-01-22T05:00:03Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=5168910413830435fa3f0a593933a83721ec8bad'/>
<id>urn:sha1:5168910413830435fa3f0a593933a83721ec8bad</id>
<content type='text'>
There are several areas in the kernel that create temporary filename
objects using the following pattern:

	int func(const char *name)
	{
		struct filename file = { .name = name };
		...
		return 0;
	}

... which for the most part works okay, but it causes havoc within the
audit subsystem as the filename object does not persist beyond the
lifetime of the function.  This patch converts all of these temporary
filename objects into proper filename objects using getname_kernel()
and putname() which ensure that the filename object persists until the
audit subsystem is finished with it.

Also, a special thanks to Al Viro, Guenter Roeck, and Sabrina Dubroca
for helping resolve a difficult kernel panic on boot related to a
use-after-free problem in kern_path_create(); the thread can be seen
at the link below:

 * https://lkml.org/lkml/2015/1/20/710

This patch includes code that was either based on, or directly written
by Al in the above thread.

CC: viro@zeniv.linux.org.uk
CC: linux@roeck-us.net
CC: sd@queasysnail.net
CC: linux-fsdevel@vger.kernel.org
Signed-off-by: Paul Moore &lt;pmoore@redhat.com&gt;
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
</entry>
<entry>
<title>syscalls: implement execveat() system call</title>
<updated>2014-12-13T20:42:51Z</updated>
<author>
<name>David Drysdale</name>
<email>drysdale@google.com</email>
</author>
<published>2014-12-13T00:57:29Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=51f39a1f0cea1cacf8c787f652f26dfee9611874'/>
<id>urn:sha1:51f39a1f0cea1cacf8c787f652f26dfee9611874</id>
<content type='text'>
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).

The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts).  The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.

Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.

Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns.  The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).

Related history:
 - https://lkml.org/lkml/2006/12/27/123 is an example of someone
   realizing that fexecve() is likely to fail in a chroot environment.
 - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
   documenting the /proc requirement of fexecve(3) in its manpage, to
   "prevent other people from wasting their time".
 - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
   problem where a process that did setuid() could not fexecve()
   because it no longer had access to /proc/self/fd; this has since
   been fixed.

This patch (of 4):

Add a new execveat(2) system call.  execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.

In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers.  This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/&lt;fd&gt;" (and
so relies on /proc being mounted).

The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/&lt;fd&gt;"
(for an empty filename) or "/dev/fd/&lt;fd&gt;/&lt;filename&gt;", effectively
reflecting how the executable was found.  This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).
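
As a rough illustration of the naming rule just described, here is a
minimal Python sketch.  The helper name exec_path_name() is hypothetical
and is not part of this patch; it only models the two forms mentioned
above and makes no claim about other cases (such as absolute paths).

```python
def exec_path_name(fd, filename):
    """Model the name handed to the executed program, per the text above.

    Hypothetical helper for illustration only; it covers just the two
    forms the commit message describes.
    """
    if filename == "":
        # Empty filename: the file descriptor itself is executed.
        return f"/dev/fd/{fd}"
    # Otherwise the filename is resolved relative to the directory fd.
    return f"/dev/fd/{fd}/{filename}"


print(exec_path_name(3, ""))
print(exec_path_name(3, "bin/tool"))
```

Either form names a path under /dev/fd, which is why a script
interpreter needs that path to be reachable after exec.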

Based on patches by Meredydd Luff.

Signed-off-by: David Drysdale &lt;drysdale@google.com&gt;
Cc: Meredydd Luff &lt;meredydd@senatehouse.org&gt;
Cc: Shuah Khan &lt;shuah.kh@samsung.com&gt;
Cc: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Cc: Andy Lutomirski &lt;luto@amacapital.net&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: "H. Peter Anvin" &lt;hpa@zytor.com&gt;
Cc: Kees Cook &lt;keescook@chromium.org&gt;
Cc: Arnd Bergmann &lt;arnd@arndb.de&gt;
Cc: Rich Felker &lt;dalias@aerifal.cx&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Michael Kerrisk &lt;mtk.manpages@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>fs: Do not include mpx.h in exec.c</title>
<updated>2014-11-18T01:01:40Z</updated>
<author>
<name>Dave Hansen</name>
<email>dave.hansen@linux.intel.com</email>
</author>
<published>2014-11-18T00:36:08Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=abe1e395f6171cb2d07330c690fe0285f7f859e6'/>
<id>urn:sha1:abe1e395f6171cb2d07330c690fe0285f7f859e6</id>
<content type='text'>
We no longer need mpx.h in exec.c, and including it breaks the build
on non-x86 architectures.  We get the MPX includes that we need from
mmu_context.h now.

Signed-off-by: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Cc: Dave Hansen &lt;dave@sr71.net&gt;
Link: http://lkml.kernel.org/r/20141118003608.837015B3@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
</content>
</entry>
<entry>
<title>x86, mpx: On-demand kernel allocation of bounds tables</title>
<updated>2014-11-17T23:58:53Z</updated>
<author>
<name>Dave Hansen</name>
<email>dave.hansen@linux.intel.com</email>
</author>
<published>2014-11-14T15:18:29Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=fe3d197f84319d3bce379a9c0dc17b1f48ad358c'/>
<id>urn:sha1:fe3d197f84319d3bce379a9c0dc17b1f48ad358c</id>
<content type='text'>
This is really the meat of the MPX patch set.  If there is one patch to
review in the entire series, this is the one.  There is a new ABI here
and this kernel code also interacts with userspace memory in a
relatively unusual manner.  (small FAQ below).

Long Description:

This patch adds two prctl() commands to enable or disable the
management of bounds tables in the kernel, including on-demand kernel
allocation (See the patch "on-demand kernel allocation of bounds tables")
and cleanup (See the patch "cleanup unused bound tables"). Applications
do not strictly need the kernel to manage bounds tables and we expect
some applications to use MPX without taking advantage of this kernel
support. This means the kernel can not simply infer whether an application
needs bounds table management from the MPX registers.  The prctl() is an
explicit signal from userspace.

PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to
request the kernel's help in managing bounds tables.

PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace doesn't
want the kernel's help any more. With PR_MPX_DISABLE_MANAGEMENT, the kernel
won't allocate and free bounds tables even if the CPU supports MPX.

PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds
directory out of a userspace register (bndcfgu) and then cache it into
a new field (-&gt;bd_addr) in the 'mm_struct'.  PR_MPX_DISABLE_MANAGEMENT
will set "bd_addr" to an invalid address.  Using this scheme, we can
use "bd_addr" to determine whether the management of bounds tables in
kernel is enabled.

Also, the only way to access that bndcfgu register is via an xsaves,
which can be expensive.  Caching "bd_addr" like this also helps reduce
the cost of those xsaves when doing table cleanup at munmap() time.
Unfortunately, we can not apply this optimization to #BR fault time
because we need an xsave to get the value of BNDSTATUS.

==== Why does the hardware even have these Bounds Tables? ====

MPX only has 4 hardware registers for storing bounds information.
If MPX-enabled code needs more than these 4 registers, it needs to
spill them somewhere. It has two special instructions for this
which allow the bounds to be moved between the bounds registers
and some new "bounds tables".

#BR exceptions are conceptually similar to a page fault and are raised
by the MPX hardware both on bounds violations and when the tables are
not present. This patch handles the #BR exceptions for not-present
tables by carving the space out of the normal process's address space
(essentially calling the new mmap() interface introduced earlier in
this patch set) and then pointing the bounds-directory over to it.

The tables *need* to be accessed and controlled by userspace because
the instructions for moving bounds in and out of them are extremely
frequent. They potentially happen every time a register pointing to
memory is dereferenced. Any direct kernel involvement (like a syscall)
to access the tables would obviously destroy performance.

==== Why not do this in userspace? ====

This patch is obviously doing this allocation in the kernel.
However, MPX does not strictly *require* anything in the kernel.
It can theoretically be done completely from userspace. Here are
a few ways this *could* be done. I don't think any of them are
practical in the real world, but here they are.

Q: Can virtual space simply be reserved for the bounds tables so
   that we never have to allocate them?
A: As noted earlier, these tables are *HUGE*. An X-GB virtual
   area needs 4*X GB of virtual space, plus 2GB for the bounds
   directory. If we were to preallocate them for the 128TB of
   user virtual address space, we would need to reserve 512TB+2GB,
   which is larger than the entire virtual address space today.
   This means they can not be reserved ahead of time. Also, a
   single process's pre-populated bounds directory consumes 2GB
   of virtual *AND* physical memory. IOW, it's completely
   infeasible to prepopulate bounds directories.

Q: Can we preallocate bounds table space at the same time memory
   is allocated which might contain pointers that might eventually
   need bounds tables?
A: This would work if we could hook the site of each and every
   memory allocation syscall. This can be done for small,
   constrained applications. But, it isn't practical at a larger
   scale since a given app has no way of controlling how all the
   parts of the app might allocate memory (think libraries). The
   kernel is really the only place to intercept these calls.

Q: Could a bounds fault be handed to userspace and the tables
   allocated there in a signal handler instead of in the kernel?
A: (thanks to tglx) mmap() is not on the list of safe async
   handler functions and even if mmap() would work it still
   requires locking or nasty tricks to keep track of the
   allocation state there.

Having ruled out all of the userspace-only approaches for managing
bounds tables that we could think of, we create them on demand in
the kernel.
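
The reservation arithmetic from the first answer above can be checked
with a short Python sketch.  The function name bounds_reservation() is
made up for illustration; the 4x factor, the 2GB directory size and the
128TB user address space are the figures quoted in the FAQ.

```python
GB = 1024 ** 3
TB = 1024 * GB


def bounds_reservation(user_va_bytes):
    """Virtual space needed to pre-reserve bounds tables for an area.

    Uses the figures quoted above: tables take 4x the covered area,
    plus a fixed 2GB bounds directory.
    """
    tables = 4 * user_va_bytes
    directory = 2 * GB
    return tables + directory


total = bounds_reservation(128 * TB)
print(total // TB, "TB +", (total % TB) // GB, "GB")
```

The result is the 512TB+2GB figure cited above, which exceeds the 128TB
of user virtual address space it would have to fit into.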

Based-on-patch-by: Qiaowei Ren &lt;qiaowei.ren@intel.com&gt;
Signed-off-by: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Cc: linux-mm@kvack.org
Cc: linux-mips@linux-mips.org
Cc: Dave Hansen &lt;dave@sr71.net&gt;
Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
</content>
</entry>
<entry>
<title>handle suicide on late failure exits in execve() in search_binary_handler()</title>
<updated>2014-10-09T06:39:00Z</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2014-05-05T00:11:36Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=19d860a140beac48a1377f179e693abe86a9dac9'/>
<id>urn:sha1:19d860a140beac48a1377f179e693abe86a9dac9</id>
<content type='text'>
... rather than doing that in the guts of -&gt;load_binary().
[updated to fix the bug spotted by Shentino - for SIGSEGV we really need
something stronger than send_sig_info(); again, better do that in one place]

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
</entry>
<entry>
<title>fork/exec: cleanup mm initialization</title>
<updated>2014-08-08T22:57:23Z</updated>
<author>
<name>Vladimir Davydov</name>
<email>vdavydov@parallels.com</email>
</author>
<published>2014-08-08T21:21:56Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=41f727fde1fe40efeb4fef6fdce74ff794be5aeb'/>
<id>urn:sha1:41f727fde1fe40efeb4fef6fdce74ff794be5aeb</id>
<content type='text'>
mm initialization on fork/exec is spread all over the place, which makes
the code look inconsistent.

We have mm_init(), which is supposed to init/nullify mm's internals, but
it doesn't init all the fields it should:

 - on fork -&gt;mmap,mm_rb,vmacache_seqnum,map_count,mm_cpumask,locked_vm
   are zeroed in dup_mmap();

 - on fork -&gt;pmd_huge_pte is zeroed in dup_mm(), immediately before
   calling mm_init();

 - -&gt;cpu_vm_mask_var ptr is initialized by mm_init_cpumask(), which is
   called before mm_init() on both fork and exec;

 - -&gt;context is initialized by init_new_context(), which is called after
   mm_init() on both fork and exec;

Let's consolidate all the initializations in mm_init() to make the code
look cleaner.

Signed-off-by: Vladimir Davydov &lt;vdavydov@parallels.com&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Christoph Lameter &lt;cl@linux.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>seccomp: implement SECCOMP_FILTER_FLAG_TSYNC</title>
<updated>2014-07-18T19:13:40Z</updated>
<author>
<name>Kees Cook</name>
<email>keescook@chromium.org</email>
</author>
<published>2014-06-05T07:23:17Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=c2e1f2e30daa551db3c670c0ccfeab20a540b9e1'/>
<id>urn:sha1:c2e1f2e30daa551db3c670c0ccfeab20a540b9e1</id>
<content type='text'>
Applying restrictive seccomp filter programs to large or diverse
codebases often requires handling threads which may be started early in
the process lifetime (e.g., by code that is linked in). While it is
possible to apply permissive programs prior to process start up, it is
difficult to further restrict the kernel ABI to those threads after that
point.

This change adds a new seccomp syscall flag to SECCOMP_SET_MODE_FILTER for
synchronizing thread group seccomp filters at filter installation time.

When calling seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_TSYNC,
filter) an attempt will be made to synchronize all threads in current's
threadgroup to its new seccomp filter program. This is possible iff all
threads are using a filter that is an ancestor to the filter current is
attempting to synchronize to. NULL filters (where the task is running as
SECCOMP_MODE_NONE) are also treated as ancestors allowing threads to be
transitioned into SECCOMP_MODE_FILTER. If prctl(PR_SET_NO_NEW_PRIVS,
...) has been set on the calling thread, no_new_privs will be set for
all synchronized threads too. On success, 0 is returned. On failure,
the pid of one of the failing threads will be returned and no filters
will have been applied.
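
The ancestor rule can be sketched in Python as follows.  This is only a
model of the condition described above, not the kernel code: the Filter
class, is_ancestor() and can_sync() are illustrative names, filters are
assumed to form a singly linked chain through a "prev" reference, and a
filter is assumed to count as its own ancestor.

```python
class Filter:
    """Toy stand-in for a seccomp filter; prev is the older filter."""

    def __init__(self, prev=None):
        self.prev = prev


def is_ancestor(parent, child):
    """True if parent is None (no filter) or appears in child's chain."""
    if parent is None:
        # SECCOMP_MODE_NONE: treated as an ancestor of everything.
        return True
    while child is not None:
        if child is parent:
            return True
        child = child.prev
    return False


def can_sync(thread_filters, target):
    """All threads must currently use an ancestor of the target filter."""
    return all(is_ancestor(f, target) for f in thread_filters)


base = Filter()
target = Filter(prev=base)
unrelated = Filter()

print(can_sync([None, base, target], target))
print(can_sync([base, unrelated], target))
```

In the second call one thread uses a filter outside the target's chain,
so synchronization would be refused and no filters would be applied.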

The race conditions against another thread are:
- requesting TSYNC (already handled by sighand lock)
- performing a clone (already handled by sighand lock)
- changing its filter (already handled by sighand lock)
- calling exec (handled by cred_guard_mutex)
The clone case is assisted by the fact that new threads will have their
seccomp state duplicated from their parent before appearing on the tasklist.

Holding cred_guard_mutex means that seccomp filters cannot be assigned
while in the middle of another thread's exec (potentially bypassing
no_new_privs or similar). The call to de_thread() may kill threads waiting
for the mutex.

Changes across threads to the filter pointer include a barrier.

Based on patches by Will Drewry.

Suggested-by: Julien Tinnes &lt;jln@chromium.org&gt;
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
Reviewed-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Reviewed-by: Andy Lutomirski &lt;luto@amacapital.net&gt;
</content>
</entry>
<entry>
<title>sched: move no_new_privs into new atomic flags</title>
<updated>2014-07-18T19:13:38Z</updated>
<author>
<name>Kees Cook</name>
<email>keescook@chromium.org</email>
</author>
<published>2014-05-21T22:23:46Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=1d4457f99928a968767f6405b4a1f50845aa15fd'/>
<id>urn:sha1:1d4457f99928a968767f6405b4a1f50845aa15fd</id>
<content type='text'>
Since seccomp transitions between threads require updates to the
no_new_privs flag to be atomic, the flag must be part of an atomic flag
set. This moves the nnp flag into a separate task field, and introduces
accessors.

Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
Reviewed-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Reviewed-by: Andy Lutomirski &lt;luto@amacapital.net&gt;
</content>
</entry>
<entry>
<title>perf: Differentiate exec() and non-exec() comm events</title>
<updated>2014-06-06T05:56:22Z</updated>
<author>
<name>Adrian Hunter</name>
<email>adrian.hunter@intel.com</email>
</author>
<published>2014-05-28T08:45:04Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=82b897782d10fcc4930c9d4a15b175348fdd2871'/>
<id>urn:sha1:82b897782d10fcc4930c9d4a15b175348fdd2871</id>
<content type='text'>
perf tools like 'perf report' can aggregate samples by comm strings,
which generally works.  However, there are other potential use-cases.
For example, to pair up 'calls' with 'returns' accurately (from branch
events like Intel BTS) it is necessary to identify whether the process
has exec'd.  Although a comm event is generated when an 'exec' happens
it is also generated whenever the comm string is changed on a whim
(e.g. by prctl PR_SET_NAME).  This patch adds a flag to the comm event
to differentiate one case from the other.

In order to determine whether the kernel supports the new flag, a
selection bit named 'exec' is added to struct perf_event_attr.  The
bit does nothing but will cause perf_event_open() to fail if the bit
is set on kernels that do not have it defined.

Signed-off-by: Adrian Hunter &lt;adrian.hunter@intel.com&gt;
Signed-off-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Link: http://lkml.kernel.org/r/537D9EBE.7030806@intel.com
Cc: Paul Mackerras &lt;paulus@samba.org&gt;
Cc: Dave Jones &lt;davej@redhat.com&gt;
Cc: Arnaldo Carvalho de Melo &lt;acme@kernel.org&gt;
Cc: David Ahern &lt;dsahern@gmail.com&gt;
Cc: Jiri Olsa &lt;jolsa@redhat.com&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>perf: Fix perf_event_comm() vs. exec() assumption</title>
<updated>2014-06-06T05:54:02Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2014-05-21T15:32:19Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=e041e328c4b41e1db79bfe5ba9992c2ed771ad19'/>
<id>urn:sha1:e041e328c4b41e1db79bfe5ba9992c2ed771ad19</id>
<content type='text'>
perf_event_comm() assumes that set_task_comm() is only called on
exec(), and in particular that it's only called on current.

Neither is true, as Dave reported a WARN triggered by set_task_comm()
being called on !current.

Separate the exec() hook from the comm hook.

Reported-by: Dave Jones &lt;davej@redhat.com&gt;
Signed-off-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Arnaldo Carvalho de Melo &lt;acme@kernel.org&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20140521153219.GH5226@laptop.programming.kicks-ass.net
[ Build fix. ]
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
</feed>
