<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/fs/exec.c, branch v5.11</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v5.11</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v5.11'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2020-12-16T20:10:40Z</updated>
<entry>
<title>Merge branch 'parisc-5.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux</title>
<updated>2020-12-16T20:10:40Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2020-12-16T20:10:40Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=5ee863bec794f30bdf7fdf57ce0d9f579b0d1aa3'/>
<id>urn:sha1:5ee863bec794f30bdf7fdf57ce0d9f579b0d1aa3</id>
<content type='text'>
Pull parisc updates from Helge Deller:
 "A change to increase the default maximum stack size on parisc to 100MB
  and the ability to further increase the stack hard limit size at
  runtime with ulimit for newly started processes.

  The other patches fix compile warnings, utilize the Kbuild logic and
  cleanups the parisc arch code"

* 'parisc-5.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: pci-dma: fix warning unused-function
  parisc/uapi: Use Kbuild logic to provide &lt;asm/types.h&gt;
  parisc: Make user stack size configurable
  parisc: Use _TIF_USER_WORK_MASK in entry.S
  parisc: Drop loops_per_jiffy from per_cpu struct
</content>
</entry>
<entry>
<title>Merge branch 'exec-update-lock-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace</title>
<updated>2020-12-16T03:36:48Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2020-12-16T03:36:48Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=d01e7f10dae29eba0f9ada82b65d24e035d5b2f9'/>
<id>urn:sha1:d01e7f10dae29eba0f9ada82b65d24e035d5b2f9</id>
<content type='text'>
Pull exec-update-lock update from Eric Biederman:
 "The key point of this is to transform exec_update_mutex into a
  rw_semaphore so readers can be separated from writers.

  This makes it easier to understand what the holders of the lock are
  doing, and makes it harder to contend or deadlock on the lock.

  The real deadlock fix wound up in perf_event_open"

* 'exec-update-lock-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  exec: Transform exec_update_mutex into a rw_semaphore
</content>
</entry>
<entry>
<title>Merge branch 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace</title>
<updated>2020-12-16T03:29:43Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2020-12-16T03:29:43Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=faf145d6f3f3d6f2c066f65602ba9d0a03106915'/>
<id>urn:sha1:faf145d6f3f3d6f2c066f65602ba9d0a03106915</id>
<content type='text'>
Pull execve updates from Eric Biederman:
 "This set of changes ultimately fixes the interaction of posix file
  lock and exec. Fundamentally most of the change is just moving where
  unshare_files is called during exec, and tweaking the users of
  files_struct so that the count of files_struct is not unnecessarily
  played with.

  Along the way fcheck and related helpers were renamed to more
  accurately reflect what they do.

  There were also many other small changes that fell out, as this is the
  first time in a long time much of this code has been touched.

  Benchmarks haven't turned up any practical issues but Al Viro has
  observed a possibility for a lot of pounding on task_lock. So I have
  some changes in progress to convert put_files_struct to always rcu
  free files_struct. That wasn't ready for the merge window so that will
  have to wait until next time"

* 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
  exec: Move io_uring_task_cancel after the point of no return
  coredump: Document coredump code exclusively used by cell spufs
  file: Remove get_files_struct
  file: Rename __close_fd_get_file close_fd_get_file
  file: Replace ksys_close with close_fd
  file: Rename __close_fd to close_fd and remove the files parameter
  file: Merge __alloc_fd into alloc_fd
  file: In f_dupfd read RLIMIT_NOFILE once.
  file: Merge __fd_install into fd_install
  proc/fd: In fdinfo seq_show don't use get_files_struct
  bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcu
  proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu
  file: Implement task_lookup_next_fd_rcu
  kcmp: In get_file_raw_ptr use task_lookup_fd_rcu
  proc/fd: In tid_fd_mode use task_lookup_fd_rcu
  file: Implement task_lookup_fd_rcu
  file: Rename fcheck lookup_fd_rcu
  file: Replace fcheck_files with files_lookup_fd_rcu
  file: Factor files_lookup_fd_locked out of fcheck_files
  file: Rename __fcheck_files to files_lookup_fd_raw
  ...
</content>
</entry>
<entry>
<title>exec: Transform exec_update_mutex into a rw_semaphore</title>
<updated>2020-12-10T19:13:32Z</updated>
<author>
<name>Eric W. Biederman</name>
<email>ebiederm@xmission.com</email>
</author>
<published>2020-12-03T20:12:00Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=f7cfd871ae0c5008d94b6f66834e7845caa93c15'/>
<id>urn:sha1:f7cfd871ae0c5008d94b6f66834e7845caa93c15</id>
<content type='text'>
Recently syzbot reported[0] that there is a deadlock amongst the users
of exec_update_mutex.  The problematic lock ordering found by lockdep
was:

   perf_event_open  (exec_update_mutex -&gt; ovl_i_mutex)
   chown            (ovl_i_mutex       -&gt; sb_writes)
   sendfile         (sb_writes         -&gt; p-&gt;lock)
     by reading from a proc file and writing to overlayfs
   proc_pid_syscall (p-&gt;lock           -&gt; exec_update_mutex)

While looking at possible solutions it occured to me that all of the
users and possible users involved only wanted to state of the given
process to remain the same.  They are all readers.  The only writer is
exec.

There is no reason for readers to block on each other.  So fix
this deadlock by transforming exec_update_mutex into a rw_semaphore
named exec_update_lock that only exec takes for writing.

Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Vasiliy Kulikov &lt;segoon@openwall.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Bernd Edlinger &lt;bernd.edlinger@hotmail.de&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Christopher Yeoh &lt;cyeoh@au1.ibm.com&gt;
Cc: Cyrill Gorcunov &lt;gorcunov@gmail.com&gt;
Cc: Sargun Dhillon &lt;sargun@sargun.me&gt;
Cc: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Cc: Arnd Bergmann &lt;arnd@arndb.de&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: Arnaldo Carvalho de Melo &lt;acme@kernel.org&gt;
Fixes: eea9673250db ("exec: Add exec_update_mutex to replace cred_guard_mutex")
[0] https://lkml.kernel.org/r/00000000000063640c05ade8e3de@google.com
Reported-by: syzbot+db9cdf3dd1f64252c6ef@syzkaller.appspotmail.com
Link: https://lkml.kernel.org/r/87ft4mbqen.fsf@x220.int.ebiederm.org
Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
</content>
</entry>
<entry>
<title>exec: Move io_uring_task_cancel after the point of no return</title>
<updated>2020-12-10T18:57:13Z</updated>
<author>
<name>Eric W. Biederman</name>
<email>ebiederm@xmission.com</email>
</author>
<published>2020-11-30T22:58:46Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=9ee1206dcfb9d56503c0de9f8320f7b29c795867'/>
<id>urn:sha1:9ee1206dcfb9d56503c0de9f8320f7b29c795867</id>
<content type='text'>
Now that unshare_files happens in begin_new_exec after the point of no
return, io_uring_task_cancel can also happen later.

Effectively this means io_uring activities for a task are only canceled
when exec succeeds.

Link: https://lkml.kernel.org/r/878saih2op.fsf@x220.int.ebiederm.org
Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
</content>
</entry>
<entry>
<title>exec: Simplify unshare_files</title>
<updated>2020-12-10T18:39:32Z</updated>
<author>
<name>Eric W. Biederman</name>
<email>ebiederm@xmission.com</email>
</author>
<published>2020-11-20T23:14:19Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=1f702603e7125a390b5cdf5ce00539781cfcc86a'/>
<id>urn:sha1:1f702603e7125a390b5cdf5ce00539781cfcc86a</id>
<content type='text'>
Now that exec no longer needs to return the unshared files to their
previous value there is no reason to return displaced.

Instead when unshare_fd creates a copy of the file table, call
put_files_struct before returning from unshare_files.

Acked-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
v1: https://lkml.kernel.org/r/20200817220425.9389-2-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20201120231441.29911-2-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
</content>
</entry>
<entry>
<title>exec: Move unshare_files to fix posix file locking during exec</title>
<updated>2020-12-10T18:39:24Z</updated>
<author>
<name>Eric W. Biederman</name>
<email>ebiederm@xmission.com</email>
</author>
<published>2020-11-20T23:14:18Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=b6043501289ebf169ae19b810a882d517377302f'/>
<id>urn:sha1:b6043501289ebf169ae19b810a882d517377302f</id>
<content type='text'>
Many moons ago the binfmts were doing some very questionable things
with file descriptors and an unsharing of the file descriptor table
was added to make things better[1][2].  The helper steal_lockss was
added to avoid breaking the userspace programs[3][4][6].

Unfortunately it turned out that steal_locks did not work for network
file systems[5], so it was removed to see if anyone would
complain[7][8].  It was thought at the time that NPTL would not be
affected as the unshare_files happened after the other threads were
killed[8].  Unfortunately because there was an unshare_files in
binfmt_elf.c before the threads were killed this analysis was
incorrect.

This unshare_files in binfmt_elf.c resulted in the unshares_files
happening whenever threads were present.  Which led to unshare_files
being moved to the start of do_execve[9].

Later the problems were rediscovered and the suggested approach was to
readd steal_locks under a different name[10].  I happened to be
reviewing patches and I noticed that this approach was a step
backwards[11].

I proposed simply moving unshare_files[12] and it was pointed
out that moving unshare_files without auditing the code was
also unsafe[13].

There were then several attempts to solve this[14][15][16] and I even
posted this set of changes[17].  Unfortunately because auditing all of
execve is time consuming this change did not make it in at the time.

Well now that I am cleaning up exec I have made the time to read
through all of the binfmts and the only playing with file descriptors
is either the security modules closing them in
security_bprm_committing_creds or is in the generic code in fs/exec.c.
None of it happens before begin_new_exec is called.

So move unshare_files into begin_new_exec, after the point of no
return.  If memory is very very very low and the application calling
exec is sharing file descriptor tables between processes we might fail
past the point of no return.  Which is unfortunate but no different
than any of the other places where we allocate memory after the point
of no return.

This movement allows another process that shares the file table, or
another thread of the same process and that closes files or changes
their close on exec behavior and races with execve to cause some
unexpected things to happen.  There is only one time of check to time
of use race and it is just there so that execve fails instead of
an interpreter failing when it tries to open the file it is supposed
to be interpreting.   Failing later if userspace is being silly is
not a problem.

With this change it the following discription from the removal
of steal_locks[8] finally becomes true.

    Apps using NPTL are not affected, since all other threads are killed before
    execve.

    Apps using LinuxThreads are only affected if they

      - have multiple threads during exec (LinuxThreads doesn't kill other
        threads, the app may do it with pthread_kill_other_threads_np())
      - rely on POSIX locks being inherited across exec

    Both conditions are documented, but not their interaction.

    Apps using clone() natively are affected if they

      - use clone(CLONE_FILES)
      - rely on POSIX locks being inherited across exec

I have investigated some paths to make it possible to solve this
without moving unshare_files but they all look more complicated[18].

Reported-by: Daniel P. Berrangé &lt;berrange@redhat.com&gt;
Reported-by: Jeff Layton &lt;jlayton@redhat.com&gt;
History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
[1] 02cda956de0b ("[PATCH] unshare_files"
[2] 04e9bcb4d106 ("[PATCH] use new unshare_files helper")
[3] 088f5d7244de ("[PATCH] add steal_locks helper")
[4] 02c541ec8ffa ("[PATCH] use new steal_locks helper")
[5] https://lkml.kernel.org/r/E1FLIlF-0007zR-00@dorka.pomaz.szeredi.hu
[6] https://lkml.kernel.org/r/0060321191605.GB15997@sorel.sous-sol.org
[7] https://lkml.kernel.org/r/E1FLwjC-0000kJ-00@dorka.pomaz.szeredi.hu
[8] c89681ed7d0e ("[PATCH] remove steal_locks()")
[9] fd8328be874f ("[PATCH] sanitize handling of shared descriptor tables in failing execve()")
[10] https://lkml.kernel.org/r/20180317142520.30520-1-jlayton@kernel.org
[11] https://lkml.kernel.org/r/87r2nwqk73.fsf@xmission.com
[12] https://lkml.kernel.org/r/87bmfgvg8w.fsf@xmission.com
[13] https://lkml.kernel.org/r/20180322111424.GE30522@ZenIV.linux.org.uk
[14] https://lkml.kernel.org/r/20180827174722.3723-1-jlayton@kernel.org
[15] https://lkml.kernel.org/r/20180830172423.21964-1-jlayton@kernel.org
[16] https://lkml.kernel.org/r/20180914105310.6454-1-jlayton@kernel.org
[17] https://lkml.kernel.org/r/87a7ohs5ow.fsf@xmission.com
[18] https://lkml.kernel.org/r/87pn8c1uj6.fsf_-_@x220.int.ebiederm.org
Acked-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
v1: https://lkml.kernel.org/r/20200817220425.9389-1-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20201120231441.29911-1-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
</content>
</entry>
<entry>
<title>exec: Don't open code get_close_on_exec</title>
<updated>2020-12-10T18:39:00Z</updated>
<author>
<name>Eric W. Biederman</name>
<email>ebiederm@xmission.com</email>
</author>
<published>2020-12-09T21:42:57Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=878f12dbb8f514799d126544d59be4d2675caac3'/>
<id>urn:sha1:878f12dbb8f514799d126544d59be4d2675caac3</id>
<content type='text'>
Al Viro pointed out that using the phrase "close_on_exec(fd,
rcu_dereference_raw(current-&gt;files-&gt;fdt))" instead of wrapping it in
rcu_read_lock(), rcu_read_unlock() is a very questionable
optimization[1].

Once wrapped with rcu_read_lock()/rcu_read_unlock() that phrase
becomes equivalent the helper function get_close_on_exec so
simplify the code and make it more robust by simply using
get_close_on_exec.

[1] https://lkml.kernel.org/r/20201207222214.GA4115853@ZenIV.linux.org.uk
Suggested-by: Al Viro &lt;viro@ftp.linux.org.uk&gt;
Link: https://lkml.kernel.org/r/87k0tqr6zi.fsf_-_@x220.int.ebiederm.org
Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
</content>
</entry>
<entry>
<title>kernel: Implement selective syscall userspace redirection</title>
<updated>2020-12-02T14:07:56Z</updated>
<author>
<name>Gabriel Krisman Bertazi</name>
<email>krisman@collabora.com</email>
</author>
<published>2020-11-27T19:32:34Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=1446e1df9eb183fdf81c3f0715402f1d7595d4cb'/>
<id>urn:sha1:1446e1df9eb183fdf81c3f0715402f1d7595d4cb</id>
<content type='text'>
Introduce a mechanism to quickly disable/enable syscall handling for a
specific process and redirect to userspace via SIGSYS.  This is useful
for processes with parts that require syscall redirection and parts that
don't, but who need to perform this boundary crossing really fast,
without paying the cost of a system call to reconfigure syscall handling
on each boundary transition.  This is particularly important for Windows
games running over Wine.

The proposed interface looks like this:

  prctl(PR_SET_SYSCALL_USER_DISPATCH, &lt;op&gt;, &lt;off&gt;, &lt;length&gt;, [selector])

The range [&lt;offset&gt;,&lt;offset&gt;+&lt;length&gt;) is a part of the process memory
map that is allowed to by-pass the redirection code and dispatch
syscalls directly, such that in fast paths a process doesn't need to
disable the trap nor the kernel has to check the selector.  This is
essential to return from SIGSYS to a blocked area without triggering
another SIGSYS from rt_sigreturn.

selector is an optional pointer to a char-sized userspace memory region
that has a key switch for the mechanism. This key switch is set to
either PR_SYS_DISPATCH_ON, PR_SYS_DISPATCH_OFF to enable and disable the
redirection without calling the kernel.

The feature is meant to be set per-thread and it is disabled on
fork/clone/execv.

Internally, this doesn't add overhead to the syscall hot path, and it
requires very little per-architecture support.  I avoided using seccomp,
even though it duplicates some functionality, due to previous feedback
that maybe it shouldn't mix with seccomp since it is not a security
mechanism.  And obviously, this should never be considered a security
mechanism, since any part of the program can by-pass it by using the
syscall dispatcher.

For the sysinfo benchmark, which measures the overhead added to
executing a native syscall that doesn't require interception, the
overhead using only the direct dispatcher region to issue syscalls is
pretty much irrelevant.  The overhead of using the selector goes around
40ns for a native (unredirected) syscall in my system, and it is (as
expected) dominated by the supervisor-mode user-address access.  In
fact, with SMAP off, the overhead is consistently less than 5ns on my
test box.

Signed-off-by: Gabriel Krisman Bertazi &lt;krisman@collabora.com&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Andy Lutomirski &lt;luto@kernel.org&gt;
Acked-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Kees Cook &lt;keescook@chromium.org&gt;
Link: https://lore.kernel.org/r/20201127193238.821364-4-krisman@collabora.com

</content>
</entry>
<entry>
<title>parisc: Make user stack size configurable</title>
<updated>2020-11-11T13:59:08Z</updated>
<author>
<name>Helge Deller</name>
<email>deller@gmx.de</email>
</author>
<published>2020-11-06T18:41:36Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=22ee3ea588dfc84ccb8cea5ea37051dfed91b9b9'/>
<id>urn:sha1:22ee3ea588dfc84ccb8cea5ea37051dfed91b9b9</id>
<content type='text'>
On parisc we need to initialize the memory layout for the user stack at
process start time to a fixed size, which up until now was limited to
the size as given by CONFIG_MAX_STACK_SIZE_MB at compile time.

This hard limit was too small and showed problems when compiling
ruby2.7, qmlcachegen and some Qt packages.

This patch changes two things:
a) It increases the default maximum stack size to 100MB.
b) Users can modify the stack hard limit size with ulimit and then newly
   forked processes will use the given stack size which can even be bigger
   than the default 100MB.

Reported-by: John David Anglin &lt;dave.anglin@bell.net&gt;
Signed-off-by: Helge Deller &lt;deller@gmx.de&gt;
</content>
</entry>
</feed>
