<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/kernel/pid.c, branch v5.7</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v5.7</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v5.7'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2020-04-10T19:59:56Z</updated>
<entry>
<title>Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace</title>
<updated>2020-04-10T19:59:56Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2020-04-10T19:59:56Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=87ad46e601340394cd75c1c79b19ca906f82c543'/>
<id>urn:sha1:87ad46e601340394cd75c1c79b19ca906f82c543</id>
<content type='text'>
Pull proc fix from Eric Biederman:
 "A brown paper bag slipped through my proc changes, and syzcaller
  caught it when the code ended up in your tree.

  I have opted to fix it the simplest cleanest way I know how, so there
  is no reasonable chance for the bug to repeat"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  proc: Use a dedicated lock in struct pid
</content>
</entry>
<entry>
<title>proc: Use a dedicated lock in struct pid</title>
<updated>2020-04-09T17:15:35Z</updated>
<author>
<name>Eric W. Biederman</name>
<email>ebiederm@xmission.com</email>
</author>
<published>2020-04-07T14:43:04Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=63f818f46af9f8b3f17b9695501e8d08959feb60'/>
<id>urn:sha1:63f818f46af9f8b3f17b9695501e8d08959feb60</id>
<content type='text'>
syzbot wrote:
&gt; ========================================================
&gt; WARNING: possible irq lock inversion dependency detected
&gt; 5.6.0-syzkaller #0 Not tainted
&gt; --------------------------------------------------------
&gt; swapper/1/0 just changed the state of lock:
&gt; ffffffff898090d8 (tasklist_lock){.+.?}-{2:2}, at: send_sigurg+0x9f/0x320 fs/fcntl.c:840
&gt; but this lock took another, SOFTIRQ-unsafe lock in the past:
&gt;  (&amp;pid-&gt;wait_pidfd){+.+.}-{2:2}
&gt;
&gt;
&gt; and interrupts could create inverse lock ordering between them.
&gt;
&gt;
&gt; other info that might help us debug this:
&gt;  Possible interrupt unsafe locking scenario:
&gt;
&gt;        CPU0                    CPU1
&gt;        ----                    ----
&gt;   lock(&amp;pid-&gt;wait_pidfd);
&gt;                                local_irq_disable();
&gt;                                lock(tasklist_lock);
&gt;                                lock(&amp;pid-&gt;wait_pidfd);
&gt;   &lt;Interrupt&gt;
&gt;     lock(tasklist_lock);
&gt;
&gt;  *** DEADLOCK ***
&gt;
&gt; 4 locks held by swapper/1/0:

The problem is that because wait_pidfd.lock is taken under the tasklist
lock.  It must always be taken with irqs disabled as tasklist_lock can be
taken from interrupt context and if wait_pidfd.lock was already taken this
would create a lock order inversion.

Oleg suggested just disabling irqs where I have added extra calls to
wait_pidfd.lock.  That should be safe and I think the code will eventually
do that.  It was rightly pointed out by Christian that sharing the
wait_pidfd.lock was a premature optimization.

It is also true that my pre-merge window testing was insufficient.  So
remove the premature optimization and give struct pid a dedicated lock of
it's own for struct pid things.  I have verified that lockdep sees all 3
paths where we take the new pid-&gt;lock and lockdep does not complain.

It is my current day dream that one day pid-&gt;lock can be used to guard the
task lists as well and then the tasklist_lock won't need to be held to
deliver signals.  That will require taking pid-&gt;lock with irqs disabled.

Acked-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Link: https://lore.kernel.org/lkml/00000000000011d66805a25cd73f@google.com/
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Reported-by: syzbot+343f75cdeea091340956@syzkaller.appspotmail.com
Reported-by: syzbot+832aabf700bc3ec920b9@syzkaller.appspotmail.com
Reported-by: syzbot+f675f964019f884dbd0f@syzkaller.appspotmail.com
Reported-by: syzbot+a9fb1457d720a55d6dc5@syzkaller.appspotmail.com
Fixes: 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc")
Signed-off-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace</title>
<updated>2020-04-02T18:22:17Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2020-04-02T18:22:17Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=d987ca1c6b7e22fbd30664111e85cec7aa66000d'/>
<id>urn:sha1:d987ca1c6b7e22fbd30664111e85cec7aa66000d</id>
<content type='text'>
Pull exec/proc updates from Eric Biederman:
 "This contains two significant pieces of work: the work to sort out
  proc_flush_task, and the work to solve a deadlock between strace and
  exec.

  Fixing proc_flush_task so that it no longer requires a persistent
  mount makes improvements to proc possible. The removal of the
  persistent mount solves an old regression that that caused the hidepid
  mount option to only work on remount not on mount. The regression was
  found and reported by the Android folks. This further allows Alexey
  Gladkov's work making proc mount options specific to an individual
  mount of proc to move forward.

  The work on exec starts solving a long standing issue with exec that
  it takes mutexes of blocking userspace applications, which makes exec
  extremely deadlock prone. For the moment this adds a second mutex with
  a narrower scope that handles all of the easy cases. Which makes the
  tricky cases easy to spot. With a little luck the code to solve those
  deadlocks will be ready by next merge window"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (25 commits)
  signal: Extend exec_id to 64bits
  pidfd: Use new infrastructure to fix deadlocks in execve
  perf: Use new infrastructure to fix deadlocks in execve
  proc: io_accounting: Use new infrastructure to fix deadlocks in execve
  proc: Use new infrastructure to fix deadlocks in execve
  kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
  kernel: doc: remove outdated comment cred.c
  mm: docs: Fix a comment in process_vm_rw_core
  selftests/ptrace: add test cases for dead-locks
  exec: Fix a deadlock in strace
  exec: Add exec_update_mutex to replace cred_guard_mutex
  exec: Move exec_mmap right after de_thread in flush_old_exec
  exec: Move cleanup of posix timers on exec out of de_thread
  exec: Factor unshare_sighand out of de_thread and call it separately
  exec: Only compute current once in flush_old_exec
  pid: Improve the comment about waiting in zap_pid_ns_processes
  proc: Remove the now unnecessary internal mount of proc
  uml: Create a private mount of proc for mconsole
  uml: Don't consult current to find the proc_mnt in mconsole_proc
  proc: Use a list of inodes to flush from proc
  ...
</content>
</entry>
<entry>
<title>pidfd: Use new infrastructure to fix deadlocks in execve</title>
<updated>2020-03-25T15:04:01Z</updated>
<author>
<name>Bernd Edlinger</name>
<email>bernd.edlinger@hotmail.de</email>
</author>
<published>2020-03-21T02:46:16Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=501f9328bf5c6b5e4863da4b50e0e86792de3aa9'/>
<id>urn:sha1:501f9328bf5c6b5e4863da4b50e0e86792de3aa9</id>
<content type='text'>
This changes __pidfd_fget to use the new exec_update_mutex
instead of cred_guard_mutex.

This should be safe, as the credentials do not change
before exec_update_mutex is locked.  Therefore whatever
file access is possible with holding the cred_guard_mutex
here is also possbile with the exec_update_mutex.

Signed-off-by: Bernd Edlinger &lt;bernd.edlinger@hotmail.de&gt;
Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
</content>
</entry>
<entry>
<title>pid: make ENOMEM return value more obvious</title>
<updated>2020-03-09T22:40:05Z</updated>
<author>
<name>Christian Brauner</name>
<email>christian.brauner@ubuntu.com</email>
</author>
<published>2020-03-08T13:29:17Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=10dab84caf400f2f5f8b010ebb0c7c4272ec5093'/>
<id>urn:sha1:10dab84caf400f2f5f8b010ebb0c7c4272ec5093</id>
<content type='text'>
The alloc_pid() codepath used to be simpler. With the introducation of the
ability to choose specific pids in 49cb2fc42ce4 ("fork: extend clone3() to
support setting a PID") it got more complex. It hasn't been super obvious
that ENOMEM is returned when the pid namespace init process/child subreaper
of the pid namespace has died. As can be seen from multiple attempts to
improve this see e.g. [1] and most recently [2].
We regressed returning ENOMEM in [3] and [2] restored it. Let's add a
comment on top explaining that this is historic and documented behavior and
cannot easily be changed.

[1]: 35f71bc0a09a ("fork: report pid reservation failure properly")
[2]: b26ebfe12f34 ("pid: Fix error return value in some cases")
[3]: 49cb2fc42ce4 ("fork: extend clone3() to support setting a PID")
Signed-off-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
</content>
</entry>
<entry>
<title>pid: Fix error return value in some cases</title>
<updated>2020-03-08T13:22:58Z</updated>
<author>
<name>Corey Minyard</name>
<email>cminyard@mvista.com</email>
</author>
<published>2020-03-06T17:23:14Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=b26ebfe12f34f372cf041c6f801fa49c3fb382c5'/>
<id>urn:sha1:b26ebfe12f34f372cf041c6f801fa49c3fb382c5</id>
<content type='text'>
Recent changes to alloc_pid() allow the pid number to be specified on
the command line.  If set_tid_size is set, then the code scanning the
levels will hard-set retval to -EPERM, overriding it's previous -ENOMEM
value.

After the code scanning the levels, there are error returns that do not
set retval, assuming it is still set to -ENOMEM.

So set retval back to -ENOMEM after scanning the levels.

Fixes: 49cb2fc42ce4 ("fork: extend clone3() to support setting a PID")
Signed-off-by: Corey Minyard &lt;cminyard@mvista.com&gt;
Acked-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Cc: Andrei Vagin &lt;avagin@gmail.com&gt;
Cc: Dmitry Safonov &lt;0x7f454c46@gmail.com&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Adrian Reber &lt;areber@redhat.com&gt;
Cc: &lt;stable@vger.kernel.org&gt; # 5.5
Link: https://lore.kernel.org/r/20200306172314.12232-1-minyard@acm.org
[christian.brauner@ubuntu.com: fixup commit message]
Signed-off-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
</content>
</entry>
<entry>
<title>proc: Remove the now unnecessary internal mount of proc</title>
<updated>2020-02-28T18:06:14Z</updated>
<author>
<name>Eric W. Biederman</name>
<email>ebiederm@xmission.com</email>
</author>
<published>2020-02-20T14:08:20Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=69879c01a0c3f70e0887cfb4d9ff439814361e46'/>
<id>urn:sha1:69879c01a0c3f70e0887cfb4d9ff439814361e46</id>
<content type='text'>
There remains no more code in the kernel using pids_ns-&gt;proc_mnt,
therefore remove it from the kernel.

The big benefit of this change is that one of the most error prone and
tricky parts of the pid namespace implementation, maintaining kernel
mounts of proc is removed.

In addition removing the unnecessary complexity of the kernel mount
fixes a regression that caused the proc mount options to be ignored.
Now that the initial mount of proc comes from userspace, those mount
options are again honored.  This fixes Android's usage of the proc
hidepid option.

Reported-by: Alistair Strachan &lt;astrachan@google.com&gt;
Fixes: e94591d0d90c ("proc: Convert proc_mount to use mount_ns.")
Signed-off-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
</content>
</entry>
<entry>
<title>proc: Use a list of inodes to flush from proc</title>
<updated>2020-02-24T16:14:44Z</updated>
<author>
<name>Eric W. Biederman</name>
<email>ebiederm@xmission.com</email>
</author>
<published>2020-02-20T00:22:26Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=7bc3e6e55acf065500a24621f3b313e7e5998acf'/>
<id>urn:sha1:7bc3e6e55acf065500a24621f3b313e7e5998acf</id>
<content type='text'>
Rework the flushing of proc to use a list of directory inodes that
need to be flushed.

The list is kept on struct pid not on struct task_struct, as there is
a fixed connection between proc inodes and pids but at least for the
case of de_thread the pid of a task_struct changes.

This removes the dependency on proc_mnt which allows for different
mounts of proc having different mount options even in the same pid
namespace and this allows for the removal of proc_mnt which will
trivially the first mount of proc to honor it's mount options.

This flushing remains an optimization.  The functions
pid_delete_dentry and pid_revalidate ensure that ordinary dcache
management will not attempt to use dentries past the point their
respective task has died.  When unused the shrinker will
eventually be able to remove these dentries.

There is a case in de_thread where proc_flush_pid can be
called early for a given pid.  Which winds up being
safe (if suboptimal) as this is just an optiimization.

Only pid directories are put on the list as the other
per pid files are children of those directories and
d_invalidate on the directory will get them as well.

So that the pid can be used during flushing it's reference count is
taken in release_task and dropped in proc_flush_pid.  Further the call
of proc_flush_pid is moved after the tasklist_lock is released in
release_task so that it is certain that the pid has already been
unhashed when flushing it taking place.  This removes a small race
where a dentry could recreated.

As struct pid is supposed to be small and I need a per pid lock
I reuse the only lock that currently exists in struct pid the
the wait_pidfd.lock.

The net result is that this adds all of this functionality
with just a little extra list management overhead and
a single extra pointer in struct pid.

v2: Initialize pid-&gt;inodes.  I somehow failed to get that
    initialization into the initial version of the patch.  A boot
    failure was reported by "kernel test robot &lt;lkp@intel.com&gt;", and
    failure to initialize that pid-&gt;inodes matches all of the reported
    symptoms.

Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
</content>
</entry>
<entry>
<title>pid: Implement pidfd_getfd syscall</title>
<updated>2020-01-13T20:49:36Z</updated>
<author>
<name>Sargun Dhillon</name>
<email>sargun@sargun.me</email>
</author>
<published>2020-01-07T17:59:25Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=8649c322f75c96e7ced2fec201e123b2b073bf09'/>
<id>urn:sha1:8649c322f75c96e7ced2fec201e123b2b073bf09</id>
<content type='text'>
This syscall allows for the retrieval of file descriptors from other
processes, based on their pidfd. This is possible using ptrace, and
injection of parasitic code to inject code which leverages SCM_RIGHTS
to move file descriptors between a tracee and a tracer. Unfortunately,
ptrace comes with a high cost of requiring the process to be stopped,
and breaks debuggers. This does not require stopping the process under
manipulation.

One reason to use this is to allow sandboxers to take actions on file
descriptors on the behalf of another process. For example, this can be
combined with seccomp-bpf's user notification to do on-demand fd
extraction and take privileged actions. One such privileged action
is binding a socket to a privileged port.

/* prototype */
  /* flags is currently reserved and should be set to 0 */
  int sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);

/* testing */
Ran self-test suite on x86_64

Signed-off-by: Sargun Dhillon &lt;sargun@sargun.me&gt;
Acked-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Reviewed-by: Arnd Bergmann &lt;arnd@arndb.de&gt;
Link: https://lore.kernel.org/r/20200107175927.4558-3-sargun@sargun.me
Signed-off-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
</content>
</entry>
<entry>
<title>fork: extend clone3() to support setting a PID</title>
<updated>2019-11-15T22:49:22Z</updated>
<author>
<name>Adrian Reber</name>
<email>areber@redhat.com</email>
</author>
<published>2019-11-15T12:36:20Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=49cb2fc42ce4b7a656ee605e30c302efaa39c1a7'/>
<id>urn:sha1:49cb2fc42ce4b7a656ee605e30c302efaa39c1a7</id>
<content type='text'>
The main motivation to add set_tid to clone3() is CRIU.

To restore a process with the same PID/TID CRIU currently uses
/proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
ns_last_pid and then (quickly) does a clone(). This works most of the
time, but it is racy. It is also slow as it requires multiple syscalls.

Extending clone3() to support *set_tid makes it possible restore a
process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
race free (as long as the desired PID/TID is available).

This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
on clone3() with *set_tid as they are currently in place for ns_last_pid.

The original version of this change was using a single value for
set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
decided to change set_tid to an array to enable setting the PID of a
process in multiple PID namespaces at the same time. If a process is
created in a PID namespace it is possible to influence the PID inside
and outside of the PID namespace. Details also in the corresponding
selftest.

To create a process with the following PIDs:

      PID NS level         Requested PID
        0 (host)              31496
        1                        42
        2                         1

For that example the two newly introduced parameters to struct
clone_args (set_tid and set_tid_size) would need to be:

  set_tid[0] = 1;
  set_tid[1] = 42;
  set_tid[2] = 31496;
  set_tid_size = 3;

If only the PIDs of the two innermost nested PID namespaces should be
defined it would look like this:

  set_tid[0] = 1;
  set_tid[1] = 42;
  set_tid_size = 2;

The PID of the newly created process would then be the next available
free PID in the PID namespace level 0 (host) and 42 in the PID namespace
at level 1 and the PID of the process in the innermost PID namespace
would be 1.

The set_tid array is used to specify the PID of a process starting
from the innermost nested PID namespaces up to set_tid_size PID namespaces.

set_tid_size cannot be larger then the current PID namespace level.

Signed-off-by: Adrian Reber &lt;areber@redhat.com&gt;
Reviewed-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Reviewed-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Reviewed-by: Dmitry Safonov &lt;0x7f454c46@gmail.com&gt;
Acked-by: Andrei Vagin &lt;avagin@gmail.com&gt;
Link: https://lore.kernel.org/r/20191115123621.142252-1-areber@redhat.com
Signed-off-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
</content>
</entry>
</feed>
