<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/kernel/pid.c, branch v5.6</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v5.6</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v5.6'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2020-03-09T22:40:05Z</updated>
<entry>
<title>pid: make ENOMEM return value more obvious</title>
<updated>2020-03-09T22:40:05Z</updated>
<author>
<name>Christian Brauner</name>
<email>christian.brauner@ubuntu.com</email>
</author>
<published>2020-03-08T13:29:17Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=10dab84caf400f2f5f8b010ebb0c7c4272ec5093'/>
<id>urn:sha1:10dab84caf400f2f5f8b010ebb0c7c4272ec5093</id>
<content type='text'>
The alloc_pid() codepath used to be simpler. With the introducation of the
ability to choose specific pids in 49cb2fc42ce4 ("fork: extend clone3() to
support setting a PID") it got more complex. It hasn't been super obvious
that ENOMEM is returned when the pid namespace init process/child subreaper
of the pid namespace has died. As can be seen from multiple attempts to
improve this see e.g. [1] and most recently [2].
We regressed returning ENOMEM in [3] and [2] restored it. Let's add a
comment on top explaining that this is historic and documented behavior and
cannot easily be changed.

[1]: 35f71bc0a09a ("fork: report pid reservation failure properly")
[2]: b26ebfe12f34 ("pid: Fix error return value in some cases")
[3]: 49cb2fc42ce4 ("fork: extend clone3() to support setting a PID")
Signed-off-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
</content>
</entry>
<entry>
<title>pid: Fix error return value in some cases</title>
<updated>2020-03-08T13:22:58Z</updated>
<author>
<name>Corey Minyard</name>
<email>cminyard@mvista.com</email>
</author>
<published>2020-03-06T17:23:14Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=b26ebfe12f34f372cf041c6f801fa49c3fb382c5'/>
<id>urn:sha1:b26ebfe12f34f372cf041c6f801fa49c3fb382c5</id>
<content type='text'>
Recent changes to alloc_pid() allow the pid number to be specified on
the command line.  If set_tid_size is set, then the code scanning the
levels will hard-set retval to -EPERM, overriding it's previous -ENOMEM
value.

After the code scanning the levels, there are error returns that do not
set retval, assuming it is still set to -ENOMEM.

So set retval back to -ENOMEM after scanning the levels.

Fixes: 49cb2fc42ce4 ("fork: extend clone3() to support setting a PID")
Signed-off-by: Corey Minyard &lt;cminyard@mvista.com&gt;
Acked-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Cc: Andrei Vagin &lt;avagin@gmail.com&gt;
Cc: Dmitry Safonov &lt;0x7f454c46@gmail.com&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Adrian Reber &lt;areber@redhat.com&gt;
Cc: &lt;stable@vger.kernel.org&gt; # 5.5
Link: https://lore.kernel.org/r/20200306172314.12232-1-minyard@acm.org
[christian.brauner@ubuntu.com: fixup commit message]
Signed-off-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
</content>
</entry>
<entry>
<title>pid: Implement pidfd_getfd syscall</title>
<updated>2020-01-13T20:49:36Z</updated>
<author>
<name>Sargun Dhillon</name>
<email>sargun@sargun.me</email>
</author>
<published>2020-01-07T17:59:25Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=8649c322f75c96e7ced2fec201e123b2b073bf09'/>
<id>urn:sha1:8649c322f75c96e7ced2fec201e123b2b073bf09</id>
<content type='text'>
This syscall allows for the retrieval of file descriptors from other
processes, based on their pidfd. This is possible using ptrace, and
injection of parasitic code to inject code which leverages SCM_RIGHTS
to move file descriptors between a tracee and a tracer. Unfortunately,
ptrace comes with a high cost of requiring the process to be stopped,
and breaks debuggers. This does not require stopping the process under
manipulation.

One reason to use this is to allow sandboxers to take actions on file
descriptors on the behalf of another process. For example, this can be
combined with seccomp-bpf's user notification to do on-demand fd
extraction and take privileged actions. One such privileged action
is binding a socket to a privileged port.

/* prototype */
  /* flags is currently reserved and should be set to 0 */
  int sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);

/* testing */
Ran self-test suite on x86_64

Signed-off-by: Sargun Dhillon &lt;sargun@sargun.me&gt;
Acked-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Reviewed-by: Arnd Bergmann &lt;arnd@arndb.de&gt;
Link: https://lore.kernel.org/r/20200107175927.4558-3-sargun@sargun.me
Signed-off-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
</content>
</entry>
<entry>
<title>fork: extend clone3() to support setting a PID</title>
<updated>2019-11-15T22:49:22Z</updated>
<author>
<name>Adrian Reber</name>
<email>areber@redhat.com</email>
</author>
<published>2019-11-15T12:36:20Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=49cb2fc42ce4b7a656ee605e30c302efaa39c1a7'/>
<id>urn:sha1:49cb2fc42ce4b7a656ee605e30c302efaa39c1a7</id>
<content type='text'>
The main motivation to add set_tid to clone3() is CRIU.

To restore a process with the same PID/TID CRIU currently uses
/proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
ns_last_pid and then (quickly) does a clone(). This works most of the
time, but it is racy. It is also slow as it requires multiple syscalls.

Extending clone3() to support *set_tid makes it possible restore a
process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
race free (as long as the desired PID/TID is available).

This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
on clone3() with *set_tid as they are currently in place for ns_last_pid.

The original version of this change was using a single value for
set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
decided to change set_tid to an array to enable setting the PID of a
process in multiple PID namespaces at the same time. If a process is
created in a PID namespace it is possible to influence the PID inside
and outside of the PID namespace. Details also in the corresponding
selftest.

To create a process with the following PIDs:

      PID NS level         Requested PID
        0 (host)              31496
        1                        42
        2                         1

For that example the two newly introduced parameters to struct
clone_args (set_tid and set_tid_size) would need to be:

  set_tid[0] = 1;
  set_tid[1] = 42;
  set_tid[2] = 31496;
  set_tid_size = 3;

If only the PIDs of the two innermost nested PID namespaces should be
defined it would look like this:

  set_tid[0] = 1;
  set_tid[1] = 42;
  set_tid_size = 2;

The PID of the newly created process would then be the next available
free PID in the PID namespace level 0 (host) and 42 in the PID namespace
at level 1 and the PID of the process in the innermost PID namespace
would be 1.

The set_tid array is used to specify the PID of a process starting
from the innermost nested PID namespaces up to set_tid_size PID namespaces.

set_tid_size cannot be larger then the current PID namespace level.

Signed-off-by: Adrian Reber &lt;areber@redhat.com&gt;
Reviewed-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Reviewed-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Reviewed-by: Dmitry Safonov &lt;0x7f454c46@gmail.com&gt;
Acked-by: Andrei Vagin &lt;avagin@gmail.com&gt;
Link: https://lore.kernel.org/r/20191115123621.142252-1-areber@redhat.com
Signed-off-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
</content>
</entry>
<entry>
<title>pid: use pid_has_task() in pidfd_open()</title>
<updated>2019-10-17T13:37:00Z</updated>
<author>
<name>Christian Brauner</name>
<email>christian.brauner@ubuntu.com</email>
</author>
<published>2019-10-17T10:18:32Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=1e1d0f0b1a3e3533ea4cd4021eb251e53827c70b'/>
<id>urn:sha1:1e1d0f0b1a3e3533ea4cd4021eb251e53827c70b</id>
<content type='text'>
Use the new pid_has_task() helper in pidfd_open(). This simplifies the
code and avoids taking rcu_read_{lock,unlock}() and leads to overall
nicer code.

Signed-off-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Reviewed-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Link: https://lore.kernel.org/r/20191017101832.5985-5-christian.brauner@ubuntu.com
</content>
</entry>
<entry>
<title>pid: use pid_has_task() in __change_pid()</title>
<updated>2019-10-17T13:36:56Z</updated>
<author>
<name>Christian Brauner</name>
<email>christian.brauner@ubuntu.com</email>
</author>
<published>2019-10-17T10:18:30Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=1d416a113f0c0da7a3608ffe4d44bd8e9c4859fa'/>
<id>urn:sha1:1d416a113f0c0da7a3608ffe4d44bd8e9c4859fa</id>
<content type='text'>
Replace hlist_empty() with the new pid_has_task() helper which is more
idiomatic, easier to grep for, and unifies how callers perform this
check.

Signed-off-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Reviewed-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Link: https://lore.kernel.org/r/20191017101832.5985-3-christian.brauner@ubuntu.com
</content>
</entry>
<entry>
<title>kernel/pid.c: convert struct pid count to refcount_t</title>
<updated>2019-07-17T02:23:24Z</updated>
<author>
<name>Joel Fernandes (Google)</name>
<email>joel@joelfernandes.org</email>
</author>
<published>2019-07-16T23:30:06Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=f57e515a1b56325a28a0972c632a623a9c84590c'/>
<id>urn:sha1:f57e515a1b56325a28a0972c632a623a9c84590c</id>
<content type='text'>
struct pid's count is an atomic_t field used as a refcount.  Use
refcount_t for it which is basically atomic_t but does additional
checking to prevent use-after-free bugs.

For memory ordering, the only change is with the following:

 -	if ((atomic_read(&amp;pid-&gt;count) == 1) ||
 -	     atomic_dec_and_test(&amp;pid-&gt;count)) {
 +	if (refcount_dec_and_test(&amp;pid-&gt;count)) {
 		kmem_cache_free(ns-&gt;pid_cachep, pid);

Here the change is from: Fully ordered --&gt; RELEASE + ACQUIRE (as per
refcount-vs-atomic.rst) This ACQUIRE should take care of making sure the
free happens after the refcount_dec_and_test().

The above hunk also removes atomic_read() since it is not needed for the
code to work and it is unclear how beneficial it is.  The removal lets
refcount_dec_and_test() check for cases where get_pid() happened before
the object was freed.

Link: http://lkml.kernel.org/r/20190701183826.191936-1-joel@joelfernandes.org
Signed-off-by: Joel Fernandes (Google) &lt;joel@joelfernandes.org&gt;
Reviewed-by: Andrea Parri &lt;andrea.parri@amarulasolutions.com&gt;
Reviewed-by: Kees Cook &lt;keescook@chromium.org&gt;
Cc: Mathieu Desnoyers &lt;mathieu.desnoyers@efficios.com&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Will Deacon &lt;will.deacon@arm.com&gt;
Cc: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Cc: Elena Reshetova &lt;elena.reshetova@intel.com&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
Cc: KJ Tsanaktsidis &lt;ktsanaktsidis@zendesk.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'pidfd-updates-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux</title>
<updated>2019-07-11T05:17:21Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2019-07-11T05:17:21Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=5450e8a316a64cddcbc15f90733ebc78aa736545'/>
<id>urn:sha1:5450e8a316a64cddcbc15f90733ebc78aa736545</id>
<content type='text'>
Pull pidfd updates from Christian Brauner:
 "This adds two main features.

   - First, it adds polling support for pidfds. This allows process
     managers to know when a (non-parent) process dies in a race-free
     way.

     The notification mechanism used follows the same logic that is
     currently used when the parent of a task is notified of a child's
     death. With this patchset it is possible to put pidfds in an
     {e}poll loop and get reliable notifications for process (i.e.
     thread-group) exit.

   - The second feature compliments the first one by making it possible
     to retrieve pollable pidfds for processes that were not created
     using CLONE_PIDFD.

     A lot of processes get created with traditional PID-based calls
     such as fork() or clone() (without CLONE_PIDFD). For these
     processes a caller can currently not create a pollable pidfd. This
     is a problem for Android's low memory killer (LMK) and service
     managers such as systemd.

  Both patchsets are accompanied by selftests.

  It's perhaps worth noting that the work done so far and the work done
  in this branch for pidfd_open() and polling support do already see
  some adoption:

   - Android is in the process of backporting this work to all their LTS
     kernels [1]

   - Service managers make use of pidfd_send_signal but will need to
     wait until we enable waiting on pidfds for full adoption.

   - And projects I maintain make use of both pidfd_send_signal and
     CLONE_PIDFD [2] and will use polling support and pidfd_open() too"

[1] https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.9+backport%22
    https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.14+backport%22
    https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.19+backport%22

[2] https://github.com/lxc/lxc/blob/aab6e3eb73c343231cdde775db938994fc6f2803/src/lxc/start.c#L1753

* tag 'pidfd-updates-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  tests: add pidfd_open() tests
  arch: wire-up pidfd_open()
  pid: add pidfd_open()
  pidfd: add polling selftests
  pidfd: add polling support
</content>
</entry>
<entry>
<title>pid: add pidfd_open()</title>
<updated>2019-06-28T10:17:55Z</updated>
<author>
<name>Christian Brauner</name>
<email>christian@brauner.io</email>
</author>
<published>2019-05-24T10:43:51Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=32fcb426ec001cb6d5a4a195091a8486ea77e2df'/>
<id>urn:sha1:32fcb426ec001cb6d5a4a195091a8486ea77e2df</id>
<content type='text'>
This adds the pidfd_open() syscall. It allows a caller to retrieve pollable
pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a
process that is created via traditional fork()/clone() calls that is only
referenced by a PID:

int pidfd = pidfd_open(1234, 0);
ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0);

With the introduction of pidfds through CLONE_PIDFD it is possible to
created pidfds at process creation time.
However, a lot of processes get created with traditional PID-based calls
such as fork() or clone() (without CLONE_PIDFD). For these processes a
caller can currently not create a pollable pidfd. This is a problem for
Android's low memory killer (LMK) and service managers such as systemd.
Both are examples of tools that want to make use of pidfds to get reliable
notification of process exit for non-parents (pidfd polling) and race-free
signal sending (pidfd_send_signal()). They intend to switch to this API for
process supervision/management as soon as possible. Having no way to get
pollable pidfds from PID-only processes is one of the biggest blockers for
them in adopting this api. With pidfd_open() making it possible to retrieve
pidfds for PID-based processes we enable them to adopt this api.

In line with Arnd's recent changes to consolidate syscall numbers across
architectures, I have added the pidfd_open() syscall to all architectures
at the same time.

Signed-off-by: Christian Brauner &lt;christian@brauner.io&gt;
Reviewed-by: David Howells &lt;dhowells@redhat.com&gt;
Reviewed-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Acked-by: Arnd Bergmann &lt;arnd@arndb.de&gt;
Cc: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Cc: Kees Cook &lt;keescook@chromium.org&gt;
Cc: Joel Fernandes (Google) &lt;joel@joelfernandes.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Andy Lutomirsky &lt;luto@kernel.org&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Aleksa Sarai &lt;cyphar@cyphar.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: linux-api@vger.kernel.org
</content>
</entry>
<entry>
<title>pidfd: add polling support</title>
<updated>2019-06-28T10:17:55Z</updated>
<author>
<name>Joel Fernandes (Google)</name>
<email>joel@joelfernandes.org</email>
</author>
<published>2019-04-30T16:21:53Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=b53b0b9d9a613c418057f6cb921c2f40a6f78c24'/>
<id>urn:sha1:b53b0b9d9a613c418057f6cb921c2f40a6f78c24</id>
<content type='text'>
This patch adds polling support to pidfd.

Android low memory killer (LMK) needs to know when a process dies once
it is sent the kill signal. It does so by checking for the existence of
/proc/pid which is both racy and slow. For example, if a PID is reused
between when LMK sends a kill signal and checks for existence of the
PID, since the wrong PID is now possibly checked for existence.
Using the polling support, LMK will be able to get notified when a process
exists in race-free and fast way, and allows the LMK to do other things
(such as by polling on other fds) while awaiting the process being killed
to die.

For notification to polling processes, we follow the same existing
mechanism in the kernel used when the parent of the task group is to be
notified of a child's death (do_notify_parent). This is precisely when the
tasks waiting on a poll of pidfd are also awakened in this patch.

We have decided to include the waitqueue in struct pid for the following
reasons:
1. The wait queue has to survive for the lifetime of the poll. Including
   it in task_struct would not be option in this case because the task can
   be reaped and destroyed before the poll returns.

2. By including the struct pid for the waitqueue means that during
   de_thread(), the new thread group leader automatically gets the new
   waitqueue/pid even though its task_struct is different.

Appropriate test cases are added in the second patch to provide coverage of
all the cases the patch is handling.

Cc: Andy Lutomirski &lt;luto@amacapital.net&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Daniel Colascione &lt;dancol@google.com&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Tim Murray &lt;timmurray@google.com&gt;
Cc: Jonathan Kowalski &lt;bl0pbl33p@gmail.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Kees Cook &lt;keescook@chromium.org&gt;
Cc: David Howells &lt;dhowells@redhat.com&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: kernel-team@android.com
Reviewed-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Co-developed-by: Daniel Colascione &lt;dancol@google.com&gt;
Signed-off-by: Daniel Colascione &lt;dancol@google.com&gt;
Signed-off-by: Joel Fernandes (Google) &lt;joel@joelfernandes.org&gt;
Signed-off-by: Christian Brauner &lt;christian@brauner.io&gt;
</content>
</entry>
</feed>
