<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/kernel/exit.c, branch v6.15</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v6.15</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v6.15'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2025-03-25T14:56:22Z</updated>
<entry>
<title>exit: fix the usage of delay_group_leader-&gt;exit_code in do_notify_parent() and pidfs_exit()</title>
<updated>2025-03-25T14:56:22Z</updated>
<author>
<name>Oleg Nesterov</name>
<email>oleg@redhat.com</email>
</author>
<published>2025-03-24T17:19:41Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=9133607de37a4887c6f89ed937176a0a0c1ebb17'/>
<id>urn:sha1:9133607de37a4887c6f89ed937176a0a0c1ebb17</id>
<content type='text'>
Consider a process with a group leader L and a sub-thread T.
L does sys_exit(1), then T does sys_exit_group(2).

In this case wait_task_zombie(L) will notice SIGNAL_GROUP_EXIT and use
L-&gt;signal-&gt;group_exit_code, this is correct.

But, before that, do_notify_parent(L) called by release_task(T) will use
L-&gt;exit_code != L-&gt;signal-&gt;group_exit_code, and this is not consistent.
We don't really care: I think nobody relies on the info that comes with
SIGCHLD; if nothing else, SIGCHLD &lt; SIGRTMIN can be queued only once.

But pidfs_exit() is more problematic: I think pidfs_exit_info-&gt;exit_code
should report -&gt;group_exit_code in this case, just like wait_task_zombie().

TODO: with this change we can hopefully clean up (or maybe even kill) the
similar SIGNAL_GROUP_EXIT checks, at least in wait_task_zombie().
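
The selection rule can be sketched as a small user-space model. It assumes
hypothetical model_* structures standing in for task_struct and
signal_struct, models SIGNAL_GROUP_EXIT as a plain bool, and omits the
wait-status encoding; it is not the kernel code. Once the group has been
killed by exit_group(), the group-wide code takes precedence over the
leader's own exit code, which is what wait_task_zombie() already reports.

```c
#include "assert.h"	/* quoted form still resolves to the system header */
#include "stdbool.h"

/* Hypothetical, heavily simplified stand-ins (wait-status encoding omitted). */
struct model_signal {
	bool group_exit;	/* models SIGNAL_GROUP_EXIT in signal->flags */
	int group_exit_code;	/* models signal->group_exit_code */
};

struct model_task {
	int exit_code;			/* models task->exit_code */
	struct model_signal *signal;	/* shared by all threads in the group */
};

/*
 * Exit code to report for a (possibly delayed) group leader: once the whole
 * group was killed by exit_group(), the group-wide code overrides whatever
 * the leader passed to its own exit().
 */
int model_reported_exit_code(const struct model_task *t)
{
	if (t->signal->group_exit)
		return t->signal->group_exit_code;
	return t->exit_code;
}
```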

Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Link: https://lore.kernel.org/r/20250324171941.GA13114@redhat.com
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>pidfs: cleanup the usage of do_notify_pidfd()</title>
<updated>2025-03-25T13:59:05Z</updated>
<author>
<name>Oleg Nesterov</name>
<email>oleg@redhat.com</email>
</author>
<published>2025-03-23T17:19:55Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=0b7747a5477eb22d041997bc085fa8d492fa9b96'/>
<id>urn:sha1:0b7747a5477eb22d041997bc085fa8d492fa9b96</id>
<content type='text'>
If a single-threaded process exits, do_notify_pidfd() will be called twice:
from exit_notify() and, right after that, from do_notify_parent().

1. Change exit_notify() to call do_notify_pidfd() if the exiting task is
   not ptraced and it is not a group leader.

2. Change do_notify_parent() to call do_notify_pidfd() unconditionally.

   If tsk is not ptraced, do_notify_parent() will only be called when it
   is a group leader and thread_group_empty() is true.

This means that if tsk is ptraced, do_notify_pidfd() will be called from
do_notify_parent() even if tsk is a delay_group_leader(). But this case is
less common and, apart from the unnecessary __wake_up(), harmless.
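
The resulting call pattern can be sketched as a user-space model, assuming
a drastically simplified exit path: the struct, its fields, and the
model_* functions are hypothetical, and only the decision of who calls
do_notify_pidfd() is kept. Counting wake-ups shows the single-threaded
case now waking pidfd waiters once rather than twice, while a ptraced
delay_group_leader() still gets the one unnecessary wake-up.

```c
#include "assert.h"	/* quoted form still resolves to the system header */
#include "stdbool.h"

/* Hypothetical model of the state exit_notify() looks at. */
struct model_exit {
	bool ptraced;
	bool group_leader;
	bool group_empty;	/* models thread_group_empty() */
	int pidfd_wakeups;	/* times do_notify_pidfd() ran */
	int parent_notified;	/* times do_notify_parent() ran */
};

static void model_do_notify_pidfd(struct model_exit *t)
{
	t->pidfd_wakeups++;
}

/* After the change, do_notify_parent() wakes pidfd waiters unconditionally. */
static void model_do_notify_parent(struct model_exit *t)
{
	model_do_notify_pidfd(t);
	t->parent_notified++;
}

void model_exit_notify(struct model_exit *t)
{
	/* A non-ptraced sub-thread never reaches do_notify_parent(). */
	if (!(t->ptraced || t->group_leader))
		model_do_notify_pidfd(t);

	if (t->ptraced) {
		model_do_notify_parent(t);	/* the tracer is always told */
	} else if (t->group_leader) {
		/* A delay_group_leader() is reported later, when it is reaped. */
		if (t->group_empty)
			model_do_notify_parent(t);
	}
}
```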

Granted, this unnecessary __wake_up() can be avoided, but I don't want to
do it in this patch because it's just a consequence of another historical
oddity: we notify the tracer even if !thread_group_empty(), but do_wait()
from the debugger can't work until all other threads exit. With or without this
patch we should either eliminate do_notify_parent() in this case, or change
do_wait(WEXITED) to untrace the ptraced delay_group_leader() at least when
ptrace_reparented().

Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Link: https://lore.kernel.org/r/20250323171955.GA834@redhat.com
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'kernel-6.15-rc1.tasklist_lock' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs</title>
<updated>2025-03-24T20:39:27Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2025-03-24T20:39:27Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=b0cb56cbbdb4754918c28d6d7c294d56e28a3dd5'/>
<id>urn:sha1:b0cb56cbbdb4754918c28d6d7c294d56e28a3dd5</id>
<content type='text'>
Pull tasklist_lock optimizations from Christian Brauner:
 "According to the performance testbots this brings a 23% performance
  increase when creating new processes:

   - Reduce tasklist_lock hold time on exit:
       - Perform add_device_randomness() without tasklist_lock
       - Perform free_pid() calls outside of tasklist_lock

   - Drop irq disablement around pidmap_lock

   - Add some tasklist_lock asserts

   - Call flush_sigqueue() lockless by changing release_task()

   - Don't pointlessly clear TIF_SIGPENDING in __exit_signal() -&gt;
     clear_tsk_thread_flag()"

* tag 'kernel-6.15-rc1.tasklist_lock' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  pid: drop irq disablement around pidmap_lock
  pid: perform free_pid() calls outside of tasklist_lock
  pid: sprinkle tasklist_lock asserts
  exit: hoist get_pid() in release_task() outside of tasklist_lock
  exit: perform add_device_randomness() without tasklist_lock
  exit: kill the pointless __exit_signal()-&gt;clear_tsk_thread_flag(TIF_SIGPENDING)
  exit: change the release_task() paths to call flush_sigqueue() lockless
</content>
</entry>
<entry>
<title>pidfs: improve multi-threaded exec and premature thread-group leader exit polling</title>
<updated>2025-03-20T14:32:43Z</updated>
<author>
<name>Christian Brauner</name>
<email>brauner@kernel.org</email>
</author>
<published>2025-03-20T13:24:08Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=0fb482728ba1ee2130eaa461bf551f014447997c'/>
<id>urn:sha1:0fb482728ba1ee2130eaa461bf551f014447997c</id>
<content type='text'>
This is another attempt to make pidfd polling for multi-threaded
exec and premature thread-group leader exit consistent.

A quick recap of these two cases:

(1) During a multi-threaded exec by a subthread, i.e., non-thread-group
    leader thread, all other threads in the thread-group including the
    thread-group leader are killed and the struct pid of the
    thread-group leader will be taken over by the subthread that called
    exec. IOW, two tasks change their TIDs.

(2) A premature thread-group leader exit means that the thread-group
    leader exited before all of the other subthreads in the thread-group
    have exited.

Both cases lead to inconsistencies for pidfd polling with PIDFD_THREAD.
Any caller that holds a PIDFD_THREAD pidfd to the current thread-group
leader may or may not see an exit notification on the file descriptor
depending on when poll is performed. If the poll is performed before the
exec of the subthread has concluded, an exit notification is generated
for the old thread-group leader. If the poll is performed after the exec
of the subthread has concluded, no exit notification is generated for the
old thread-group leader.

The correct behavior would be to simply not generate an exit
notification on the struct pid of a subthread exec because the struct
pid is taken over by the subthread and thus remains alive.

But this is difficult to handle because a thread-group leader may exit
prematurely as mentioned in (2). In that case an exit notification is
reliably generated, but the subthreads may continue to run for an
indeterminate amount of time and thus may also exec at some point.

So far there was no way to distinguish between (1) and (2) internally.
This tiny series tries to address this problem by discarding
PIDFD_THREAD notification on premature thread-group leader exit.

If that works correctly, then no exit notifications are generated for a
PIDFD_THREAD pidfd for a thread-group leader until all subthreads have
been reaped. If a subthread should exec afterwards, no exit notification
will be generated until that task exits or it creates subthreads and
repeats the cycle.
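
The intended semantics can be sketched as a small state machine. It
assumes a hypothetical model in which a thread counter stands in for
thread_group_empty() and the model_* names are illustrative rather than
pidfs internals. The leader's struct pid yields a PIDFD_THREAD exit
notification only once the leader has exited and no subthreads remain; a
multi-threaded exec never yields one, because the struct pid is taken
over and stays alive.

```c
#include "assert.h"	/* quoted form still resolves to the system header */
#include "stdbool.h"

/* Hypothetical model of the leader's struct pid seen via PIDFD_THREAD. */
struct model_group {
	int threads;		/* live threads, leader included */
	bool leader_dead;	/* leader has exited (possibly prematurely) */
	bool notified;		/* exit notification generated for the pidfd */
};

static void maybe_notify(struct model_group *g)
{
	/* Notify only once the leader is dead and every sub-thread is gone. */
	if (!g->leader_dead)
		return;
	if (g->threads == 0)
		g->notified = true;
}

void model_leader_exit(struct model_group *g)
{
	g->leader_dead = true;
	g->threads--;
	maybe_notify(g);	/* premature exit: nothing is reported yet */
}

void model_subthread_exit(struct model_group *g)
{
	g->threads--;
	maybe_notify(g);	/* the last one out lets the leader be reported */
}

void model_subthread_exec(struct model_group *g)
{
	/* de_thread() kills the old leader while the exec-ing thread lives, */
	model_leader_exit(g);
	/* then the remaining sub-threads, leaving only the exec-ing thread, */
	g->threads = 1;
	/* which takes over the leader's struct pid: it stays alive. */
	g->leader_dead = false;
}
```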

Co-Developed-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Link: https://lore.kernel.org/r/20250320-work-pidfs-thread_group-v4-1-da678ce805bf@kernel.org
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>pidfs: record exit code and cgroupid at exit</title>
<updated>2025-03-05T12:26:12Z</updated>
<author>
<name>Christian Brauner</name>
<email>brauner@kernel.org</email>
</author>
<published>2025-03-05T10:08:15Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=4513522984a0af9c170af991c37fbb483cca654b'/>
<id>urn:sha1:4513522984a0af9c170af991c37fbb483cca654b</id>
<content type='text'>
Record the exit code and cgroupid in release_task() and stash them in
struct pidfs_exit_info so they can be retrieved even after the task has
been reaped.
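
A minimal sketch of the idea, assuming hypothetical model_* types in place
of struct pid, task_struct, and the real struct pidfs_exit_info (whose
layout may differ): the values are copied into an object that outlives
the task, so a later query through the pid still succeeds after the
task's memory has been freed.

```c
#include "assert.h"	/* quoted form still resolves to the system header */
#include "stdbool.h"
#include "stdint.h"
#include "stdlib.h"

/* Hypothetical stand-in for the per-pid record kept by pidfs. */
struct model_exit_info {
	bool valid;		/* set once the task has been released */
	int32_t exit_code;
	uint64_t cgroupid;
};

/* Hypothetical struct pid: lives on while a pidfd still references it. */
struct model_pid {
	struct model_exit_info exit_info;
};

struct model_task {
	int exit_code;
	uint64_t cgroupid;
	struct model_pid *pid;
};

/* Models release_task(): stash the data, then the task itself goes away. */
void model_release_task(struct model_task *t)
{
	t->pid->exit_info.exit_code = t->exit_code;
	t->pid->exit_info.cgroupid = t->cgroupid;
	t->pid->exit_info.valid = true;
	free(t);
}

/* Query through the pid, as a pidfd would, even after the task is gone. */
bool model_get_exit_info(const struct model_pid *pid,
			 struct model_exit_info *out)
{
	if (!pid->exit_info.valid)
		return false;
	*out = pid->exit_info;
	return true;
}
```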

Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-5-c8c3d8361705@kernel.org
Reviewed-by: Jeff Layton &lt;jlayton@kernel.org&gt;
Reviewed-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>pid: perform free_pid() calls outside of tasklist_lock</title>
<updated>2025-02-07T10:22:43Z</updated>
<author>
<name>Mateusz Guzik</name>
<email>mjguzik@gmail.com</email>
</author>
<published>2025-02-06T16:44:13Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=7903f907a226058ed99f86e9924e082aea57fc45'/>
<id>urn:sha1:7903f907a226058ed99f86e9924e082aea57fc45</id>
<content type='text'>
As the clone side already executes pid allocation with only pidmap_lock
held, issuing free_pid() while still holding tasklist_lock exacerbates
the total hold time of the latter.

More things may show up later which require initial cleanup with the
lock held but can be finished without it. For that reason, a struct to
collect such work is added instead of merely passing the pid array.
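
The deferral pattern can be sketched in user space. It assumes a boolean
flag in place of tasklist_lock, four pid types, and hypothetical model_*
names; the kernel's own collection struct and helpers may look different.
Pids are detached while the lock is held, parked in a small struct, and
freed only after the lock is dropped, which the counters let a test check.

```c
#include "assert.h"	/* quoted form still resolves to the system header */
#include "stdbool.h"
#include "stddef.h"

enum { MODEL_PIDTYPE_MAX = 4 };	/* hypothetical: one pid per pid type */

static bool tasklist_locked;	/* stands in for holding tasklist_lock */
static int freed_total;
static int freed_under_lock;	/* should stay at zero */

struct model_pid {
	int nr;
};

/* Work collected under the lock and finished after dropping it. */
struct model_release_post {
	struct model_pid *pids[MODEL_PIDTYPE_MAX];
};

static void model_free_pid(struct model_pid *pid)
{
	/* The real free_pid() takes pidmap_lock; here we only count calls. */
	(void)pid;
	if (tasklist_locked)
		freed_under_lock++;
	freed_total++;
}

/* Under the lock: detach the pids, but only remember them. */
static void model_detach_pids(struct model_pid **pids,
			      struct model_release_post *post)
{
	assert(tasklist_locked);	/* in the spirit of lockdep_assert_held() */
	for (size_t i = 0; i != MODEL_PIDTYPE_MAX; i++) {
		post->pids[i] = pids[i];
		pids[i] = NULL;
	}
}

void model_release_task(struct model_pid **pids)
{
	struct model_release_post post[1] = {{ { NULL } }};

	tasklist_locked = true;
	model_detach_pids(pids, post);
	tasklist_locked = false;

	/* The potentially contended part now runs without tasklist_lock. */
	for (size_t i = 0; i != MODEL_PIDTYPE_MAX; i++)
		if (post->pids[i])
			model_free_pid(post->pids[i]);
}
```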

Reviewed-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Signed-off-by: Mateusz Guzik &lt;mjguzik@gmail.com&gt;
Link: https://lore.kernel.org/r/20250206164415.450051-5-mjguzik@gmail.com
Acked-by: "Liam R. Howlett" &lt;Liam.Howlett@Oracle.com&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>exit: hoist get_pid() in release_task() outside of tasklist_lock</title>
<updated>2025-02-07T10:22:43Z</updated>
<author>
<name>Mateusz Guzik</name>
<email>mjguzik@gmail.com</email>
</author>
<published>2025-02-06T16:44:11Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=6731cd97e60d6f3a30057c0fc513aed543187104'/>
<id>urn:sha1:6731cd97e60d6f3a30057c0fc513aed543187104</id>
<content type='text'>
This reduces the tasklist_lock hold time, as get_pid() contains an
atomic operation.
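
The reordering amounts to taking the reference before entering the
critical section. A minimal sketch, assuming a hypothetical model_pid
whose C11 _Atomic counter stands in for the refcount that get_pid()
increments, and a flag in place of tasklist_lock:

```c
#include "assert.h"	/* quoted form still resolves to the system header */
#include "stdatomic.h"
#include "stdbool.h"

/* Hypothetical struct pid reduced to its reference count. */
struct model_pid {
	atomic_int count;
};

static bool tasklist_locked;	/* stands in for holding tasklist_lock */
static int atomics_under_lock;	/* atomic RMWs issued inside the section */

static struct model_pid *model_get_pid(struct model_pid *pid)
{
	if (tasklist_locked)
		atomics_under_lock++;
	pid->count++;	/* ++ on an _Atomic object is an atomic RMW */
	return pid;
}

/*
 * In this model the pid cannot go away before the lock is taken, so the
 * extra reference can be grabbed first.
 */
struct model_pid *model_release_task(struct model_pid *task_pid)
{
	struct model_pid *thread_pid = model_get_pid(task_pid);

	tasklist_locked = true;
	/* ... unhash the task; no refcount traffic in here any more ... */
	tasklist_locked = false;

	return thread_pid;	/* dropped later with put_pid() */
}
```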

Reviewed-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Signed-off-by: Mateusz Guzik &lt;mjguzik@gmail.com&gt;
Link: https://lore.kernel.org/r/20250206164415.450051-3-mjguzik@gmail.com
Acked-by: "Liam R. Howlett" &lt;Liam.Howlett@Oracle.com&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>exit: perform add_device_randomness() without tasklist_lock</title>
<updated>2025-02-07T10:22:43Z</updated>
<author>
<name>Mateusz Guzik</name>
<email>mjguzik@gmail.com</email>
</author>
<published>2025-02-06T16:44:10Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=1ab2785694971257f6a7dbd5a71bd8402b5fc305'/>
<id>urn:sha1:1ab2785694971257f6a7dbd5a71bd8402b5fc305</id>
<content type='text'>
Parallel calls to add_device_randomness() already contend with each other.

The clone side already runs outside of tasklist_lock, which in turn means
any caller on the exit side extends the tasklist_lock hold time while
contending on the random-private lock.

Reviewed-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Signed-off-by: Mateusz Guzik &lt;mjguzik@gmail.com&gt;
Link: https://lore.kernel.org/r/20250206164415.450051-2-mjguzik@gmail.com
Acked-by: "Liam R. Howlett" &lt;Liam.Howlett@Oracle.com&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>exit: kill the pointless __exit_signal()-&gt;clear_tsk_thread_flag(TIF_SIGPENDING)</title>
<updated>2025-02-07T10:20:57Z</updated>
<author>
<name>Oleg Nesterov</name>
<email>oleg@redhat.com</email>
</author>
<published>2025-02-06T15:23:34Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=43966114b49988ebca6b7d17eb68bad7740e9fb2'/>
<id>urn:sha1:43966114b49988ebca6b7d17eb68bad7740e9fb2</id>
<content type='text'>
It predates the git history and most probably it was never needed. It
doesn't really hurt, but it looks confusing because its purpose is not
clear at all.

release_task(p) is called when this task has already passed exit_notify()
so signal_pending(p) == T shouldn't make any difference.

And even _if_ there were a subtle reason to clear TIF_SIGPENDING after
exit_notify(), this clear_tsk_thread_flag() can't help anyway.  If the
exiting task is a group leader or if it is ptraced, release_task() will
likely be called when this task has already done its last schedule() from
do_task_dead().

Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Link: https://lore.kernel.org/r/20250206152334.GB14620@redhat.com
Acked-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>exit: change the release_task() paths to call flush_sigqueue() lockless</title>
<updated>2025-02-07T10:20:57Z</updated>
<author>
<name>Oleg Nesterov</name>
<email>oleg@redhat.com</email>
</author>
<published>2025-02-06T15:23:14Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=fb3bbcfe344e64a46574a638b051ffd78762c12d'/>
<id>urn:sha1:fb3bbcfe344e64a46574a638b051ffd78762c12d</id>
<content type='text'>
A task can block a signal, accumulate up to RLIMIT_SIGPENDING sigqueues,
and exit. In this case __exit_signal()-&gt;flush_sigqueue() called with irqs
disabled can trigger a hard lockup, see
https://lore.kernel.org/all/20190322114917.GC28876@redhat.com/

Fortunately, after the recent posixtimer changes, sys_timer_delete() paths
no longer try to clear SIGQUEUE_PREALLOC and/or free tmr-&gt;sigq, and after
the exiting task passes __exit_signal(), lock_task_sighand() can't succeed
and pid_task(tmr-&gt;it_pid) will return NULL.

This means that after __exit_signal(tsk) nobody can play with tsk-&gt;pending
or (if group_dead) with tsk-&gt;signal-&gt;shared_pending, so release_task() can
safely call flush_sigqueue() after write_unlock_irq(&amp;tasklist_lock).
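
The ordering argument can be sketched in user space, assuming hypothetical
model_* types and a flag that stands in for holding tasklist_lock with
irqs disabled. Once the sighand pointer has been cleared, the model of
lock_task_sighand() can no longer succeed, so the pending queue has no
other users and is freed after the irq-disabled section ends.

```c
#include "assert.h"	/* quoted form still resolves to the system header */
#include "stdbool.h"
#include "stdlib.h"

/* Hypothetical queued signal; a real sigqueue also carries siginfo. */
struct model_sigqueue {
	struct model_sigqueue *next;
};

struct model_sighand {
	int unused;
};

struct model_task {
	struct model_sighand *sighand;	/* NULL once __exit_signal() ran */
	struct model_sigqueue *pending;	/* models tsk->pending.list */
};

static bool irqs_disabled;	/* tasklist_lock held with irqs off */
static int freed_total;
static int freed_with_irqs_off;	/* should stay at zero */

/* Models lock_task_sighand(): cannot succeed once sighand is detached. */
struct model_sighand *model_lock_task_sighand(struct model_task *t)
{
	return t->sighand;
}

static void model_flush_sigqueue(struct model_sigqueue *q)
{
	while (q) {
		struct model_sigqueue *next = q->next;

		if (irqs_disabled)
			freed_with_irqs_off++;
		free(q);
		freed_total++;
		q = next;
	}
}

void model_release_task(struct model_task *t)
{
	irqs_disabled = true;
	t->sighand = NULL;	/* __exit_signal(): detach from the sighand */
	irqs_disabled = false;

	/*
	 * Nobody can reach t->pending any more, so the possibly very long
	 * queue is freed with irqs enabled.
	 */
	model_flush_sigqueue(t->pending);
	t->pending = NULL;
}
```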

TODO:
	- we can probably shift posix_cpu_timers_exit() as well
	- do_sigaction() can hit a similar problem

Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Link: https://lore.kernel.org/r/20250206152314.GA14620@redhat.com
Reviewed-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
</feed>
