path: root/kernel
Age         Commit message                                               Author              Lines
2025-11-05  bpf: support instructions arrays with constants blinding  Anton Protopopov  -1/+30

When bpf_jit_harden is enabled, all constants in the BPF code are blinded to prevent JIT spraying attacks. This happens during the JIT phase. Adjust all the related instruction arrays accordingly.

Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251105090410.1250500-6-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-05  bpf, x86: add new map type: instructions array  Anton Protopopov  -1/+354

On the bpf(BPF_PROG_LOAD) syscall, user-supplied BPF programs are translated by the verifier into "xlated" BPF programs. During this process the original instruction offsets might be adjusted and/or individual instructions might be replaced by new sets of instructions, or deleted.

Add a new BPF map type which is aimed to keep track of how, for a given program, the original instructions were relocated during the verification. Also, besides keeping track of the original -> xlated mapping, make the x86 JIT build the xlated -> jitted mapping for every instruction listed in an instruction array. This is required for every future application of instruction arrays: static keys, indirect jumps and indirect calls.

A map of the BPF_MAP_TYPE_INSN_ARRAY type must be created with u32 keys and values of size 8. The values have different semantics for userspace and for BPF space. For userspace a value consists of two u32 values: the xlated and jitted offsets. For the BPF side the value is a real pointer to a jitted instruction.

On map creation/initialization, before loading the program, each element of the map should be initialized to point to an instruction offset within the program. Before the program load such maps should be made frozen. After the program verification, xlated and jitted offsets can be read via the bpf(2) syscall. If a tracked instruction is removed by the verifier, then the xlated offset is set to (u32)-1, which is considered to be too big for a valid BPF program offset.

One such map can, obviously, be used to track one and only one BPF program. If the verification process was unsuccessful, then the same map can be re-used to verify the program with a different log level. However, if the program was loaded fine, then such a map, being frozen in any case, can't be reused by other programs even after the program release.

Example. Consider the following original and xlated programs:

    Original prog:                   Xlated prog:
     0: r1 = 0x0                      0: r1 = 0
     1: *(u32 *)(r10 - 0x4) = r1      1: *(u32 *)(r10 -4) = r1
     2: r2 = r10                      2: r2 = r10
     3: r2 += -0x4                    3: r2 += -4
     4: r1 = 0x0 ll                   4: r1 = map[id:88]
     6: call 0x1                      6: r1 += 272
                                      7: r0 = *(u32 *)(r2 +0)
                                      8: if r0 >= 0x1 goto pc+3
                                      9: r0 <<= 3
                                     10: r0 += r1
                                     11: goto pc+1
                                     12: r0 = 0
     7: r6 = r0                      13: r6 = r0
     8: if r6 == 0x0 goto +0x2       14: if r6 == 0x0 goto pc+4
     9: call 0x76                    15: r0 = 0xffffffff8d2079c0
                                     17: r0 = *(u64 *)(r0 +0)
    10: *(u64 *)(r6 + 0x0) = r0      18: *(u64 *)(r6 +0) = r0
    11: r0 = 0x0                     19: r0 = 0x0
    12: exit                         20: exit

An instruction array map containing, e.g., instructions [0,4,7,12] will be translated by the verifier to [0,4,13,20]. A map with index 5 (the middle of a 16-byte instruction) or indexes greater than 12 (outside the program boundaries) would be rejected.

The functionality provided by this patch will be extended in subsequent patches to implement BPF Static Keys, indirect jumps, and indirect calls.

Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251105090410.1250500-2-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
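A minimal userspace sketch of the flow described above, assuming the new BPF_MAP_TYPE_INSN_ARRAY enum value has reached the installed UAPI headers and that the initial original offset is written into the first u32 of the value (names here are illustrative, not the final UAPI):

    #include <stdint.h>
    #include <unistd.h>
    #include <bpf/bpf.h>    /* libbpf's thin syscall wrappers */

    /* Userspace view of a value: xlated and jitted offsets (two u32s). */
    struct insn_array_value {
            uint32_t xlated_off;
            uint32_t jitted_off;
    };

    /* Create and seed an instruction array for @n_insns tracked offsets,
     * then freeze it as required before BPF_PROG_LOAD. */
    static int create_insn_array(const uint32_t *orig_offs, uint32_t n_insns)
    {
            int fd = bpf_map_create(BPF_MAP_TYPE_INSN_ARRAY, "insn_array",
                                    sizeof(uint32_t),
                                    sizeof(struct insn_array_value),
                                    n_insns, NULL);
            if (fd < 0)
                    return fd;

            for (uint32_t i = 0; i < n_insns; i++) {
                    /* Each element points at an original instruction offset;
                     * writing it into xlated_off is an assumption here. */
                    struct insn_array_value v = { .xlated_off = orig_offs[i] };

                    if (bpf_map_update_elem(fd, &i, &v, BPF_ANY))
                            goto err;
            }
            if (bpf_map_freeze(fd))
                    goto err;

            /* After a successful prog load, lookups return the relocated
             * offsets; (u32)-1 in xlated_off marks a deleted instruction. */
            return fd;
    err:
            close(fd);
            return -1;
    }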
2025-11-06  rcutorture: Remove redundant rcutorture_one_extend() from rcu_torture_one_read()  Paul E. McKenney  -3/+1

This commit removes a harmless but potentially confusing invocation of rcutorture_one_extend() within rcu_torture_one_read(). The immediately preceding call to rcu_torture_one_read_start() already does this cleanup, and the other call to rcu_torture_one_read_start() already relies on this.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-11-06  locktorture: Fix memory leak in param_set_cpumask()  Wang Liang  -2/+6

With CONFIG_CPUMASK_OFFSTACK=y, the 'bind_writers' buffer is allocated via alloc_cpumask_var() in param_set_cpumask(). But it is not freed when setting the module parameter multiple times via the sysfs interface or when removing the module. The below kmemleak trace is seen for this issue:

    unreferenced object 0xffff888100aabff8 (size 8):
      comm "bash", pid 323, jiffies 4295059233
      hex dump (first 8 bytes):
        07 00 00 00 00 00 00 00                          ........
      backtrace (crc ac50919):
        __kmalloc_node_noprof+0x2e5/0x420
        alloc_cpumask_var_node+0x1f/0x30
        param_set_cpumask+0x26/0xb0 [locktorture]
        param_attr_store+0x93/0x100
        module_attr_store+0x1b/0x30
        kernfs_fop_write_iter+0x114/0x1b0
        vfs_write+0x300/0x410
        ksys_write+0x60/0xd0
        do_syscall_64+0xa4/0x260
        entry_SYSCALL_64_after_hwframe+0x77/0x7f

This issue can be reproduced by:

    insmod locktorture.ko bind_writers=1
    rmmod locktorture

or:

    insmod locktorture.ko bind_writers=1
    echo 2 > /sys/module/locktorture/parameters/bind_writers

Considering that setting the module parameter 'bind_writers' or 'bind_readers' via the sysfs interface has no real effect, set the parameter permissions to 0444. To fix the memory leak when removing the module, free 'bind_writers' and 'bind_readers' memory in lock_torture_cleanup().

Fixes: 73e341242483 ("locktorture: Add readers_bind and writers_bind module parameters")
Suggested-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
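A sketch of the shape of the fix, assuming module-global cpumasks and the lock_torture_cleanup() teardown hook named in the commit message:

    #include <linux/cpumask.h>

    /* Module-global writer/reader affinity masks (per the commit message). */
    static cpumask_var_t bind_writers;
    static cpumask_var_t bind_readers;

    static void lock_torture_cleanup(void)
    {
            /* ... existing teardown ... */

            /* free_cpumask_var() pairs with alloc_cpumask_var() and is a
             * no-op when CONFIG_CPUMASK_OFFSTACK=n, so this is always safe. */
            free_cpumask_var(bind_writers);
            free_cpumask_var(bind_readers);
    }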
2025-11-05  srcu: Require special srcu_struct define/init for SRCU-fast readers  Paul E. McKenney  -0/+1

This commit adds CONFIG_PROVE_RCU=y checking to enforce the new rule that srcu_struct structures passed to srcu_read_lock_fast() and other SRCU-fast read-side markers be either initialized with init_srcu_struct_fast() or defined using DEFINE_SRCU_FAST() or DEFINE_STATIC_SRCU_FAST(). This will enable removal of the non-debug read-side checks from srcu_read_lock_fast() and friends, which on my laptop provides a 25% speedup (which admittedly amounts to about half a nanosecond, but when tracing fastpaths...).

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: <bpf@vger.kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-11-05  rcutorture: Exercise DEFINE_STATIC_SRCU_FAST() and init_srcu_struct_fast()  Paul E. McKenney  -4/+15

This commit updates the initialization for the "srcu" and "srcud" torture types to use DEFINE_STATIC_SRCU_FAST() and init_srcu_struct_fast(), respectively, when reader_flavor is equal to SRCU_READ_FLAVOR_FAST.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: <bpf@vger.kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-11-05  srcu: Make grace-period determination use ssp->srcu_reader_flavor  Paul E. McKenney  -1/+1

This commit causes the srcu_readers_unlock_idx() function to take the srcu_struct structure's ->srcu_reader_flavor field into account. This ensures that structures defined via DEFINE_SRCU_FAST() or initialized via init_srcu_struct_fast() have their grace periods use synchronize_srcu() or synchronize_srcu_expedited() instead of smp_mb(), even before the first SRCU reader has been entered.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: <bpf@vger.kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-11-05  srcu: Create a DEFINE_SRCU_FAST()  Paul E. McKenney  -2/+34

This commit creates DEFINE_SRCU_FAST() and DEFINE_STATIC_SRCU_FAST() macros that are similar to DEFINE_SRCU() and DEFINE_STATIC_SRCU(), but which create srcu_struct structures that are usable only by readers initiated by srcu_read_lock_fast() and friends.

This commit does make DEFINE_SRCU_FAST() available to modules, in which case the per-CPU srcu_data structures are not created at compile time, but rather at module-load time. This means that the ->srcu_reader_flavor field of the srcu_data structure is not available. Therefore, this commit instead creates an ->srcu_reader_flavor field in the srcu_struct structure, adds arguments to the DEFINE_SRCU()-related macros to initialize this new field, and extends the checks in the __srcu_check_read_flavor() function to include this new field.

This commit also allows dynamically allocated srcu_struct structures to be marked for SRCU-fast readers. It does so by defining a new init_srcu_struct_fast() function that marks the specified srcu_struct structure for use by srcu_read_lock_fast() and friends.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: <bpf@vger.kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
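A minimal usage sketch of the resulting API, assuming the read-side cookie type of the existing SRCU-fast readers (struct srcu_ctr __percpu *) and an int return from init_srcu_struct_fast():

    #include <linux/srcu.h>

    /* Compile-time definition, usable only by SRCU-fast readers. */
    DEFINE_STATIC_SRCU_FAST(my_srcu);

    /* Dynamically allocated variant. */
    static struct srcu_struct dyn_srcu;

    static int __init my_init(void)
    {
            return init_srcu_struct_fast(&dyn_srcu);
    }

    static void reader(void)
    {
            struct srcu_ctr __percpu *scp;

            scp = srcu_read_lock_fast(&my_srcu);
            /* ... read-side critical section ... */
            srcu_read_unlock_fast(&my_srcu, scp);
    }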
2025-11-05  rcutorture: Test srcu_expedite_current()  Paul E. McKenney  -0/+12

This commit adds a ->exp_current member to the rcu_torture_ops structure to test the srcu_expedite_current() function.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <bpf@vger.kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-11-05  srcu: Create an srcu_expedite_current() function  Paul E. McKenney  -0/+58

This commit creates an srcu_expedite_current() function that expedites the current (and possibly the next) SRCU grace period for the specified srcu_struct structure. This functionality will be inherited by RCU Tasks Trace courtesy of its mapping to SRCU fast.

If the current SRCU grace period is already waiting, that wait will complete before the expediting takes effect. If there is no SRCU grace period in flight, this function might well create one.

[ paulmck: Apply Zqiang feedback for PREEMPT_RT use. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <bpf@vger.kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
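A hedged usage sketch; whether expediting ahead of a subsequent synchronize_srcu() is the intended calling pattern is an assumption here:

    #include <linux/srcu.h>

    DEFINE_SRCU(my_srcu);

    static void urgent_update(void)
    {
            /* Expedite the SRCU grace period already in flight (and
             * possibly the next one); if none is in flight, this may
             * well start one. */
            srcu_expedite_current(&my_srcu);

            /* A subsequent wait should now complete sooner. */
            synchronize_srcu(&my_srcu);
    }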
2025-11-05  srcu: Permit Tiny SRCU srcu_read_unlock() with interrupts disabled  Paul E. McKenney  -4/+9

The current Tiny SRCU implementation of srcu_read_unlock() awakens the grace-period processing when exiting the outermost SRCU read-side critical section. However, not all Linux-kernel configurations and contexts permit swake_up_one() to be invoked while interrupts are disabled, and this can result in indefinitely extended SRCU grace periods. This commit therefore only invokes swake_up_one() when interrupts are enabled, and introduces polling to the grace-period workqueue handler.

Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: Zqiang <qiang.zhang@linux.dev>
Closes: https://lore.kernel.org/oe-lkp/202508261642.b15eefbb-lkp@intel.com
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
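A sketch of the resulting unlock path, following current srcutiny.c naming; the added condition is the !irqs_disabled() check, and the polling side in the workqueue handler is omitted:

    /* Tiny SRCU __srcu_read_unlock(): wake the grace-period handler
     * only when interrupts are enabled; otherwise rely on the
     * workqueue handler's polling. */
    void __srcu_read_unlock(struct srcu_struct *ssp, int idx)
    {
            int newval = READ_ONCE(ssp->srcu_lock_nesting[idx]) - 1;

            WRITE_ONCE(ssp->srcu_lock_nesting[idx], newval);
            if (!newval && READ_ONCE(ssp->srcu_gp_waiting) && in_task() &&
                !irqs_disabled())
                    swake_up_one(&ssp->srcu_wq);
    }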
2025-11-05  trace: use override credential guard  Christian Brauner  -12/+5

Use override credential guards for scoped credential override with automatic restoration on scope exit.

Link: https://patch.msgid.link/20251103-work-creds-guards-prepare_creds-v1-12-b447b82f2c9b@kernel.org
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
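The guard machinery comes from linux/cleanup.h; a hedged sketch of the pattern, where the guard class name and its exact definition in this series are assumptions:

    #include <linux/cleanup.h>
    #include <linux/cred.h>

    /* Hypothetical guard class: override_creds() on entry,
     * revert_creds() on every exit path. */
    DEFINE_CLASS(override_creds, const struct cred *,
                 revert_creds(_T), override_creds(override),
                 const struct cred *override);

    static void with_elevated_creds(const struct cred *new)
    {
            CLASS(override_creds, old)(new);

            /* ... runs with @new; @old is restored automatically on
             * any return from this scope ... */
    }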
2025-11-05  trace: use prepare credential guard  Christian Brauner  -4/+1

Use the prepare credential guard for allocating a new set of credentials.

Link: https://patch.msgid.link/20251103-work-creds-guards-prepare_creds-v1-11-b447b82f2c9b@kernel.org
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  sched_ext: Mark racy bitfields to prevent adding fields that can't tolerate races  Tejun Heo  -0/+4

The warned bitfields in struct scx_sched are updated racily from concurrent CPUs, causing RMW races, which is fine for these boolean warning flags. Add a comment marking this area to prevent future fields that can't tolerate racy updates from being added here.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-05  cgroup/cpuset: Globally track isolated_cpus update  Waiman Long  -38/+35

The current cpuset code passes a local isolcpus_updated flag around in a number of functions to determine if external isolation related cpumasks like wq_unbound_cpumask should be updated. It is a bit cumbersome and makes the code more complex. Simplify the code by using a global boolean flag "isolated_cpus_updating" to track this. This flag will be set in isolated_cpus_update() and cleared in update_isolation_cpumasks().

No functional change is expected.

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-05  cgroup/cpuset: Ensure domain isolated CPUs stay in root or isolated partition  Waiman Long  -4/+6

Commit 4a74e418881f ("cgroup/cpuset: Check partition conflict with housekeeping setup") is supposed to ensure that domain isolated CPUs designated by the "isolcpus" boot command line option stay either in root partition or in isolated partitions. However, the required check wasn't implemented when a remote partition was created or when an existing partition changed type from "root" to "isolated". Even though this is a relatively minor issue, we still need to add the required prstate_housekeeping_conflict() call in the right places to ensure that the rule is strictly followed.

The following steps can be used to reproduce the problem before this fix.

    # fmt -1 /proc/cmdline | grep isolcpus
    isolcpus=9
    # cd /sys/fs/cgroup/
    # echo +cpuset > cgroup.subtree_control
    # mkdir test
    # echo 9 > test/cpuset.cpus
    # echo isolated > test/cpuset.cpus.partition
    # cat test/cpuset.cpus.partition
    isolated
    # cat test/cpuset.cpus.effective
    9
    # echo root > test/cpuset.cpus.partition
    # cat test/cpuset.cpus.effective
    9
    # cat test/cpuset.cpus.partition
    root

With this fix, the last few steps will become:

    # echo root > test/cpuset.cpus.partition
    # cat test/cpuset.cpus.effective
    0-8,10-95
    # cat test/cpuset.cpus.partition
    root invalid (partition config conflicts with housekeeping setup)

Reported-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-05  cgroup/cpuset: Move up prstate_housekeeping_conflict() helper  Waiman Long  -20/+20

Move up the prstate_housekeeping_conflict() helper so that it can be used in remote partition code.

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-05  cgroup/cpuset: Fail if isolated and nohz_full don't leave any housekeeping  Waiman Long  -1/+73

Currently the user can set up isolated CPUs via cpuset and nohz_full in such a way that leaves no housekeeping CPU (i.e. no CPU that is neither domain isolated nor nohz full). This can be a problem for other subsystems (e.g. the timer wheel migration). Prevent this configuration by blocking any assignment that would cause the union of domain isolated CPUs and nohz_full to cover all CPUs.

[longman: Remove isolated_cpus_should_update() and rewrite the checking in update_prstate() and update_parent_effective_cpumask()]

Originally-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
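A hedged sketch of the invariant being enforced; helper name and the choice of cpu_online_mask are assumptions, and tick_nohz_full_mask exists only with CONFIG_NO_HZ_FULL:

    #include <linux/cpumask.h>
    #include <linux/gfp.h>
    #include <linux/tick.h>

    /* Reject a configuration where domain-isolated CPUs plus nohz_full
     * CPUs cover everything, leaving no housekeeping CPU. */
    static bool leaves_housekeeping_cpu(const struct cpumask *isolated)
    {
            cpumask_var_t tmp;
            bool ok;

            if (!zalloc_cpumask_var(&tmp, GFP_KERNEL))
                    return false;

            cpumask_or(tmp, isolated, tick_nohz_full_mask);
            /* At least one online CPU must remain outside the union. */
            ok = !cpumask_subset(cpu_online_mask, tmp);

            free_cpumask_var(tmp);
            return ok;
    }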
2025-11-05  cgroup/cpuset: Rename update_unbound_workqueue_cpumask() to update_isolation_cpumasks()  Gabriele Monaco  -6/+6

update_unbound_workqueue_cpumask() updates unbound workqueue settings when there's a change in isolated CPUs, but it can be used for other subsystems requiring updates when isolated CPUs change. Generalise the name to update_isolation_cpumasks() to prepare for other functions unrelated to workqueues to be called in that spot.

[longman: Change the function name to update_isolation_cpumasks()]

Acked-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-04  bpf: Convert bpf_sock_addr_kern "uaddr" to sockaddr_unsized  Kees Cook  -4/+4

Change struct bpf_sock_addr_kern to use sockaddr_unsized for the "uaddr" field instead of sockaddr. This improves type safety in the BPF cgroup socket address filtering code.

The casting in __cgroup_bpf_run_filter_sock_addr() is updated to match the new type, removing an unnecessary cast in the initialization and updating the conditional assignment to use the appropriate sockaddr_unsized cast. Additionally rename the "unspec" variable to "storage" to better align with its usage.

No binary changes expected.

Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20251104002617.2752303-7-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
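For reference, a hedged sketch of the struct after the change; the surrounding members follow the current include/linux/filter.h definition, while struct sockaddr_unsized itself is introduced elsewhere in Kees' series and assumed here:

    /* include/linux/filter.h (sketch) */
    struct bpf_sock_addr_kern {
            struct sock *sk;
            struct sockaddr_unsized *uaddr;  /* was: struct sockaddr *uaddr */
            /* Temporary "register" for indirect stores. */
            u64 tmp_reg;
            void *t_ctx;                     /* Attach-type specific context. */
            u32 uaddrlen;
    };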
2025-11-04  bpf: Convert cgroup sockaddr filters to use sockaddr_unsized consistently  Kees Cook  -2/+2

Update the BPF cgroup sockaddr filtering infrastructure to use sockaddr_unsized consistently throughout the call chain, removing redundant explicit casts from callers.

No binary changes expected.

Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20251104002617.2752303-6-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04  bpf: add _impl suffix for bpf_stream_vprintk() kfunc  Mykyta Yatsenko  -2/+3

Rename bpf_stream_vprintk() to bpf_stream_vprintk_impl(). This makes bpf_stream_vprintk() follow the already established "_impl" suffix-based naming convention for kfuncs with the bpf_prog_aux argument provided by the verifier implicitly. This convention will be taken advantage of with the upcoming KF_IMPLICIT_ARGS feature to preserve backwards compatibility to BPF programs.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20251104-implv2-v3-2-4772b9ae0e06@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Ihor Solodrai <ihor.solodrai@linux.dev>
2025-11-04  bpf: add _impl suffix for bpf_task_work_schedule* kfuncs  Mykyta Yatsenko  -16/+20

Rename:

    bpf_task_work_schedule_resume() -> bpf_task_work_schedule_resume_impl()
    bpf_task_work_schedule_signal() -> bpf_task_work_schedule_signal_impl()

This aligns the task work scheduling kfuncs with the established naming scheme for kfuncs with the bpf_prog_aux argument provided by the verifier implicitly. This convention will be taken advantage of with the upcoming KF_IMPLICIT_ARGS feature to preserve backwards compatibility to BPF programs.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20251104-implv2-v3-1-4772b9ae0e06@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Ihor Solodrai <ihor.solodrai@linux.dev>
2025-11-04  sched_ext: Minor cleanups to scx_task_iter  Tejun Heo  -6/+5

- Use memset() in scx_task_iter_start() instead of zeroing fields individually.
- In scx_task_iter_next(), move __scx_task_iter_maybe_relock() after the batch check, which is simpler.
- Update comment to reflect that tasks are removed from scx_tasks when dead (commit 7900aa699c34 ("sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()")).

No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-04  sched_ext: Move __SCX_DSQ_ITER_ALL_FLAGS BUILD_BUG_ON to the right place  Tejun Heo  -3/+2

The BUILD_BUG_ON() which checks that __SCX_DSQ_ITER_ALL_FLAGS doesn't overlap with the private lnode bits was in scx_task_iter_start(), which has nothing to do with DSQ iteration. Move it to bpf_iter_scx_dsq_new() where it belongs.

No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-04  Merge branch 'topic/func-profiler-offset' of git://git.kernel.org/pub/scm/linux/kernel/git/mhiramat/linux into trace/trace/core  Steven Rostedt  -154/+197

Updates to the function profiler add new options to tracefs. The options are currently defined by an enum as flags. The added options bring the number of options over 32, which means they can no longer be held in a 32-bit enum. The TRACE_ITER_* flags are converted to a macro TRACE_ITER(*) to allow the creation of options to still be done by macros. This change is intrusive, as it affects all TRACE_ITER* options throughout the trace code.

Merge the branch that added these options and converted the TRACE_ITER_* enum into a TRACE_ITER(*) macro, to allow the topic branches to still be developed without conflict.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-04  tracing: Add an option to show symbols in _text+offset for function profiler  Masami Hiramatsu (Google)  -4/+38

Function profiler shows the hit count of each function using its symbol name. However, there are some same-name local symbols which we cannot distinguish. To solve this issue, this introduces an option to show the symbols in "_text+OFFSET" format. This can avoid exposing the random shift of KASLR. The functions in modules are shown as "MODNAME+OFFSET" where the offset is from ".text".

E.g. for kernel text symbols, specify vmlinux and pass the output to addr2line to find the actual function and source info:

    $ addr2line -fie vmlinux _text+3078208
    __balance_callbacks
    kernel/sched/core.c:5064

For modules, specify the module file and .text+OFFSET:

    $ addr2line -fie samples/trace_events/trace-events-sample.ko .text+8224
    do_simple_thread_func
    samples/trace_events/trace-events-sample.c:23

Link: https://lore.kernel.org/all/176187878064.994619.8878296550240416558.stgit@devnote2/
Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-11-04  tracing: Allow tracer to add more than 32 options  Masami Hiramatsu (Google)  -150/+159

Since enum trace_iterator_flags is 32 bits, the maximum number of option flags is limited to 32, all of which are now used. To add a new option, we need to expand it. So replace the TRACE_ITER_##flag with a TRACE_ITER(flag) macro, which is a 64-bit bitmask.

Link: https://lore.kernel.org/all/176187877103.994619.166076000668757232.stgit@devnote2/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
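A hedged sketch of the conversion pattern; the bit-index enum names are assumptions, but the TRACE_ITER(flag) shape follows the commit message:

    /* Before: 32-bit enum mask, capped at 32 options. */
    enum trace_iterator_flags {
            TRACE_ITER_PRINT_PARENT = 1 << 0,
            TRACE_ITER_SYM_OFFSET   = 1 << 1,
            /* ... full at 32 entries ... */
    };

    /* After: the enum carries only bit indexes; the mask itself is a
     * 64-bit macro, so the option space can grow past 32. */
    enum trace_iterator_bits {
            TRACE_ITER_PRINT_PARENT_BIT,
            TRACE_ITER_SYM_OFFSET_BIT,
            /* ... */
    };
    #define TRACE_ITER(flag)        (1ULL << TRACE_ITER_##flag##_BIT)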
2025-11-04  cgroup: use credential guards in cgroup_attach_permissions()  Christian Brauner  -6/+4

Use credential guards for scoped credential override with automatic restoration on scope exit.

Link: https://patch.msgid.link/20251103-work-creds-guards-simple-v1-15-a3e156839e7f@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04  acct: use credential guards in acct_write_process()  Christian Brauner  -16/+13

Use credential guards for scoped credential override with automatic restoration on scope exit.

Link: https://patch.msgid.link/20251103-work-creds-guards-simple-v1-14-a3e156839e7f@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04  cred: make init_cred static  Christian Brauner  -27/+0

There's zero need to expose struct init_cred. The very few places that need access can just go through init_task, which is already exported.

Link: https://patch.msgid.link/20251103-work-creds-init_cred-v1-3-cb3ec8711a6a@kernel.org
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04  rseq: Switch to TIF_RSEQ if supported  Thomas Gleixner  -2/+8

TIF_NOTIFY_RESUME is a multiplexing TIF bit, which is suboptimal, especially with the RSEQ fast path depending on it but not really handling it. Define a separate TIF_RSEQ in the generic TIF space and enable the full separation of fast and slow path for architectures which utilize that.

That avoids the hassle with invocations of resume_user_mode_work() from hypervisors, which clear TIF_NOTIFY_RESUME. The re-evaluation that is therefore required at the end of vcpu_run() becomes a NOOP on architectures which utilize the generic TIF space and have a separate TIF_RSEQ.

The hypervisor TIF handling does not include the separate TIF_RSEQ as there is no point in doing so. The guest neither knows nor cares about the VMM host application's RSEQ state. That state is only relevant when the ioctl() returns to user space.

The fastpath implementation still utilizes TIF_NOTIFY_RESUME for failure handling, but this only happens within exit_to_user_mode_loop(), so arguably the hypervisor ioctl() code is long done when this happens.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.903622031@linutronix.de
2025-11-04  rseq: Switch to fast path processing on exit to user  Thomas Gleixner  -9/+25

Now that all bits and pieces are in place, hook the RSEQ handling fast path function into exit_to_user_mode_prepare() after the TIF work bits have been handled. In case of fast path failure, TIF_NOTIFY_RESUME has been raised and the caller needs to take another turn through the TIF handling slow path.

This only works for architectures which use the generic entry code. Architectures which still have their own incomplete hacks are not supported and won't be.

This results in the following improvements:

    Kernel build          Before             After            Reduction
    exit to user        80692981          80514451
    signal checks:         32581               121                  99%
    slowpath runs:       1201408  1.49%        198  0.00%          100%
    fastpath runs:                          675941  0.84%           N/A
    id updates:          1233989  1.53%      50541  0.06%           96%
    cs checks:           1125366  1.39%          0  0.00%          100%
    cs cleared:          1125366   100%          0   100%
    cs fixup:                  0     0%          0

    RSEQ selftests        Before             After            Reduction
    exit to user:      386281778         387373750
    signal checks:      35661203                 0                 100%
    slowpath runs:     140542396 36.38%        100  0.00%          100%
    fastpath runs:                         9509789  2.51%           N/A
    id updates:        176203599 45.62%    9087994  2.35%           95%
    cs checks:         175587856 45.46%    4728394  1.22%           98%
    cs cleared:        172359544 98.16%    1319307 27.90%           99%
    cs fixup:            3228312  1.84%    3409087 72.10%

The 'cs cleared' and 'cs fixup' percentages are not relative to the exit to user invocations, they are relative to the actual 'cs check' invocations.

While some of this could have been avoided in the original code, like the obvious clearing of CS when it's already clear, the main problem of going through TIF_NOTIFY_RESUME cannot be solved. In some workloads the RSEQ notify handler is invoked more than once before going out to user space. Doing this once when everything has stabilized is the only solution to avoid this.

The initial attempt to completely decouple it from the TIF work turned out to be suboptimal for workloads which do a lot of quick and short system calls. Even if the fast path decision is only 4 instructions (including a conditional branch), this adds up quickly and becomes measurable when the rate for actually having to handle rseq is in the low single digit percentage range of user/kernel transitions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.701201365@linutronix.de
2025-11-04  rseq: Implement fast path for exit to user  Thomas Gleixner  -0/+2

Implement the actual logic for handling RSEQ updates in a fast path after handling the TIF work and at the point where the task is actually returning to user space.

This is the right point to do that because at this point the CPU and the MM CID are stable and can no longer change due to yet another reschedule, which can happen when the task is handling this via TIF_NOTIFY_RESUME in resume_user_mode_work(), invoked from the exit to user mode work loop.

The function is invoked after the TIF work is handled and runs with interrupts disabled, which means it cannot resolve page faults. It therefore disables page faults and, in case the access to the user space memory faults, it:

    - notes the fail in the event struct
    - raises TIF_NOTIFY_RESUME
    - returns false to the caller

The caller has to go back to the TIF work, which runs with interrupts enabled and therefore can resolve the page faults. This happens mostly on fork() when the memory is marked COW.

If the user memory inspection finds invalid data, the function returns false as well and sets the fatal flag in the event struct along with TIF_NOTIFY_RESUME. The slow path notify handler has to evaluate that flag and terminate the task with SIGSEGV as documented.

The initial decision to invoke any of this is based on one flag in the event struct: @sched_switch. The decision is in pseudo ASM:

    load  tsk::event::sched_switch
    jnz   inspect_user_space
    mov   $0, tsk::event::events
    ...
    leave

So for the common case where the task was not scheduled out, this really boils down to three instructions before going out, if the compiler is not completely stupid (and yes, some of them are).

If the condition is true, then it checks whether CPU ID or MM CID have changed. If so, then the CPU/MM IDs have to be updated and are thereby cached for the next round. The update unconditionally retrieves the user space critical section address to spare another user*begin/end() pair. If that's not zero and tsk::event::user_irq is set, then the critical section is analyzed and acted upon. If it is either zero or the entry came via syscall, the critical section analysis is skipped.

If the comparison is false, then the critical section has to be analyzed, because the event flag is then only true when the entry from user was by interrupt.

This is provided without the actual hookup to let reviewers focus on the implementation details. The hookup happens in the next step.

Note: As with quite some other optimizations this depends on the generic entry infrastructure and is not enabled to be sucked into random architecture implementations.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.638929615@linutronix.de
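A hedged C sketch of the decision logic described above; the struct layout (t->rseq.event) follows the series' descriptions, and rseq_update_ids_and_cs() is a hypothetical helper standing in for the user-memory inspection:

    /* Fast path on exit to user: interrupts are disabled, so page
     * faults are handled by bailing out to the TIF_NOTIFY_RESUME
     * slow path. Returns false when the caller must loop back. */
    static bool rseq_exit_fastpath(struct task_struct *t, struct pt_regs *regs)
    {
            /* Common case: no context switch since the last exit. */
            if (likely(!t->rseq.event.sched_switch)) {
                    t->rseq.event.events = 0;
                    return true;
            }

            pagefault_disable();
            if (!rseq_update_ids_and_cs(t, regs)) {  /* hypothetical helper */
                    pagefault_enable();
                    /* Fault or invalid data: redo via the slow path,
                     * which runs with interrupts enabled. */
                    set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
                    return false;
            }
            pagefault_enable();

            t->rseq.event.events = 0;
            return true;
    }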
2025-11-04  rseq: Optimize event setting  Thomas Gleixner  -3/+11

After removing the various condition bits earlier, it turns out that one extra piece of information is needed to avoid setting event::sched_switch and TIF_NOTIFY_RESUME unconditionally on every context switch. The update of the RSEQ user space memory is only required when either the task was interrupted in user space and schedules, or the CPU or MM CID changes in schedule() independent of the entry mode.

Right now only the interrupt-from-user information is available. Add an event flag which is set when the CPU or MM CID or both change. Evaluate this event in the scheduler to decide whether the sched_switch event and the TIF bit need to be set.

It's an extra conditional in context_switch(), but the downside of unconditionally handling RSEQ after a context switch to user is way more significant. The utilized boolean logic minimizes this to a single conditional branch.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.578058898@linutronix.de
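A hedged sketch of the scheduler-side evaluation; the member names follow the series' descriptions, and "ids_changed" in particular is an assumed name for the new event flag:

    /* Called from context_switch(): a single branch decides whether
     * RSEQ work is needed on the way back to user space. */
    static inline void rseq_sched_switch_event(struct task_struct *t)
    {
            struct rseq_event *ev = &t->rseq.event;

            /* Task has RSEQ and was either interrupted in user space
             * or changed CPU/MM CID: bitwise-or keeps it one branch. */
            if (ev->has_rseq && (ev->user_irq | ev->ids_changed)) {
                    ev->sched_switch = true;
                    set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
            }
    }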
2025-11-04  rseq: Rework the TIF_NOTIFY handler  Thomas Gleixner  -43/+33

Replace the whole logic with a new implementation, which is shared with signal delivery and the upcoming exit fast path.

Contrary to the original implementation, this ignores invocations from KVM/io_uring, which invoke resume_user_mode_work() with the @regs argument set to NULL. The original implementation updated the CPU/Node/MM CID fields, but that was just a side effect, which was addressing the problem that this invocation cleared TIF_NOTIFY_RESUME, which in turn could cause an update on return to user space to be lost. This problem has been addressed differently, so it is no longer required to do that update before entering the guest.

That might be considered a user visible change when the host thread's TLS memory is mapped into the guest, but as this was never intentionally supported, this abuse of kernel internal implementation details is not considered an ABI break.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.517640811@linutronix.de
2025-11-04  rseq: Separate the signal delivery path  Thomas Gleixner  -8/+22

Completely separate the signal delivery path from the notify handler as they have different semantics versus the event handling.

The signal delivery only needs to ensure that the interrupted user context was not in a critical section or that the section is aborted before it switches to the signal frame context. The signal frame context does not have the original instruction pointer anymore, so that can't be handled on exit to user space.

No point in updating the CPU/CID ids as they might change again before the task returns to user space for real.

The fast path optimization, which checks for the 'entry from user via interrupt' condition, is only available for architectures which use the generic entry code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.455429038@linutronix.de
2025-11-04  rseq: Provide and use rseq_set_ids()  Thomas Gleixner  -186/+50

Provide a new and straightforward implementation to set the IDs (CPU ID, Node ID and MM CID), which can be later inlined into the fast path. It does all operations in one scoped_user_rw_access() section and also retrieves the critical section member (rseq::rseq_cs) from user space to avoid another user..begin/end() pair. This is in preparation for optimizing the fast path to avoid extra work when not required.

On rseq registration, set the CPU ID fields to RSEQ_CPU_ID_UNINITIALIZED and node and MM CID to zero. That's the same as the kernel internal reset values. That makes the debug validation in the exit code work correctly on the first exit to user space.

Use it to replace the whole related zoo in rseq.c.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.393972266@linutronix.de
2025-11-04  rseq: Use static branch for syscall exit debug when GENERIC_IRQ_ENTRY=y  Thomas Gleixner  -2/+8

Make the syscall exit debug mechanism available via the static branch on architectures which utilize the generic entry code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.333440475@linutronix.de
2025-11-04  rseq: Replace the original debug implementation  Thomas Gleixner  -69/+12

Just utilize the new infrastructure and put the original one to rest.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.212510692@linutronix.de
2025-11-04  rseq: Provide and use rseq_update_user_cs()  Thomas Gleixner  -171/+75

Provide a straightforward implementation to check for and eventually clear/fixup critical sections in user space. The non-debug version does only the minimal sanity checks and aims for efficiency. There are two attack vectors which are checked for:

    1) An abort IP which is in the kernel address space. That would
       cause at least x86 to return to kernel space via IRET.

    2) A rogue critical section descriptor with an abort IP pointing
       to some arbitrary address, which is not preceded by the RSEQ
       signature.

If the section descriptors are invalid, then the resulting misbehaviour of the user space application is not the kernel's problem.

The kernel provides a run-time switchable debug slow path, which implements the full zoo of checks, including termination of the task when one of the gazillion conditions is not met.

Replace the zoo in rseq.c with it and invoke it from the TIF_NOTIFY_RESUME handler. Move the remainders into the CONFIG_DEBUG_RSEQ section, which will be replaced and removed in a subsequent step.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.151465632@linutronix.de
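A hedged sketch of the two non-debug sanity checks; the signature layout (a 32-bit signature immediately preceding the abort handler) follows the existing RSEQ ABI, while the helper name and the task_struct field location are assumptions:

    /* Minimal checks on a critical-section descriptor's abort IP. */
    static bool rseq_abort_ip_ok(struct task_struct *t, unsigned long abort_ip)
    {
            u32 sig;

            /* 1) Never return to a kernel address via the abort path. */
            if (abort_ip >= TASK_SIZE)
                    return false;

            /* 2) The abort IP must be preceded by the RSEQ signature
             *    that the task registered via sys_rseq(). */
            if (get_user(sig, (u32 __user *)(abort_ip - sizeof(u32))))
                    return false;

            return sig == t->rseq_sig;
    }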
2025-11-04  rseq: Provide static branch for runtime debugging  Thomas Gleixner  -4/+69

Config based debug is rarely turned on and is not available easily when things go wrong. Provide a static branch to allow permanent integration of debug mechanisms along with the usual toggles in Kconfig, command line and debugfs.

Requested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.089270547@linutronix.de
2025-11-04  rseq: Expose lightweight statistics in debugfs  Thomas Gleixner  -7/+72

Analyzing the call frequency without actually using tracing is helpful for analysis of this infrastructure. The overhead is minimal, as it just increments a per CPU counter associated with each operation. The debugfs readout provides a racy sum of all counters.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.027916598@linutronix.de
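A hedged sketch of the mechanism; struct, counter and function names are assumptions, the point is the single per-CPU increment and the racy-by-design sum:

    #include <linux/percpu.h>
    #include <linux/seq_file.h>

    /* One counter per operation of interest. */
    struct rseq_stats {
            unsigned long   exit;
            unsigned long   fastpath;
            unsigned long   slowpath;
    };
    static DEFINE_PER_CPU(struct rseq_stats, rseq_stats);

    /* Increment is a single per-CPU add, no atomics, no locks. */
    #define rseq_stat_inc(which)    raw_cpu_inc(rseq_stats.which)

    /* debugfs readout: sums across CPUs without synchronization. */
    static int rseq_stats_show(struct seq_file *m, void *p)
    {
            struct rseq_stats sum = { };
            int cpu;

            for_each_possible_cpu(cpu) {    /* racy by design */
                    sum.exit     += per_cpu(rseq_stats.exit, cpu);
                    sum.fastpath += per_cpu(rseq_stats.fastpath, cpu);
                    sum.slowpath += per_cpu(rseq_stats.slowpath, cpu);
            }
            seq_printf(m, "exit: %lu\nfast: %lu\nslow: %lu\n",
                       sum.exit, sum.fastpath, sum.slowpath);
            return 0;
    }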
2025-11-04  rseq: Provide tracepoint wrappers for inline code  Thomas Gleixner  -1/+18

Provide tracepoint wrappers for the upcoming RSEQ exit to user space inline fast path, so that the header can be safely included by code which defines actual trace points.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.967114316@linutronix.de
2025-11-04  rseq: Cache CPU ID and MM CID values  Thomas Gleixner  -0/+4

In preparation for rewriting the RSEQ exit to user space handling, provide storage to cache the CPU ID and MM CID values which were written to user space. That prepares for a quick check, which avoids the update when nothing changed.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.841964081@linutronix.de
2025-11-04  entry: Inline irqentry_enter/exit_from/to_user_mode()  Thomas Gleixner  -13/+0

There is no point to have this as a function which just inlines enter_from_user_mode(). The function call overhead is larger than the function itself.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.715309918@linutronix.de
2025-11-04  entry: Remove syscall_enter_from_user_mode_prepare()  Thomas Gleixner  -8/+0

Open code the only user in the x86 syscall code and reduce the zoo of functions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.652839989@linutronix.de
2025-11-04  rseq: Introduce struct rseq_data  Thomas Gleixner  -35/+34

In preparation for a major rewrite of this code, provide a data structure for rseq management. Put all the rseq related data into it (except for the debug part), which allows simplifying fork/execve by using memset() and memcpy() instead of adding new fields to initialize over and over.

Create a storage struct for event management as well and put the sched_switch event and an indicator for RSEQ on a task into it as a start. That uses a union, which allows masking and clearing the whole lot efficiently.

The indicators are explicitly not a bit field. Bit fields generate abysmal code. The boolean members are defined as u8 as that actually guarantees that it fits. There seem to be strange architecture ABIs which need more than 8 bits for a boolean.

The has_rseq member is redundant vs. task::rseq, but it turns out that boolean operations and quick checks on the union generate better code than fiddling with separate entities and data types.

This struct will be extended over time to carry more information.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.527086690@linutronix.de
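A hedged sketch of the layout the text describes; the exact members are assumptions, but the nested union matches the stated goal of clearing all transient events in one store (cf. "mov $0, tsk::event::events" in the fast path commit) without touching the persistent has_rseq indicator:

    /* Event management: u8 members, deliberately not a bit field. */
    struct rseq_event {
            union {
                    u32     all;
                    struct {
                            union {
                                    u16     events;
                                    struct {
                                            u8      sched_switch;   /* scheduled out */
                                            u8      user_irq;       /* entry via interrupt */
                                    };
                            };
                            u8      has_rseq;       /* task registered RSEQ */
                    };
            };
    };

    /* Per-task rseq management data, memset/memcpy friendly. */
    struct rseq_data {
            struct rseq __user      *usrptr;
            u32                     len;
            u32                     sig;
            struct rseq_event       event;
            /* cached CPU/MM CID values etc. added by later patches */
    };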
2025-11-04  rseq: Avoid CPU/MM CID updates when no event pending  Thomas Gleixner  -5/+6

There is no need to update these values unconditionally if there is no event pending.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.462964916@linutronix.de
2025-11-04  rseq, virt: Retrigger RSEQ after vcpu_run()  Thomas Gleixner  -37/+41

Hypervisors invoke resume_user_mode_work() before entering the guest, which clears TIF_NOTIFY_RESUME. The @regs argument is NULL as there is no user space context available to them, so the rseq notify handler skips inspecting the critical section, but updates the CPU/MM CID values unconditionally so that the eventual pending rseq event is not lost on the way to user space.

This is a pointless exercise as the task might be rescheduled before actually returning to user space and it creates unnecessary work in the vcpu_run() loops.

It's way more efficient to ignore that invocation based on @regs == NULL and let the hypervisors re-raise TIF_NOTIFY_RESUME after returning from the vcpu_run() loop before returning from the ioctl(). This ensures that a pending RSEQ update is not lost and the IDs are updated before returning to user space.

Once the RSEQ handling is decoupled from TIF_NOTIFY_RESUME, this turns into a NOOP.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20251027084306.399495855@linutronix.de