path: root/kernel
Age | Commit message | Author | Lines
2025-11-10 | tracing: Report wrong dynamic event command | Masami Hiramatsu (Google) | -2/+9
Report wrong dynamic event type in the command via error_log. ----- # echo "z hoge" > /sys/kernel/tracing/dynamic_events sh: write error: Invalid argument # cat /sys/kernel/tracing/error_log [ 22.977022] dynevent: error: No matching dynamic event type Command: z hoge ^ ----- Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/176278970056.343441.10528135217342926645.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-10 | tracing: Use switch statement instead of ifs in set_tracer_flag() | Steven Rostedt | -15/+23
The "mask" passed in to set_tracer_flag() has a single bit set. The function then checks if the mask is equal to one of the option masks and performs the appropriate function associated with that option. Instead of having a bunch of "if ()" statements, use a "switch ()" statement to make it cleaner and a bit more optimal. No functional changes. Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251106003501.890298562@kernel.org Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
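The single-bit dispatch described in this commit can be sketched in plain C as follows. This is only an illustration under stated assumptions: the option names, the `opt_state` structure, and `set_flag()` are hypothetical and are not the kernel's set_tracer_flag() or its real option masks.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical option bits; the real tracing option masks are not shown in this log. */
#define OPT_RECORD_CMD (1u << 0)
#define OPT_OVERWRITE  (1u << 1)
#define OPT_PRINTK     (1u << 2)

/* Counters standing in for the per-option side effects. */
struct opt_state {
	int record_cmd_calls;
	int overwrite_calls;
	int printk_calls;
	uint32_t flags;
};

/* Returns 0 on success, -1 if mask does not have exactly one bit set. */
static int set_flag(struct opt_state *st, uint32_t mask, int enabled)
{
	if (mask == 0 || (mask & (mask - 1)) != 0)
		return -1;

	/* Because only one bit is set, the mask can select a case directly. */
	switch (mask) {
	case OPT_RECORD_CMD:
		st->record_cmd_calls++;
		break;
	case OPT_OVERWRITE:
		st->overwrite_calls++;
		break;
	case OPT_PRINTK:
		st->printk_calls++;
		break;
	default:
		/* Option with no extra action attached. */
		break;
	}

	if (enabled)
		st->flags |= mask;
	else
		st->flags &= ~mask;
	return 0;
}
```

A switch only works here because the mask is a single bit; a combined mask would have to be tested bit by bit, which is why the sketch rejects it.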
2025-11-10 | tracing: Exit out immediately after update_marker_trace() | Steven Rostedt | -1/+4
The call to update_marker_trace() in set_tracer_flag() performs the update to the tr->trace_flags. There's no reason to perform it again after it is called. Return immediately instead. Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251106003501.726406870@kernel.org Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-10 | tracing: Have add_tracer_options() error pass up to callers | Steven Rostedt | -21/+34
The function add_tracer_options() can fail, but currently it is ignored. Pass the status of add_tracer_options() up to adding a new tracer as well as when an instance is created. Have the instance creation fail if the add_tracer_options() fail. Only print a warning for the top level instance, like it does with other failures. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251105161935.375299297@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-10 | tracing: Remove dummy options and flags | Steven Rostedt | -32/+16
When a tracer does not define their own flags, dummy options and flags are used so that the values are always valid. There's not that many locations that reference these values so having dummy versions just complicates the code. Remove the dummy values and just check for NULL when appropriate. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251105161935.206093132@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-10 | tracing: Hide __NR_utimensat and _NR_mq_timedsend when not defined | Steven Rostedt | -0/+4
Some architectures (riscv-32) do not define __NR_utimensat and _NR_mq_timedsend, and fail to build when they are used. Hide them in "ifdef"s. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251104205310.00a1db9a@batman.local.home Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202511031239.ZigDcWzY-lkp@intel.com/ Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-10 | bpf: Export necessary symbols for modules with struct_ops | D. Wythe | -0/+3
Export three symbols needed to implement struct_ops in a tristate subsystem: bpf_struct_ops_get() and bpf_struct_ops_put(), which the inline helpers bpf_try_module_get() and bpf_module_put() call conditionally to hold or release a struct_ops reference, and bpf_obj_name_cpy(), which copies an object name from one place to another with validity checks. Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20251107035632.115950-2-alibuda@linux.alibaba.com
2025-11-10 | bpf: Unclone skb head on bpf_dynptr_write to skb metadata | Jakub Sitnicki | -4/+2
Currently bpf_dynptr_from_skb_meta() marks the dynptr as read-only when the skb is cloned, preventing writes to metadata. Remove this restriction and unclone the skb head on bpf_dynptr_write() to metadata, now that the metadata is preserved during uncloning. This makes metadata dynptr consistent with skb dynptr, allowing writes regardless of whether the skb is cloned. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-3-5ceb08a9b37b@cloudflare.com
2025-11-10 | workqueue: Remove unused assert_rcu_or_wq_mutex_or_pool_mutex | zhang jiao | -6/+0
assert_rcu_or_wq_mutex_or_pool_mutex is never referenced in the code. Just remove it. Signed-off-by: zhang jiao <zhangjiao2@cmss.chinamobile.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-10 | printk_ringbuffer: Create a helper function to decide whether more space is needed | Petr Mladek | -4/+28
The decision whether some more space is needed is tricky in the printk ring buffer code:
1. The given lpos values might overflow. A subtraction must be used instead of a simple "lower than" check.
2. Another CPU might reuse the space in the mean time. It can be detected when the subtraction is bigger than DATA_SIZE(data_ring).
3. There is exactly enough space when the result of the subtraction is zero. But more space is needed when the result is exactly DATA_SIZE(data_ring).
Add a helper function to make sure that the check is done correctly in all situations. Also it helps to make the code consistent and better documented. Suggested-by: John Ogness <john.ogness@linutronix.de> Link: https://lore.kernel.org/r/87tsz7iea2.fsf@jogness.linutronix.de Reviewed-by: John Ogness <john.ogness@linutronix.de> Link: https://patch.msgid.link/20251107194720.1231457-3-pmladek@suse.com [pmladek@suse.com: Updated wording as suggested by John] Signed-off-by: Petr Mladek <pmladek@suse.com>
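The three cases listed in this commit message can be folded into a single unsigned comparison. The following is a minimal userspace sketch, not the helper added by the patch: the function name, the parameter names, and passing the ring size as a plain argument are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: does the tail at tail_lpos have to be pushed
 * forward so that data up to want_lpos fits into a ring of ring_size bytes?
 */
static bool need_more_space(unsigned long ring_size,
			    unsigned long tail_lpos,
			    unsigned long want_lpos)
{
	/* Unsigned subtraction is defined modulo 2^N, so it survives lpos overflow. */
	unsigned long diff = want_lpos - tail_lpos;

	/*
	 * diff == 0:            diff - 1 wraps to ULONG_MAX, result is false
	 *                       (exactly enough space).
	 * 1 <= diff <= size:    result is true (more space is needed).
	 * diff > size:          result is false (the space was already
	 *                       reused by someone else).
	 */
	return diff - 1 < ring_size;
}
```

Subtracting one before the comparison is what lets a single test treat zero as "enough space" while still treating a difference equal to the ring size as "more space needed".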
2025-11-10 | printk_ringbuffer: Fix check of valid data size when blk_lpos overflows | Petr Mladek | -3/+6
The commit 67e1b0052f6bb8 ("printk_ringbuffer: don't needlessly wrap data blocks around") allows to use the last 4 bytes of the ring buffer. But the check for the @data_size was not properly updated in get_data(). It fails when "blk_lpos->next" overflows to "0". In this case: + is_blk_wrapped(data_ring, blk_lpos->begin, blk_lpos->next) returns "false" because it checks "blk_lpos->next - 1". + "blk_lpos->begin < blk_lpos->next" fails because "blk_lpos->next" is already 0. + is_blk_wrapped(data_ring, blk_lpos->begin + DATA_SIZE(data_ring), blk_lpos->next) returns "false" because "begin_lpos" is from the next wrap but "next_lpos - 1" is from the previous one. As a result, get_data() triggers the WARN_ON_ONCE() for "Illegal block description", for example: [ 216.317316][ T7652] loop0: detected capacity change from 0 to 16 ** 1 printk messages dropped ** [ 216.327750][ T7652] ------------[ cut here ]------------ [ 216.327789][ T7652] WARNING: kernel/printk/printk_ringbuffer.c:1278 at get_data+0x48a/0x840, CPU#1: syz.0.585/7652 [ 216.327848][ T7652] Modules linked in: [ 216.327907][ T7652] CPU: 1 UID: 0 PID: 7652 Comm: syz.0.585 Not tainted syzkaller #0 PREEMPT(full) [ 216.327933][ T7652] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/02/2025 [ 216.327953][ T7652] RIP: 0010:get_data+0x48a/0x840 [ 216.327986][ T7652] Code: 83 c4 f8 48 b8 00 00 00 00 00 fc ff df 41 0f b6 04 07 84 c0 0f 85 ee 01 00 00 44 89 65 00 49 83 c5 08 eb 13 e8 a7 19 1f 00 90 <0f> 0b 90 eb 05 e8 9c 19 1f 00 45 31 ed 4c 89 e8 48 83 c4 28 5b 41 [ 216.328007][ T7652] RSP: 0018:ffffc900035170e0 EFLAGS: 00010293 [ 216.328029][ T7652] RAX: ffffffff81a1eee9 RBX: 00003fffffffffff RCX: ffff888033255b80 [ 216.328048][ T7652] RDX: 0000000000000000 RSI: 00003fffffffffff RDI: 0000000000000000 [ 216.328063][ T7652] RBP: 0000000000000012 R08: 0000000000000e55 R09: 000000325e213cc7 [ 216.328079][ T7652] R10: 000000325e213cc7 R11: 00001de4c2000037 R12: 0000000000000012 [ 216.328095][ 
T7652] R13: 0000000000000000 R14: ffffc90003517228 R15: 1ffffffff1bca646 [ 216.328111][ T7652] FS: 00007f44eb8da6c0(0000) GS:ffff888125fda000(0000) knlGS:0000000000000000 [ 216.328131][ T7652] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 216.328147][ T7652] CR2: 00007f44ea9722e0 CR3: 0000000066344000 CR4: 00000000003526f0 [ 216.328168][ T7652] Call Trace: [ 216.328178][ T7652] <TASK> [ 216.328199][ T7652] _prb_read_valid+0x672/0xa90 [ 216.328328][ T7652] ? desc_read+0x1b8/0x3f0 [ 216.328381][ T7652] ? __pfx__prb_read_valid+0x10/0x10 [ 216.328422][ T7652] ? panic_on_this_cpu+0x32/0x40 [ 216.328450][ T7652] prb_read_valid+0x3c/0x60 [ 216.328482][ T7652] printk_get_next_message+0x15c/0x7b0 [ 216.328526][ T7652] ? __pfx_printk_get_next_message+0x10/0x10 [ 216.328561][ T7652] ? __lock_acquire+0xab9/0xd20 [ 216.328595][ T7652] ? console_flush_all+0x131/0xb10 [ 216.328621][ T7652] ? console_flush_all+0x478/0xb10 [ 216.328648][ T7652] console_flush_all+0x4cc/0xb10 [ 216.328673][ T7652] ? console_flush_all+0x131/0xb10 [ 216.328704][ T7652] ? __pfx_console_flush_all+0x10/0x10 [ 216.328748][ T7652] ? is_printk_cpu_sync_owner+0x32/0x40 [ 216.328781][ T7652] console_unlock+0xbb/0x190 [ 216.328815][ T7652] ? __pfx___down_trylock_console_sem+0x10/0x10 [ 216.328853][ T7652] ? __pfx_console_unlock+0x10/0x10 [ 216.328899][ T7652] vprintk_emit+0x4c5/0x590 [ 216.328935][ T7652] ? __pfx_vprintk_emit+0x10/0x10 [ 216.328993][ T7652] _printk+0xcf/0x120 [ 216.329028][ T7652] ? __pfx__printk+0x10/0x10 [ 216.329051][ T7652] ? kernfs_get+0x5a/0x90 [ 216.329090][ T7652] _erofs_printk+0x349/0x410 [ 216.329130][ T7652] ? __pfx__erofs_printk+0x10/0x10 [ 216.329161][ T7652] ? __raw_spin_lock_init+0x45/0x100 [ 216.329186][ T7652] ? __init_swait_queue_head+0xa9/0x150 [ 216.329231][ T7652] erofs_fc_fill_super+0x1591/0x1b20 [ 216.329285][ T7652] ? __pfx_erofs_fc_fill_super+0x10/0x10 [ 216.329324][ T7652] ? sb_set_blocksize+0x104/0x180 [ 216.329356][ T7652] ? 
setup_bdev_super+0x4c1/0x5b0 [ 216.329385][ T7652] get_tree_bdev_flags+0x40e/0x4d0 [ 216.329410][ T7652] ? __pfx_erofs_fc_fill_super+0x10/0x10 [ 216.329444][ T7652] ? __pfx_get_tree_bdev_flags+0x10/0x10 [ 216.329483][ T7652] vfs_get_tree+0x92/0x2b0 [ 216.329512][ T7652] do_new_mount+0x302/0xa10 [ 216.329537][ T7652] ? apparmor_capable+0x137/0x1b0 [ 216.329576][ T7652] ? __pfx_do_new_mount+0x10/0x10 [ 216.329605][ T7652] ? ns_capable+0x8a/0xf0 [ 216.329637][ T7652] ? kmem_cache_free+0x19b/0x690 [ 216.329682][ T7652] __se_sys_mount+0x313/0x410 [ 216.329717][ T7652] ? __pfx___se_sys_mount+0x10/0x10 [ 216.329836][ T7652] ? do_syscall_64+0xbe/0xfa0 [ 216.329869][ T7652] ? __x64_sys_mount+0x20/0xc0 [ 216.329901][ T7652] do_syscall_64+0xfa/0xfa0 [ 216.329932][ T7652] ? lockdep_hardirqs_on+0x9c/0x150 [ 216.329964][ T7652] ? entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 216.329988][ T7652] ? clear_bhb_loop+0x60/0xb0 [ 216.330017][ T7652] entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 216.330040][ T7652] RIP: 0033:0x7f44ea99076a [ 216.330080][ T7652] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb a6 e8 de 1a 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48 [ 216.330100][ T7652] RSP: 002b:00007f44eb8d9e68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 [ 216.330128][ T7652] RAX: ffffffffffffffda RBX: 00007f44eb8d9ef0 RCX: 00007f44ea99076a [ 216.330146][ T7652] RDX: 0000200000000180 RSI: 00002000000001c0 RDI: 00007f44eb8d9eb0 [ 216.330164][ T7652] RBP: 0000200000000180 R08: 00007f44eb8d9ef0 R09: 0000000000000000 [ 216.330181][ T7652] R10: 0000000000000000 R11: 0000000000000246 R12: 00002000000001c0 [ 216.330196][ T7652] R13: 00007f44eb8d9eb0 R14: 00000000000001a1 R15: 0000200000000080 [ 216.330233][ T7652] </TASK> Solve the problem by moving and fixing the sanity check. The problematic if-else-if-else code will just distinguish three basic scenarios: "regular" vs. "wrapped" vs. 
"too many times wrapped" block. The new sanity check is more precise. A valid "data_size" must be lower than half of the data buffer size. Also it must not be zero at this stage. It allows to catch problematic "data_size" even for wrapped blocks. Closes: https://lore.kernel.org/all/69096836.a70a0220.88fb8.0006.GAE@google.com/ Closes: https://lore.kernel.org/all/69078fb6.050a0220.29fc44.0029.GAE@google.com/ Fixes: 67e1b0052f6bb82 ("printk_ringbuffer: don't needlessly wrap data blocks around") Reviewed-by: John Ogness <john.ogness@linutronix.de> Tested-by: John Ogness <john.ogness@linutronix.de> Link: https://patch.msgid.link/20251107194720.1231457-2-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com>
2025-11-10 | ns: add asserts for active refcount underflow | Christian Brauner | -4/+14
Add a few more asserts to detect active reference count underflows. Link: https://patch.msgid.link/20251109-namespace-6-19-fixes-v1-6-ae8a4ad5a3b3@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-10 | ns: handle setns(pidfd, ...) cleanly | Christian Brauner | -17/+12
The setns() system call supports: (1) namespace file descriptors (nsfd) (2) process file descriptors (pidfd) When using nsfds the namespaces will remain active because they are pinned by the vfs. However, when pidfds are used things are more complicated. When the target task exits and passes through exit_nsproxy_namespaces() or is reaped and thus also passes through exit_cred_namespaces() after the setns()'ing task has called prepare_nsset() but before the active reference count of the set of namespaces it wants to setns() to might have been dropped already: P1 P2 pid_p1 = clone(CLONE_NEWUSER | CLONE_NEWNET | CLONE_NEWNS) pidfd = pidfd_open(pid_p1) setns(pidfd, CLONE_NEWUSER | CLONE_NEWNET | CLONE_NEWNS) prepare_nsset() exit(0) // ns->__ns_active_ref == 1 // parent_ns->__ns_active_ref == 1 -> exit_nsproxy_namespaces() -> exit_cred_namespaces() // ns_active_ref_put() will also put // the reference on the owner of the // namespace. If the only reason the // owning namespace was alive was // because it was a parent of @ns // it's active reference count now goes // to zero... -------------------------------- // | // ns->__ns_active_ref == 0 | // parent_ns->__ns_active_ref == 0 | | commit_nsset() -----------------> // If setns() // now manages to install the namespaces // it will call ns_active_ref_get() // on them thus bumping the active reference // count from zero again but without also // taking the required reference on the owner. // Thus we get: // // ns->__ns_active_ref == 1 // parent_ns->__ns_active_ref == 0 When later someone does ns_active_ref_put() on @ns it will underflow parent_ns->__ns_active_ref leading to a splat from our asserts thinking there are still active references when in fact the counter just underflowed. So resurrect the ownership chain if necessary as well. 
If the caller succeeded in grabbing passive references to the set of namespaces, the setns() should simply succeed even if the target task exits or gets reaped in the meantime and thus has dropped all active references to its namespaces. The race is rare and can only be triggered when using pidfds to setns() to namespaces. Also note that active references on initial namespaces are nops. Since we now always handle parent references directly we can drop ns_ref_active_get_owner() when adding a namespace to a namespace tree. This is now all handled uniformly in the places where the new namespaces actually become active. Link: https://patch.msgid.link/20251109-namespace-6-19-fixes-v1-5-ae8a4ad5a3b3@kernel.org Fixes: 3c9820d5c64a ("ns: add active reference count") Reported-by: syzbot+1957b26299cf3ff7890c@syzkaller.appspotmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-10 | ns: return EFAULT on put_user() error | Christian Brauner | -2/+2
Don't return EINVAL, return EFAULT just like we do in other system calls. Link: https://patch.msgid.link/20251109-namespace-6-19-fixes-v1-4-ae8a4ad5a3b3@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-10 | ns: make sure reference are dropped outside of rcu lock | Christian Brauner | -9/+23
The mount namespace may in fact sleep when putting the last passive reference so we need to drop the namespace reference outside of the rcu read lock. Do this by delaying the put until the next iteration where we've already moved on to the next namespace and legitimized it. Once we drop the rcu read lock to call put_user() we will also drop the reference to the previous namespace in the tree. Link: https://patch.msgid.link/20251109-namespace-6-19-fixes-v1-3-ae8a4ad5a3b3@kernel.org Fixes: 76b6f5dfb3fd ("nstree: add listns()") Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-10 | ns: don't increment or decrement initial namespaces | Christian Brauner | -0/+6
There's no need to bump the active reference counts of initial namespaces as they're always active and can simply remain at 1. Link: https://patch.msgid.link/20251109-namespace-6-19-fixes-v1-2-ae8a4ad5a3b3@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-10 | ns: don't skip active reference count initialization | Christian Brauner | -5/+4
Don't skip active reference count initialization for initial namespaces. Doing this will break network namespace active reference counting. Link: https://patch.msgid.link/20251109-namespace-6-19-fixes-v1-1-ae8a4ad5a3b3@kernel.org Fixes: 3a18f809184b ("ns: add active reference count") Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-09 | kho: warn and exit when unpreserved page wasn't preserved | Pratyush Yadav | -4/+4
Calling __kho_unpreserve() on a pair of (pfn, end_pfn) that wasn't preserved is a bug. Currently, if that is done, the physxa or bits can be NULL. This results in a soft lockup since a NULL physxa or bits results in redoing the loop without ever making any progress. Return when physxa or bits are not found, but WARN first to loudly indicate invalid behaviour. Link: https://lkml.kernel.org/r/20251103180235.71409-3-pratyush@kernel.org Fixes: fc33e4b44b27 ("kexec: enable KHO support for memory preservation") Signed-off-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-09 | kho: fix unpreservation of higher-order vmalloc preservations | Pratyush Yadav | -3/+4
kho_vmalloc_unpreserve_chunk() calls __kho_unpreserve() with end_pfn as pfn + 1. This happens to work for 0-order pages, but leaks higher order pages. For example, say order 2 pages back the allocation. During preservation, they get preserved in the order 2 bitmaps, but kho_vmalloc_unpreserve_chunk() would try to unpreserve them from the order 0 bitmaps, which should not have these bits set anyway, leaving the order 2 bitmaps untouched. This results in the pages being carried over to the next kernel. Nothing will free those pages in the next boot, leaking them. Fix this by taking the order into account when calculating the end PFN for __kho_unpreserve(). Link: https://lkml.kernel.org/r/20251103180235.71409-2-pratyush@kernel.org Fixes: a667300bd53f ("kho: add support for preserving vmalloc allocations") Signed-off-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
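The arithmetic behind this fix is small enough to spell out. The sketch below assumes the usual convention that an order-n page covers 2^n base pages; `range_end_pfn()` is a hypothetical name and is not taken from the patch.

```c
#include <assert.h>

/*
 * Hypothetical helper: exclusive end PFN of a page that starts at pfn
 * and has the given order (an order-n page spans 1 << n base pages).
 */
static unsigned long range_end_pfn(unsigned long pfn, unsigned int order)
{
	return pfn + (1UL << order);
}
```

With order 0 this reduces to `pfn + 1`, which is why the original code happened to work for 0-order pages; for order 2 the range is four PFNs wide, so `pfn + 1` would describe only the first base page.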
2025-11-09 | kho: fix out-of-bounds access of vmalloc chunk | Pratyush Yadav | -2/+2
The list of pages in a vmalloc chunk is NULL-terminated. So when looping through the pages in a vmalloc chunk, both kho_restore_vmalloc() and kho_vmalloc_unpreserve_chunk() rightly make sure to stop when encountering a NULL page. But when the chunk is full, the loops do not stop and go past the bounds of chunk->phys, resulting in out-of-bounds memory access, and possibly the restoration or unpreservation of an invalid page. Fix this by making sure the processing of chunk stops at the end of the array. Link: https://lkml.kernel.org/r/20251103110159.8399-1-pratyush@kernel.org Fixes: a667300bd53f ("kho: add support for preserving vmalloc allocations") Signed-off-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
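The pattern being fixed here is a walk over a fixed-size array that is terminated by a zero entry only when it is not full. A minimal sketch follows, assuming a made-up chunk layout: `CHUNK_SLOTS`, `struct chunk`, and `count_pages()` are illustrative names, not the KHO structures.

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK_SLOTS 4 /* hypothetical capacity of one chunk */

/* Hypothetical chunk: unused slots are zero, a full chunk has no terminator. */
struct chunk {
	unsigned long phys[CHUNK_SLOTS];
};

/* Counts the used slots without ever reading past the end of phys[]. */
static size_t count_pages(const struct chunk *c)
{
	size_t i;

	/* The bound check comes first, so phys[CHUNK_SLOTS] is never read. */
	for (i = 0; i < CHUNK_SLOTS && c->phys[i] != 0; i++)
		;
	return i;
}
```

Checking only for the zero entry is correct for a partially filled chunk but runs off the array when every slot is in use, which is the case the commit addresses.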
2025-11-09 | gcov: add support for GCC 15 | Peter Oberparleiter | -1/+3
Using gcov on kernels compiled with GCC 15 results in truncated 16-byte long .gcda files with no usable data. To fix this, update GCOV_COUNTERS to match the value defined by GCC 15. Tested with GCC 14.3.0 and GCC 15.2.0. Link: https://lkml.kernel.org/r/20251028115125.1319410-1-oberpar@linux.ibm.com Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com> Reported-by: Matthieu Baerts <matttbe@kernel.org> Closes: https://github.com/linux-test-project/lcov/issues/445 Tested-by: Matthieu Baerts <matttbe@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-09 | kho: allocate metadata directly from the buddy allocator | Pasha Tatashin | -3/+3
KHO allocates metadata for its preserved memory map using the slab allocator via kzalloc(). This metadata is temporary and is used by the next kernel during early boot to find preserved memory. A problem arises when KFENCE is enabled. kzalloc() calls can be randomly intercepted by kfence_alloc(), which services the allocation from a dedicated KFENCE memory pool. This pool is allocated early in boot via memblock. When booting via KHO, the memblock allocator is restricted to a "scratch area", forcing the KFENCE pool to be allocated within it. This creates a conflict, as the scratch area is expected to be ephemeral and overwriteable by a subsequent kexec. If KHO metadata is placed in this KFENCE pool, it leads to memory corruption when the next kernel is loaded. To fix this, modify KHO to allocate its metadata directly from the buddy allocator instead of slab. Link: https://lkml.kernel.org/r/20251021000852.2924827-4-pasha.tatashin@soleen.com Fixes: fc33e4b44b27 ("kexec: enable KHO support for memory preservation") Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: David Matlack <dmatlack@google.com> Cc: Alexander Graf <graf@amazon.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Samiullah Khawaja <skhawaja@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-09 | kho: increase metadata bitmap size to PAGE_SIZE | Pasha Tatashin | -10/+11
KHO memory preservation metadata is preserved in 512 byte chunks, which requires their allocation from the slab allocator. Slabs are not safe to be used with KHO because of kfence, and because partial slabs may lead to leaks to the next kernel. Change the size to be PAGE_SIZE. Kfence specifically may cause memory corruption, where it randomly provides slab objects that can be within the scratch area. The reason for that is that kfence allocates its objects before the KHO scratch is marked as a CMA region. While this change could potentially increase metadata overhead on systems with sparsely preserved memory, this is being mitigated by ongoing work to reduce sparseness during preservation via 1G guest pages. Furthermore, this change aligns with future work on a stateless KHO, which will also use page-sized bitmaps for its radix tree metadata. Link: https://lkml.kernel.org/r/20251021000852.2924827-3-pasha.tatashin@soleen.com Fixes: fc33e4b44b27 ("kexec: enable KHO support for memory preservation") Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Samiullah Khawaja <skhawaja@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-09 | kho: warn and fail on metadata or preserved memory in scratch area | Pasha Tatashin | -19/+93
Patch series "KHO: kfence + KHO memory corruption fix", v3. This series fixes a memory corruption bug in KHO that occurs when KFENCE is enabled. The root cause is that KHO metadata, allocated via kzalloc(), can be randomly serviced by kfence_alloc(). When a kernel boots via KHO, the early memblock allocator is restricted to a "scratch area". This forces the KFENCE pool to be allocated within this scratch area, creating a conflict. If KHO metadata is subsequently placed in this pool, it gets corrupted during the next kexec operation. Google is using KHO and have had obscure crashes due to this memory corruption, with stacks all over the place. I would prefer this fix to be properly backported to stable so we can also automatically consume it once we switch to the upstream KHO. Patch 1/3 introduces a debug-only feature (CONFIG_KEXEC_HANDOVER_DEBUG) that adds checks to detect and fail any operation that attempts to place KHO metadata or preserved memory within the scratch area. This serves as a validation and diagnostic tool to confirm the problem without affecting production builds. Patch 2/3 Increases bitmap to PAGE_SIZE, so buddy allocator can be used. Patch 3/3 Provides the fix by modifying KHO to allocate its metadata directly from the buddy allocator instead of slab. This bypasses the KFENCE interception entirely. This patch (of 3): It is invalid for KHO metadata or preserved memory regions to be located within the KHO scratch area, as this area is overwritten when the next kernel is loaded, and used early in boot by the next kernel. This can lead to memory corruption. Add checks to kho_preserve_* and KHO's internal metadata allocators (xa_load_or_alloc, new_chunk) to verify that the physical address of the memory does not overlap with any defined scratch region. If an overlap is detected, the operation will fail and a WARN_ON is triggered. 
To avoid performance overhead in production kernels, these checks are enabled only when CONFIG_KEXEC_HANDOVER_DEBUG is selected. [rppt@kernel.org: fix KEXEC_HANDOVER_DEBUG Kconfig dependency] Link: https://lkml.kernel.org/r/aQHUyyFtiNZhx8jo@kernel.org [pasha.tatashin@soleen.com: build fix] Link: https://lkml.kernel.org/r/CA+CK2bBnorfsTymKtv4rKvqGBHs=y=MjEMMRg_tE-RME6n-zUw@mail.gmail.com Link: https://lkml.kernel.org/r/20251021000852.2924827-1-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20251021000852.2924827-2-pasha.tatashin@soleen.com Fixes: fc33e4b44b27 ("kexec: enable KHO support for memory preservation") Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Mike Rapoport <rppt@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Samiullah Khawaja <skhawaja@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-08 | Merge tag 'sched-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | -10/+7
Pull scheduler fix from Ingo Molnar: "Fix a group-throttling bug in the fair scheduler" * tag 'sched-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
2025-11-08 | Merge tag 'perf-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | -5/+15
Pull perf event fix from Ingo Molnar: "Fix a system hang caused by cpu-clock events deadlock" * tag 'perf-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix system hang caused by cpu-clock usage
2025-11-08 | Merge tag 'locking-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | -6/+6
Pull locking fix from Ingo Molnar: "Fix (well, cut in half) a futex performance regression on PowerPC" * tag 'locking-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Optimize per-cpu reference counting
2025-11-07 | audit: merge loops in __audit_inode_child() | Ricardo Robaina | -24/+19
Whenever there's audit context, __audit_inode_child() gets called numerous times, which can lead to high latency in scenarios that create too many sysfs/debugfs entries at once, for instance, upon device_add_disk() invocation. # uname -r 6.18.0-rc2+ # auditctl -a always,exit -F path=/tmp -k foo # time insmod loop max_loop=1000 real 0m46.676s user 0m0.000s sys 0m46.405s # perf record -a insmod loop max_loop=1000 # perf report --stdio |grep __audit_inode_child 32.73% insmod [kernel.kallsyms] [k] __audit_inode_child __audit_inode_child() searches for both the parent and the child in two different loops that iterate over the same list. This process can be optimized by merging these into a single loop, without changing the function behavior or affecting the code's readability. This patch merges the two loops that walk through the list context->names_list into a single loop. This optimization resulted in around 51% performance enhancement for the benchmark. # uname -r 6.18.0-rc2-enhancedv3+ # auditctl -a always,exit -F path=/tmp -k foo # time insmod loop max_loop=1000 real 0m22.899s user 0m0.001s sys 0m22.652s Signed-off-by: Ricardo Robaina <rrobaina@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
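The loop merge described in this commit can be illustrated with a much simpler list. This is a sketch under assumptions, not the audit code: the node layout, the inode-number matching, and `find_both()` are hypothetical, and the real matching rules in __audit_inode_child() are more involved.

```c
#include <assert.h>
#include <stddef.h>

enum name_type { NAME_PARENT, NAME_CHILD };

/* Hypothetical list node standing in for an entry on a names list. */
struct name_entry {
	enum name_type type;
	int ino;
	struct name_entry *next;
};

struct lookup {
	struct name_entry *parent;
	struct name_entry *child;
};

/* Finds the parent and the child entry in one pass over the list. */
static struct lookup find_both(struct name_entry *head, int parent_ino, int child_ino)
{
	struct lookup r = { NULL, NULL };

	for (struct name_entry *n = head; n != NULL; n = n->next) {
		if (!r.parent && n->type == NAME_PARENT && n->ino == parent_ino)
			r.parent = n;
		else if (!r.child && n->type == NAME_CHILD && n->ino == child_ino)
			r.child = n;

		/* Stop early once both entries are known. */
		if (r.parent && r.child)
			break;
	}
	return r;
}
```

Two separate loops would visit every node up to twice; the merged form visits each node at most once, which matters when the list grows with every sysfs or debugfs entry created.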
2025-11-07 | audit: Use kzalloc() instead of kmalloc()/memset() in audit_krule_to_data() | Gongwei Li | -2/+1
Replace kmalloc+memset by kzalloc for better readability and simplicity. This addresses the warning below: WARNING: kzalloc should be used for data, instead of kmalloc/memset Signed-off-by: Gongwei Li <ligongwei@kylinos.cn> [PM: subject and description tweaks] Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-11-07 | Merge tag 'trace-v6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace | Linus Torvalds | -2/+8
Pull tracing fixes from Steven Rostedt:
- Check for reader catching up in ring_buffer_map_get_reader() If the reader catches up to the writer in the memory mapped ring buffer then calling rb_get_reader_page() will return NULL as there's no pages left. But this isn't checked for before calling rb_get_reader_page() and the return of NULL causes a warning. If it is detected that the reader caught up to the writer, then simply exit the routine
- Fix memory leak in histogram create_field_var() The couple of the error paths in create_field_var() did not properly clean up what was allocated. Make sure everything is freed properly on error
- Fix help message of tools latency_collector The help message incorrectly stated that "-t" was the same as "--threads" whereas "--threads" is actually represented by "-e"
* tag 'trace-v6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/tools: Fix incorrcet short option in usage text for --threads tracing: Fix memory leaks in create_field_var() ring-buffer: Do not warn in ring_buffer_map_get_reader() when reader catches up
2025-11-07PM: hibernate: Fix style issues in save_compressed_image()Mario Limonciello (AMD)-2/+3
Address two issues indicated by checkpatch: - Trailing statements should be on next line. - Prefer 'unsigned int' to bare use of 'unsigned'. Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> [ rjw: Changelog edits ] Link: https://patch.msgid.link/20251106045158.3198061-4-superm1@kernel.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-11-07PM: hibernate: Use atomic64_t for compressed_size variableMario Limonciello (AMD)-5/+5
`compressed_size` can overflow, showing nonsensical values. Change from `atomic_t` to `atomic64_t` to prevent overflow. Fixes: a06c6f5d3cc9 ("PM: hibernate: Move to crypto APIs for LZO compression") Reported-by: Askar Safin <safinaskar@gmail.com> Closes: https://lore.kernel.org/linux-pm/20251105180506.137448-1-safinaskar@gmail.com/ Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Tested-by: Askar Safin <safinaskar@gmail.com> Cc: 6.9+ <stable@vger.kernel.org> # 6.9+ Link: https://patch.msgid.link/20251106045158.3198061-3-superm1@kernel.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
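The overflow can be illustrated with C11 atomics in userspace. This is a sketch of the arithmetic only, assuming a 32-bit signed counter behaves like `atomic_t` and a 64-bit one like `atomic64_t`; the function names are hypothetical and nothing here is the hibernation code itself.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Accumulate chunk sizes in a 32-bit atomic counter (wraps on overflow). */
int32_t demo_total_32(unsigned int chunks, int32_t chunk_bytes)
{
    _Atomic int32_t total = 0;

    for (unsigned int i = 0; i < chunks; i++)
        atomic_fetch_add(&total, chunk_bytes);
    return atomic_load(&total);
}

/* Same accumulation in a 64-bit atomic counter (no wrap at these sizes). */
int64_t demo_total_64(unsigned int chunks, int32_t chunk_bytes)
{
    _Atomic int64_t total = 0;

    for (unsigned int i = 0; i < chunks; i++)
        atomic_fetch_add(&total, chunk_bytes);
    return atomic_load(&total);
}
```

Three chunks of 1 GiB sum to 3221225472 bytes, which fits in the 64-bit counter but wraps to a negative value in the 32-bit one. That is the kind of nonsensical size the commit refers to.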
2025-11-07PM: hibernate: Emit an error when image writing failsMario Limonciello (AMD)-4/+7
If image writing fails, a return code is passed up to the caller, but none of the callers log anything and so the only record of it is the return code that userspace gets. Adjust the logging so that the image size and speed of writing are only emitted on success and if there is an error, it's saved to the logs. Fixes: a06c6f5d3cc9 ("PM: hibernate: Move to crypto APIs for LZO compression") Reported-by: Askar Safin <safinaskar@gmail.com> Closes: https://lore.kernel.org/linux-pm/20251105180506.137448-1-safinaskar@gmail.com/ Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Tested-by: Askar Safin <safinaskar@gmail.com> Cc: 6.9+ <stable@vger.kernel.org> # 6.9+ [ rjw: Added missing braces after "else", changelog edits ] Link: https://patch.msgid.link/20251106045158.3198061-2-superm1@kernel.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-11-07refscale: Do not disable interrupts for tests involving local_bh_enable()Paul E. McKenney-4/+10
Some kernel configurations prohibit invoking local_bh_enable() while interrupts are disabled. However, refscale disables interrupts to reduce OS noise during the tests, which results in splats. This commit therefore adds an ->enable_irqs flag to the ref_scale_ops structure, and refrains from disabling interrupts when that flag is set. This flag is set for the "bh" and "incpercpubh" scale_type module-parameter values. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
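A rough userspace sketch of the per-ops flag follows. Only the `enable_irqs` field name comes from the commit; `struct demo_scale_ops`, the fake IRQ helpers, and `demo_run_measurement()` are hypothetical stand-ins, and the real measurement loop is certainly more involved.

```c
#include <stdbool.h>

/* Counts how often the fake "disable interrupts" helper was invoked. */
int demo_irq_disable_calls;

void demo_irq_disable(void) { demo_irq_disable_calls++; }
void demo_irq_enable(void) { }

/* Hypothetical, trimmed-down analogue of the ref_scale_ops structure. */
struct demo_scale_ops {
    void (*readsection)(int nloops);
    bool enable_irqs;          /* true: the test must run with IRQs enabled */
    const char *name;
};

/* Placeholder reader section that does no work. */
void demo_noop_section(int nloops) { (void)nloops; }

/* Disable interrupts around the measurement unless the ops forbid it. */
void demo_run_measurement(const struct demo_scale_ops *ops, int nloops)
{
    if (!ops->enable_irqs)
        demo_irq_disable();
    ops->readsection(nloops);
    if (!ops->enable_irqs)
        demo_irq_enable();
}
```

Keeping the decision in the ops table means each reader type declares its own constraint, so the measurement loop needs no knowledge of which tests call local_bh_enable().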
2025-11-07refscale: Add non-atomic per-CPU increment readersPaul E. McKenney-2/+153
This commit adds refscale readers based on READ_ONCE() and WRITE_ONCE() that are unprotected (can lose counts, "refscale.scale_type=incpercpu"), preempt-disabled ("refscale.scale_type=incpercpupreempt"), bh-disabled ("refscale.scale_type=incpercpubh"), and irq-disabled ("refscale.scale_type=incpercpuirqsave"). On my x86 laptop, these are about 4.3ns, 3.8ns, and 7.3ns per pair, respectively. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
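The unprotected variant can be sketched in userspace as below. The two macros are simplified stand-ins for the kernel's READ_ONCE() and WRITE_ONCE() (they rely on the GNU `__typeof__` extension), and `demo_counter` and `demo_read_section()` are hypothetical names standing in for a per-CPU variable and its reader loop.

```c
/* Simplified userspace stand-ins for the kernel's READ_ONCE()/WRITE_ONCE(). */
#define READ_ONCE(x)      (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v)  (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical counter standing in for one CPU's per-CPU variable. */
unsigned long demo_counter;

/*
 * Each iteration is one reader "pair": a non-atomic increment followed by a
 * non-atomic decrement. The load and the store are separate accesses, so an
 * interrupt or preemption between them can lose an update; that is why the
 * commit describes this variant as able to lose counts.
 */
void demo_read_section(unsigned long nloops)
{
    for (unsigned long i = 0; i < nloops; i++) {
        WRITE_ONCE(demo_counter, READ_ONCE(demo_counter) + 1);
        WRITE_ONCE(demo_counter, READ_ONCE(demo_counter) - 1);
    }
}
```

The preempt-disabled, bh-disabled, and irq-disabled variants named in the commit would wrap the same pair in the corresponding disable/enable calls, trading extra overhead for protection against the lost update.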
2025-11-07refscale: Add this_cpu_inc() readersPaul E. McKenney-4/+32
This commit adds refscale readers based on this_cpu_inc() and this_cpu_dec() ("refscale.scale_type=percpuinc"). On my x86 laptop, these are about 4.5ns per pair. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-11-07refscale: Add preempt_disable() readersPaul E. McKenney-1/+32
This commit adds refscale readers based on preempt_disable() and preempt_enable() ("refscale.scale_type=preempt"). On my x86 laptop, these are about 2.8ns. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-11-07refscale: Add local_bh_disable() readersPaul E. McKenney-1/+33
This commit adds refscale readers based on local_bh_disable() and local_bh_enable() ("refscale.scale_type=bh"). On my x86 laptop, these are about 4.9ns. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-11-07refscale: Add local_irq_disable() and local_irq_save() readersPaul E. McKenney-1/+65
This commit adds refscale readers based on local_irq_disable() and local_irq_enable() ("refscale.scale_type=irq") and on local_irq_save() and local_irq_restore() ("refscale.scale_type=irqsave"). On my x86 laptop, these are about 2.8ns and 7.5ns per enable/disable pair, respectively. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-11-07printk: nbcon: Allow unsafe write_atomic() for panicJohn Ogness-15/+32
There may be console drivers that have not yet figured out a way to implement safe atomic printing (->write_atomic() callback). These drivers could choose to only implement threaded printing (->write_thread() callback), but then it is guaranteed that _no_ output will be printed during panic. Not even attempted. As a result, developers may be tempted to implement unsafe ->write_atomic() callbacks and/or implement some sort of custom deferred printing trickery to try to make it work. This goes against the principal intention of the nbcon API as well as endangers other nbcon drivers that are doing things correctly (safely). As a compromise, allow nbcon drivers to implement unsafe ->write_atomic() callbacks by providing a new console flag CON_NBCON_ATOMIC_UNSAFE. When specified, the ->write_atomic() callback for that console will _only_ be called during the final "hope and pray" flush attempt at the end of a panic: nbcon_atomic_flush_unsafe(). Signed-off-by: John Ogness <john.ogness@linutronix.de> Link: https://lore.kernel.org/lkml/b2qps3uywhmjaym4mht2wpxul4yqtuuayeoq4iv4k3zf5wdgh3@tocu6c7mj4lt Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/all/swdpckuwwlv3uiessmtnf2jwlx3jusw6u7fpk5iggqo4t2vdws@7rpjso4gr7qp/ [1] Link: https://lore.kernel.org/all/20251103-fix_netpoll_aa-v4-1-4cfecdf6da7c@debian.org/ [2] Link: https://patch.msgid.link/20251027161212.334219-2-john.ogness@linutronix.de [pmladek@suse.com: Fix build with rework/nbcon-in-kdb branch.] Signed-off-by: Petr Mladek <pmladek@suse.com>
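The gating rule can be written as a small predicate. This is a sketch under assumptions: only the flag name CON_NBCON_ATOMIC_UNSAFE comes from the commit, while the `DEMO_` bit values and `demo_may_use_write_atomic()` are hypothetical and do not reflect the kernel's actual definitions or call sites.

```c
#include <stdbool.h>

/* Hypothetical bit values chosen for this sketch only. */
#define DEMO_CON_NBCON                0x1u
#define DEMO_CON_NBCON_ATOMIC_UNSAFE  0x2u

/*
 * Decide whether ->write_atomic() may be called for a console.
 * A console marked unsafe is only eligible during the final unsafe
 * flush at the end of a panic; a safe nbcon console is always eligible.
 */
bool demo_may_use_write_atomic(unsigned int flags, bool final_unsafe_flush)
{
    if (!(flags & DEMO_CON_NBCON))
        return false;
    if (flags & DEMO_CON_NBCON_ATOMIC_UNSAFE)
        return final_unsafe_flush;
    return true;
}
```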
2025-11-07srcu: Add SRCU_READ_FLAVOR_FAST_UPDOWN CPP macroPaul E. McKenney-6/+18
This commit adds the SRCU_READ_FLAVOR_FAST_UPDOWN=0x8 macro and adjusts rcutorture to make use of it. In this commit, both SRCU_READ_FLAVOR_FAST=0x4 and the new SRCU_READ_FLAVOR_FAST_UPDOWN test SRCU-fast. When the SRCU-fast-updown is added, the new SRCU_READ_FLAVOR_FAST_UPDOWN macro will test it when passed to the rcutorture.reader_flavor module parameter. The old SRCU_READ_FLAVOR_FAST macro's value changed from 0x8 to 0x4. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: <bpf@vger.kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-11-07rcu: Mark diagnostic functions as notracePaul E. McKenney-5/+5
The rcu_lockdep_current_cpu_online(), rcu_read_lock_sched_held(), rcu_read_lock_held(), rcu_read_lock_bh_held(), and rcu_read_lock_any_held() functions are used by tracing-related code paths, so putting traces on them is unlikely to make anyone happy. This commit therefore marks them all "notrace". Reported-by: Leon Hwang <leon.hwang@linux.dev> Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-11-06tracing: Fix memory leaks in create_field_var()Zilin Guan-2/+4
The function create_field_var() allocates memory for 'val' through create_hist_field() inside parse_atom(), and for 'var' through create_var(), which in turn allocates var->type and var->var.name internally. Simply calling kfree() to release these structures will result in memory leaks. Use destroy_hist_field() to properly free 'val', and explicitly release the memory of var->type and var->var.name before freeing 'var' itself. Link: https://patch.msgid.link/20251106120132.3639920-1-zilin@seu.edu.cn Fixes: 02205a6752f22 ("tracing: Add support for 'field variables'") Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
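The leak pattern can be reproduced with a small userspace analogue. Everything here is hypothetical (`struct demo_var`, the `demo_` helpers, and the allocation counter); it only illustrates why freeing the outer structure alone leaks the strings it owns.

```c
#include <stdlib.h>
#include <string.h>

/* Number of allocations currently outstanding, to make leaks visible. */
int demo_live_allocs;

void *demo_alloc(size_t n)
{
    void *p = calloc(1, n);

    if (p)
        demo_live_allocs++;
    return p;
}

void demo_free(void *p)
{
    if (p) {
        demo_live_allocs--;
        free(p);
    }
}

/* Hypothetical analogue of a variable that owns two strings. */
struct demo_var {
    char *type;
    char *name;
};

char *demo_strdup(const char *s)
{
    char *p = demo_alloc(strlen(s) + 1);

    if (p)
        strcpy(p, s);
    return p;
}

/* Release the owned members first, then the structure itself. */
void demo_destroy_var(struct demo_var *v)
{
    if (!v)
        return;
    demo_free(v->type);
    demo_free(v->name);
    demo_free(v);
}

struct demo_var *demo_create_var(const char *type, const char *name)
{
    struct demo_var *v = demo_alloc(sizeof(*v));

    if (!v)
        return NULL;
    v->type = demo_strdup(type);
    v->name = demo_strdup(name);
    if (!v->type || !v->name) {
        demo_destroy_var(v);   /* error path frees everything */
        return NULL;
    }
    return v;
}
```

Calling only `demo_free(v)` on an error path would leave two allocations outstanding, which mirrors the bug the commit fixes.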
2025-11-06ring-buffer: Do not warn in ring_buffer_map_get_reader() when reader catches upSteven Rostedt-0/+4
The function ring_buffer_map_get_reader() is a bit more strict than the other get reader functions, and except for certain situations the rb_get_reader_page() should not return NULL. If it does, it triggers a warning. This warning was triggering but after looking at why, it was because another acceptable situation was happening and it wasn't checked for. If the reader catches up to the writer and there's still data to be read on the reader page, then the rb_get_reader_page() will return NULL as there's no new page to get. In this situation, the reader page should not be updated and no warning should trigger. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Reported-by: syzbot+92a3745cea5ec6360309@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/690babec.050a0220.baf87.0064.GAE@google.com/ Link: https://lore.kernel.org/20251016132848.1b11bb37@gandalf.local.home Fixes: 117c39200d9d7 ("ring-buffer: Introducing ring-buffer mapping functions") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-06bpf: Use kmalloc_nolock() in range treePuranjay Mohan-15/+6
The range tree uses bpf_mem_alloc() that is safe to be called from all contexts and uses a pre-allocated pool of memory to serve these allocations. Replace bpf_mem_alloc() with kmalloc_nolock() as it can be called safely from all contexts and is more scalable than bpf_mem_alloc(). Remove the migrate_disable/enable pairs as they were only needed for bpf_mem_alloc() as it does per-cpu operations, kmalloc_nolock() doesn't need this. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20251106170608.4800-1-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-06cgroup: Fix sleeping from invalid context warning on PREEMPT_RTTejun Heo-1/+54
cgroup_task_dead() is called from finish_task_switch() which runs with preemption disabled and doesn't allow scheduling even on PREEMPT_RT. The function needs to acquire css_set_lock which is a regular spinlock that can sleep on RT kernels, leading to "sleeping function called from invalid context" warnings. css_set_lock is too large in scope to convert to a raw_spinlock. However, the unlinking operations don't need to run synchronously - they just need to complete after the task is done running. On PREEMPT_RT, defer the work through irq_work. While the work doesn't need to happen immediately, it can't be delayed indefinitely either as the dead task pins the cgroup and task_struct can be pinned indefinitely. Use the lazy version of irq_work to allow batching and lower impact while ensuring timely completion. v2: Use IRQ_WORK_INIT_LAZY instead of immediate irq_work and add explanation for why the work can't be delayed indefinitely (Sebastian Andrzej Siewior). Fixes: d245698d727a ("cgroup: Defer task cgroup unlink until after the task is done switching out") Reported-by: Calvin Owens <calvin@wbinvd.org> Link: https://lore.kernel.org/r/20251104181114.489391-1-calvin@wbinvd.org Signed-off-by: Tejun Heo <tj@kernel.org>
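The deferral idea can be sketched with a lock-free list in C11. This is an assumption-laden illustration: the commit only states that the work is deferred through lazy irq_work, so the list-based hand-off, `struct demo_task`, and both functions below are hypothetical and stand in for whatever mechanism the patch actually uses.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a dead task awaiting cgroup unlinking. */
struct demo_task {
    int pid;
    struct demo_task *next;
};

/* Head of the pending list; zero-initialized to NULL at file scope. */
_Atomic(struct demo_task *) demo_dead_list;

/*
 * Called from the context that must not sleep: push the task onto a
 * lock-free list instead of taking a lock that may sleep.
 */
void demo_task_dead(struct demo_task *t)
{
    struct demo_task *old = atomic_load(&demo_dead_list);

    do {
        t->next = old;
    } while (!atomic_compare_exchange_weak(&demo_dead_list, &old, t));
}

/*
 * Called later from a context that may take the lock: detach the whole
 * list in one step and process each entry. Returns the number handled.
 */
int demo_drain_dead_tasks(void)
{
    struct demo_task *t = atomic_exchange(&demo_dead_list,
                                          (struct demo_task *)NULL);
    int handled = 0;

    while (t) {
        struct demo_task *next = t->next;

        /* Real code would unlink the task under the lock here. */
        handled++;
        t = next;
    }
    return handled;
}
```

Detaching the whole list at once lets several dead tasks be handled in one pass, which is the batching benefit the commit attributes to the lazy variant of irq_work.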
2025-11-07tracing: tprobe-events: Fix to put tracepoint_user when disable the tprobeMasami Hiramatsu (Google)-0/+4
__unregister_trace_fprobe() checks tf->tuser to put it when removing tprobe. However, disable_trace_fprobe() does not use it and only calls unregister_fprobe(). Thus it forgets to disable tracepoint_user. If the trace_fprobe has tuser, put it for unregistering the tracepoint callbacks when disabling tprobe correctly. Link: https://lore.kernel.org/all/176244794466.155515.3971904050506100243.stgit@devnote2/ Fixes: 2867495dea86 ("tracing: tprobe-events: Register tracepoint when enable tprobe event") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Tested-by: Beau Belgrave <beaub@linux.microsoft.com> Reviewed-by: Beau Belgrave <beaub@linux.microsoft.com>
2025-11-07tracing: tprobe-events: Fix to register tracepoint correctlyMasami Hiramatsu (Google)-1/+2
Since __tracepoint_user_init() calls tracepoint_user_register() without initializing tuser->tpoint with the given tracepoint, it does not register the tracepoint stub function as a callback correctly, and tprobe does not work. Initialize tuser->tpoint correctly before tracepoint_user_register() so that it sets up the tracepoint callback. I confirmed that the example below works again. echo "t sched_switch preempt prev_pid=prev->pid next_pid=next->pid" > /sys/kernel/tracing/dynamic_events echo 1 > /sys/kernel/tracing/events/tracepoints/sched_switch/enable cat /sys/kernel/tracing/trace_pipe Link: https://lore.kernel.org/all/176244793514.155515.6466348656998627773.stgit@devnote2/ Fixes: 2867495dea86 ("tracing: tprobe-events: Register tracepoint when enable tprobe event") Reported-by: Beau Belgrave <beaub@linux.microsoft.com> Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Tested-by: Beau Belgrave <beaub@linux.microsoft.com> Reviewed-by: Beau Belgrave <beaub@linux.microsoft.com>
2025-11-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski-10/+22
Cross-merge networking fixes after downstream PR (net-6.18-rc5). Conflicts: drivers/net/wireless/ath/ath12k/mac.c 9222582ec524 ("Revert "wifi: ath12k: Fix missing station power save configuration"") 6917e268c433 ("wifi: ath12k: Defer vdev bring-up until CSA finalize to avoid stale beacon") https://lore.kernel.org/11cece9f7e36c12efd732baa5718239b1bf8c950.camel@sipsolutions.net Adjacent changes: drivers/net/ethernet/intel/Kconfig b1d16f7c0063 ("libie: depend on DEBUG_FS when building LIBIE_FWLOG") 93f53db9f9dc ("ice: switch to Page Pool") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-06futex: Optimize per-cpu reference countingPeter Zijlstra-6/+6
Shrikanth noted that the per-cpu reference counter was still some 10% slower than the old immutable option (which removes the reference counting entirely). Further optimize the per-cpu reference counter by: - switching from RCU to preempt; - using __this_cpu_*() since we now have preempt disabled; - switching from smp_load_acquire() to READ_ONCE(). This is all safe because disabling preemption inhibits the RCU grace period exactly like rcu_read_lock(). Having preemption disabled allows using __this_cpu_*() provided the only access to the variable is in task context -- which is the case here. Furthermore, since we know changing fph->state to FR_ATOMIC demands a full RCU grace period we can rely on the implied smp_mb() from that to replace the acquire barrier. This is very similar to the percpu_down_read_internal() fast-path. The reason this is significant for PowerPC is that it uses the generic this_cpu_*() implementation which relies on local_irq_disable() (the x86 implementation relies on it being a single memop instruction to be IRQ-safe). Switching to preempt_disable() and __this_cpu*() avoids this IRQ state swizzling. Also, PowerPC needs LWSYNC for the ACQUIRE barrier, so not having to use explicit barriers saves a bunch. Combined this reduces the performance gap by half, down to some 5%. Fixes: 760e6f7befba ("futex: Remove support for IMMUTABLE") Reported-by: Shrikanth Hegde <sshegde@linux.ibm.com> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20251106092929.GR4067720@noisy.programming.kicks-ass.net