summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)AuthorLines
2026-02-26Merge tag 'dma-mapping-7.0-2026-02-26' of ↵Linus Torvalds-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping fixes from Marek Szyprowski: "Two DMA-mapping fixes for the recently merged API rework (Jiri Pirko and Stian Halseth)" * tag 'dma-mapping-7.0-2026-02-26' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: sparc: Fix page alignment in dma mapping dma-mapping: avoid random addr value print out on error path
2026-02-26sched_ext: Fix SCX_EFLAG_INITIALIZED being a no-op flagDavid Carlier-1/+1
SCX_EFLAG_INITIALIZED is the sole member of enum scx_exit_flags with no explicit value, so the compiler assigns it 0. This makes the bitwise OR in scx_ops_init() a no-op: sch->exit_info->flags |= SCX_EFLAG_INITIALIZED; /* |= 0 */ As a result, BPF schedulers cannot distinguish whether ops.init() completed successfully by inspecting exit_info->flags. Assign the value 1LLU << 0 so the flag is actually set. Fixes: f3aec2adce8d ("sched_ext: Add SCX_EFLAG_INITIALIZED to indicate successful ops.init()") Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-26bpf: Fix stack-out-of-bounds write in devmapKohei Enju-5/+17
get_upper_ifindexes() iterates over all upper devices and writes their indices into an array without checking bounds. Also the callers assume that the max number of upper devices is MAX_NEST_DEV and allocate excluded_devices[1+MAX_NEST_DEV] on the stack, but that assumption is not correct and the number of upper devices could be larger than MAX_NEST_DEV (e.g., many macvlans), causing a stack-out-of-bounds write. Add a max parameter to get_upper_ifindexes() to avoid the issue. When there are too many upper devices, return -EOVERFLOW and abort the redirect. To reproduce, create more than MAX_NEST_DEV(8) macvlans on a device with an XDP program attached using BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS. Then send a packet to the device to trigger the XDP redirect path. Reported-by: syzbot+10cc7f13760b31bd2e61@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/698c4ce3.050a0220.340abe.000b.GAE@google.com/T/ Fixes: aeea1b86f936 ("bpf, devmap: Exclude XDP broadcast to master device") Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Kohei Enju <kohei@enjuk.jp> Link: https://lore.kernel.org/r/20260225053506.4738-1-kohei@enjuk.jp Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-26bpf: Fix kprobe_multi cookies access in show_fdinfo callbackJiri Olsa-1/+3
We don't check if cookies are available on the kprobe_multi link before accessing them in show_fdinfo callback, we should. Cc: stable@vger.kernel.org Fixes: da7e9c0a7fbc ("bpf: Add show_fdinfo for kprobe_multi") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260225111249.186230-1-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-26Merge tag 'kmalloc_obj-v7.0-rc2' of ↵Linus Torvalds-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull kmalloc_obj fixes from Kees Cook: - Fix pointer-to-array allocation types for ubd and kcsan - Force size overflow helpers to __always_inline - Bump __builtin_counted_by_ref to Clang 22.1 from 22.0 (Nathan Chancellor) * tag 'kmalloc_obj-v7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: kcsan: test: Adjust "expect" allocation type for kmalloc_obj overflow: Make sure size helpers are always inlined init/Kconfig: Adjust fixed clang version for __builtin_counted_by_ref ubd: Use pointer-to-pointers for io_thread_req arrays
2026-02-26kcsan: test: Adjust "expect" allocation type for kmalloc_objKees Cook-1/+1
The call to kmalloc_obj(observed.lines) returns "char (*)[3][512]", a pointer to the whole 2D array. But "expect" wants to be "char (*)[512]", the decayed pointer type, as if it were observed.lines itself (though without the "3" bounds). This produces the following build error: ../kernel/kcsan/kcsan_test.c: In function '__report_matches': ../kernel/kcsan/kcsan_test.c:171:16: error: assignment to 'char (*)[512]' from incompatible pointer type 'char (*)[3][512]' [-Wincompatible-pointer-types] 171 | expect = kmalloc_obj(observed.lines); | ^ Instead of changing the "expect" type to "char (*)[3][512]" and requiring a dereference at each use (e.g. "(expect*)[0]"), just explicitly cast the return to the desired type. Note that I'm intentionally not switching back to byte-based "kmalloc" here because I cannot find a way for the Coccinelle script (which will be used going forward to catch future conversions) to exclude this case. Tested with: $ ./tools/testing/kunit/kunit.py run \ --kconfig_add CONFIG_DEBUG_KERNEL=y \ --kconfig_add CONFIG_KCSAN=y \ --kconfig_add CONFIG_KCSAN_KUNIT_TEST=y \ --arch=x86_64 --qemu_args '-smp 2' kcsan Reported-by: Nathan Chancellor <nathan@kernel.org> Fixes: 69050f8d6d07 ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types") Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-26kthread: consolidate kthread exit paths to prevent use-after-freeChristian Brauner-36/+11
Guillaume reported crashes via corrupted RCU callback function pointers during KUnit testing. The crash was traced back to the pidfs rhashtable conversion which replaced the 24-byte rb_node with an 8-byte rhash_head in struct pid, shrinking it from 160 to 144 bytes. struct kthread (without CONFIG_BLK_CGROUP) is also 144 bytes. With CONFIG_SLAB_MERGE_DEFAULT and SLAB_HWCACHE_ALIGN both round up to 192 bytes and share the same slab cache. struct pid.rcu.func and struct kthread.affinity_node both sit at offset 0x78. When a kthread exits via make_task_dead() it bypasses kthread_exit() and misses the affinity_node cleanup. free_kthread_struct() frees the memory while the node is still linked into the global kthread_affinity_list. A subsequent list_del() by another kthread writes through dangling list pointers into the freed and reused memory, corrupting the pid's rcu.func pointer. Instead of patching free_kthread_struct() to handle the missed cleanup, consolidate all kthread exit paths. Turn kthread_exit() into a macro that calls do_exit() and add kthread_do_exit() which is called from do_exit() for any task with PF_KTHREAD set. This guarantees that kthread-specific cleanup always happens regardless of the exit path - make_task_dead(), direct do_exit(), or kthread_exit(). Replace __to_kthread() with a new tsk_is_kthread() accessor in the public header. Export do_exit() since module code using the kthread_exit() macro now needs it directly. Reported-by: Guillaume Tucker <gtucker@gtucker.io> Tested-by: Guillaume Tucker <gtucker@gtucker.io> Tested-by: Mark Brown <broonie@kernel.org> Tested-by: David Gow <davidgow@google.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/all/20260224-mittlerweile-besessen-2738831ae7f6@brauner Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org> Fixes: 4d13f4304fa4 ("kthread: Implement preferred affinity") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-02-25sched_ext: Fix out-of-bounds access in scx_idle_init_masks()David Carlier-2/+2
scx_idle_node_masks is allocated with num_possible_nodes() elements but indexed by NUMA node IDs via for_each_node(). On systems with non-contiguous NUMA node numbering (e.g. nodes 0 and 4), node IDs can exceed the array size, causing out-of-bounds memory corruption. Use nr_node_ids instead, which represents the maximum node ID range and is the correct size for arrays indexed by node ID. Fixes: 7c60329e3521 ("sched_ext: Add NUMA-awareness to the default idle selection policy") Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-25Merge tag 'vfs-7.0-rc2.fixes' of ↵Linus Torvalds-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix an uninitialized variable in file_getattr(). The flags_valid field wasn't initialized before calling vfs_fileattr_get(), triggering KMSAN uninit-value reports in fuse - Fix writeback wakeup and logging timeouts when DETECT_HUNG_TASK is not enabled. sysctl_hung_task_timeout_secs is 0 in that case causing spurious "waiting for writeback completion for more than 1 seconds" warnings - Fix a null-ptr-deref in do_statmount() when the mount is internal - Add missing kernel-doc description for the @private parameter in iomap_readahead() - Fix mount namespace creation to hold namespace_sem across the mount copy in create_new_namespace(). The previous drop-and-reacquire pattern was fragile and failed to clean up mount propagation links if the real rootfs was a shared or dependent mount - Fix /proc mount iteration where m->index wasn't updated when m->show() overflows, causing a restart to repeatedly show the same mount entry in a rapidly expanding mount table - Return EFSCORRUPTED instead of ENOSPC in minix_new_inode() when the inode number is out of range - Fix unshare(2) when CLONE_NEWNS is set and current->fs isn't shared. copy_mnt_ns() received the live fs_struct so if a subsequent namespace creation failed the rollback would leave pwd and root pointing to detached mounts. Always allocate a new fs_struct when CLONE_NEWNS is requested - fserror bug fixes: - Remove the unused fsnotify_sb_error() helper now that all callers have been converted to fserror_report_metadata - Fix a lockdep splat in fserror_report() where igrab() takes inode::i_lock which can be held in IRQ context. Replace igrab() with a direct i_count bump since filesystems should not report inodes that are about to be freed or not yet exposed - Handle error pointer in procfs for try_lookup_noperm() - Fix an integer overflow in ep_loop_check_proc() where recursive calls returning INT_MAX would overflow when +1 is added, breaking the recursion depth check - Fix a misleading break in pidfs * tag 'vfs-7.0-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: pidfs: avoid misleading break eventpoll: Fix integer overflow in ep_loop_check_proc() proc: Fix pointer error dereference fserror: fix lockdep complaint when igrabbing inode fsnotify: drop unused helper unshare: fix unshare_fs() handling minix: Correct errno in minix_new_inode namespace: fix proc mount iteration mount: hold namespace_sem across copy in create_new_namespace() iomap: Describe @private in iomap_readahead() statmount: Fix the null-ptr-deref in do_statmount() writeback: Fix wakeup and logging timeouts for !DETECT_HUNG_TASK fs: init flags_valid before calling vfs_fileattr_get
2026-02-25cgroup/cpuset: fix null-ptr-deref in rebuild_sched_domains_cpuslockedChen Ridong-1/+1
A null-pointer-dereference bug was reported by syzbot: Oops: general protection fault, probably for address 0xdffffc0000000000: KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] RIP: 0010:bitmap_subset include/linux/bitmap.h:433 [inline] RIP: 0010:cpumask_subset include/linux/cpumask.h:836 [inline] RIP: 0010:rebuild_sched_domains_locked kernel/cgroup/cpuset.c:967 RSP: 0018:ffffc90003ecfbc0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000020 RDX: ffff888028de0000 RSI: ffffffff8200f003 RDI: ffffffff8df14f28 RBP: 0000000000000000 R08: 0000000000000cc0 R09: 00000000ffffffff R10: ffffffff8e7d95b3 R11: 0000000000000001 R12: 0000000000000000 R13: 00000000000f4240 R14: dffffc0000000000 R15: 0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f463fff CR3: 000000003704c000 CR4: 00000000003526f0 Call Trace: <TASK> rebuild_sched_domains_cpuslocked kernel/cgroup/cpuset.c:983 [inline] rebuild_sched_domains+0x21/0x40 kernel/cgroup/cpuset.c:990 sched_rt_handler+0xb5/0xe0 kernel/sched/rt.c:2911 proc_sys_call_handler+0x47f/0x5a0 fs/proc/proc_sysctl.c:600 new_sync_write fs/read_write.c:595 [inline] vfs_write+0x6ac/0x1070 fs/read_write.c:688 ksys_write+0x12a/0x250 fs/read_write.c:740 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f The issue occurs when generate_sched_domains() returns ndoms = 1 and doms = NULL due to a kmalloc failure. This leads to a null-pointer dereference when accessing doms in rebuild_sched_domains_locked(). Fix this by adding a NULL check for doms before accessing it. Fixes: 6ee43047e8ad ("cpuset: Remove unnecessary checks in rebuild_sched_domains_locked") Reported-by: syzbot+460792609a79c085f79f@syzkaller.appspotmail.com Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-25perf: Fix __perf_event_overflow() vs perf_remove_from_context() racePeter Zijlstra-1/+41
Make sure that __perf_event_overflow() runs with IRQs disabled for all possible callchains. Specifically the software events can end up running it with only preemption disabled. This opens up a race vs perf_event_exit_event() and friends that will go and free various things the overflow path expects to be present, like the BPF program. Fixes: 592903cdcbf6 ("perf_counter: add an event_list") Reported-by: Simond Hu <cmdhh1767@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Simond Hu <cmdhh1767@gmail.com> Link: https://patch.msgid.link/20260224122909.GV1395416@noisy.programming.kicks-ass.net
2026-02-24sched_ext: Disable preemption between scx_claim_exit() and kicking helper workTejun Heo-0/+12
scx_claim_exit() atomically sets exit_kind, which prevents scx_error() from triggering further error handling. After claiming exit, the caller must kick the helper kthread work which initiates bypass mode and teardown. If the calling task gets preempted between claiming exit and kicking the helper work, and the BPF scheduler fails to schedule it back (since error handling is now disabled), the helper work is never queued, bypass mode never activates, tasks stop being dispatched, and the system wedges. Disable preemption across scx_claim_exit() and the subsequent work kicking in all callers - scx_disable() and scx_vexit(). Add lockdep_assert_preemption_disabled() to scx_claim_exit() to enforce the requirement. Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class") Cc: stable@vger.kernel.org # v6.12+ Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-24liveupdate: luo_file: remember retrieve() statusPratyush Yadav (Google)-16/+25
LUO keeps track of successful retrieve attempts on a LUO file. It does so to avoid multiple retrievals of the same file. Multiple retrievals cause problems because once the file is retrieved, the serialized data structures are likely freed and the file is likely in a very different state from what the code expects. The retrieve boolean in struct luo_file keeps track of this, and is passed to the finish callback so it knows what work was already done and what it has left to do. All this works well when retrieve succeeds. When it fails, luo_retrieve_file() returns the error immediately, without ever storing anywhere that a retrieve was attempted or what its error code was. This results in an errored LIVEUPDATE_SESSION_RETRIEVE_FD ioctl to userspace, but nothing prevents it from trying this again. The retry is problematic for much of the same reasons listed above. The file is likely in a very different state than what the retrieve logic normally expects, and it might even have freed some serialization data structures. Attempting to access them or free them again is going to break things. For example, if memfd managed to restore 8 of its 10 folios, but fails on the 9th, a subsequent retrieve attempt will try to call kho_restore_folio() on the first folio again, and that will fail with a warning since it is an invalid operation. Apart from the retry, finish() also breaks. Since on failure the retrieved bool in luo_file is never touched, the finish() call on session close will tell the file handler that retrieve was never attempted, and it will try to access or free the data structures that might not exist, much in the same way as the retry attempt. There is no sane way of attempting the retrieve again. Remember the error retrieve returned and directly return it on a retry. Also pass this status code to finish() so it can make the right decision on the work it needs to do. This is done by changing the bool to an integer. A value of 0 means retrieve was never attempted, a positive value means it succeeded, and a negative value means it failed and the error code is the value. Link: https://lkml.kernel.org/r/20260216132221.987987-1-pratyush@kernel.org Fixes: 7c722a7f44e0 ("liveupdate: luo_file: implement file systems callbacks") Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-23cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lockWaiman Long-11/+44
The current cpuset partition code is able to dynamically update the sched domains of a running system and the corresponding HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentially the "isolcpus=domain,..." boot command line feature at run time. The housekeeping cpumask update requires flushing a number of different workqueues which may not be safe with cpus_read_lock() held as the workqueue flushing code may acquire cpus_read_lock() or acquiring locks which have locking dependency with cpus_read_lock() down the chain. Below is an example of such circular locking problem. ====================================================== WARNING: possible circular locking dependency detected 6.18.0-test+ #2 Tainted: G S ------------------------------------------------------ test_cpuset_prs/10971 is trying to acquire lock: ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180 but task is already holding lock: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #4 (cpuset_mutex){+.+.}-{4:4}: -> #3 (cpu_hotplug_lock){++++}-{0:0}: -> #2 (rtnl_mutex){+.+.}-{4:4}: -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}: -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}: Chain exists of: (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex 5 locks held by test_cpuset_prs/10971: #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0 #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0 #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0 #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130 #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130 Call Trace: <TASK> : touch_wq_lockdep_map+0x93/0x180 __flush_workqueue+0x111/0x10b0 housekeeping_update+0x12d/0x2d0 update_parent_effective_cpumask+0x595/0x2440 update_prstate+0x89d/0xce0 cpuset_partition_write+0xc5/0x130 cgroup_file_write+0x1a5/0x680 kernfs_fop_write_iter+0x3df/0x5f0 vfs_write+0x525/0xfd0 ksys_write+0xf9/0x1d0 do_syscall_64+0x95/0x520 entry_SYSCALL_64_after_hwframe+0x76/0x7e To avoid such a circular locking dependency problem, we have to call housekeeping_update() without holding the cpus_read_lock() and cpuset_mutex. The current set of wq's flushed by housekeeping_update() may not have work functions that call cpus_read_lock() directly, but we are likely to extend the list of wq's that are flushed in the future. Moreover, the current set of work functions may hold locks that may have cpu_hotplug_lock down the dependency chain. So housekeeping_update() is now called after releasing cpus_read_lock and cpuset_mutex at the end of a cpuset operation. These two locks are then re-acquired later before calling rebuild_sched_domains_locked(). To enable mutual exclusion between the housekeeping_update() call and other cpuset control file write actions, a new top level cpuset_top_mutex is introduced. This new mutex will be acquired first to allow sharing variables used by both code paths. However, cpuset update from CPU hotplug can still happen in parallel with the housekeeping_update() call, though that should be rare in production environment. As cpus_read_lock() is now no longer held when tmigr_isolated_exclude_cpumask() is called, it needs to acquire it directly. The lockdep_is_cpuset_held() is also updated to return true if either cpuset_top_mutex or cpuset_mutex is held. Fixes: 03ff73510169 ("cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-23cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueueWaiman Long-5/+26
The cpuset_handle_hotplug() may need to invoke housekeeping_update(), for instance, when an isolated partition is invalidated because its last active CPU has been put offline. As we are going to enable dynamic update to the nozh_full housekeeping cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug, allowing the CPU hotplug path to call into housekeeping_update() directly from update_isolation_cpumasks() will likely cause deadlock. So we have to defer any call to housekeeping_update() after the CPU hotplug operation has finished. This is now done via the workqueue where the update_hk_sched_domains() function will be invoked via the hk_sd_workfn(). An concurrent cpuset control file write may have executed the required update_hk_sched_domains() function before the work function is called. So the work function call may become a no-op when it is invoked. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-23cgroup/cpuset: Move housekeeping_update()/rebuild_sched_domains() togetherWaiman Long-27/+24
With the latest changes in sched/isolation.c, rebuild_sched_domains*() requires the HK_TYPE_DOMAIN housekeeping cpumask to be properly updated first, if needed, before the sched domains can be rebuilt. So the two naturally fit together. Do that by creating a new update_hk_sched_domains() helper to house both actions. The name of the isolated_cpus_updating flag to control the call to housekeeping_update() is now outdated. So change it to update_housekeeping to better reflect its purpose. Also move the call to update_hk_sched_domains() to the end of cpuset and hotplug operations before releasing the cpuset_mutex. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-23cgroup/cpuset: Set isolated_cpus_updating only if isolated_cpus is changedWaiman Long-3/+7
As cpuset is updating HK_TYPE_DOMAIN housekeeping mask when there is a change in the set of isolated CPUs, making this change is now more costly than before. Right now, the isolated_cpus_updating flag can be set even if there is no real change in isolated_cpus. Put in additional checks to make sure that isolated_cpus_updating is set only if there is a real change in isolated_cpus. Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-23cgroup/cpuset: Clarify exclusion rules for cpuset internal variablesWaiman Long-44/+61
Clarify the locking rules associated with file level internal variables inside the cpuset code. There is no functional change. Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-23cgroup/cpuset: Fix incorrect use of cpuset_update_tasks_cpumask() in ↵Waiman Long-1/+1
update_cpumasks_hier() Commit e2ffe502ba45 ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2") incorrectly changed the 2nd parameter of cpuset_update_tasks_cpumask() from tmp->new_cpus to cp->effective_cpus. This second parameter is just a temporary cpumask for internal use. The cpuset_update_tasks_cpumask() function was originally called update_tasks_cpumask() before commit 381b53c3b549 ("cgroup/cpuset: rename functions shared between v1 and v2"). This mistake can incorrectly change the effective_cpus of the cpuset when it is the top_cpuset or in arm64 architecture where task_cpu_possible_mask() may differ from cpu_possible_mask. So far top_cpuset hasn't been passed to update_cpumasks_hier() yet, but arm64 arch can still be impacted. Fix it by reverting the incorrect change. Fixes: e2ffe502ba45 ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-23cgroup/cpuset: Fix incorrect change to effective_xcpus in partition_xcpus_del()Waiman Long-1/+1
The effective_xcpus of a cpuset can contain offline CPUs. In partition_xcpus_del(), the xcpus parameter is incorrectly used as a temporary cpumask to mask out offline CPUs. As xcpus can be the effective_xcpus of a cpuset, this can result in unexpected changes in that cpumask. Fix this problem by not making any changes to the xcpus parameter. Fixes: 11e5f407b64a ("cgroup/cpuset: Keep track of CPUs in isolated partitions") Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-23module: Fix kernel panic when a symbol st_shndx is out of boundsIhor Solodrai-0/+7
The module loader doesn't check for bounds of the ELF section index in simplify_symbols(): for (i = 1; i < symsec->sh_size / sizeof(Elf_Sym); i++) { const char *name = info->strtab + sym[i].st_name; switch (sym[i].st_shndx) { case SHN_COMMON: [...] default: /* Divert to percpu allocation if a percpu var. */ if (sym[i].st_shndx == info->index.pcpu) secbase = (unsigned long)mod_percpu(mod); else /** HERE --> **/ secbase = info->sechdrs[sym[i].st_shndx].sh_addr; sym[i].st_value += secbase; break; } } A symbol with an out-of-bounds st_shndx value, for example 0xffff (known as SHN_XINDEX or SHN_HIRESERVE), may cause a kernel panic: BUG: unable to handle page fault for address: ... RIP: 0010:simplify_symbols+0x2b2/0x480 ... Kernel panic - not syncing: Fatal exception This can happen when module ELF is legitimately using SHN_XINDEX or when it is corrupted. Add a bounds check in simplify_symbols() to validate that st_shndx is within the valid range before using it. This issue was discovered due to a bug in llvm-objcopy, see relevant discussion for details [1]. [1] https://lore.kernel.org/linux-modules/20251224005752.201911-1-ihor.solodrai@linux.dev/ Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
2026-02-23Remove WARN_ALL_UNSEEDED_RANDOM kernel config optionLinus Torvalds-1/+0
This config option goes way back - it used to be an internal debug option to random.c (at that point called DEBUG_RANDOM_BOOT), then was renamed and exposed as a config option as CONFIG_WARN_UNSEEDED_RANDOM, and then further renamed to the current CONFIG_WARN_ALL_UNSEEDED_RANDOM. It was all done with the best of intentions: the more limited rate-limited reports were reporting some cases, but if you wanted to see all the gory details, you'd enable this "ALL" option. However, it turns out - perhaps not surprisingly - that when people don't care about and fix the first rate-limited cases, they most certainly don't care about any others either, and so warning about all of them isn't actually helping anything. And the non-ratelimited reporting causes problems, where well-meaning people enable debug options, but the excessive flood of messages that nobody cares about will hide actual real information when things go wrong. I just got a kernel bug report (which had nothing to do with randomness) where two thirds of the the truncated dmesg was just variations of random: get_random_u32 called from __get_random_u32_below+0x10/0x70 with crng_init=0 and in the process early boot messages had been lost (in addition to making the messages that _hadn't_ been lost harder to read). The proper way to find these things for the hypothetical developer that cares - if such a person exists - is almost certainly with boot time tracing. That gives you the option to get call graphs etc too, which is likely a requirement for fixing any problems anyway. See Documentation/trace/boottime-trace.rst for that option. And if we for some reason do want to re-introduce actual printing of these things, it will need to have some uniqueness filtering rather than this "just print it all" model. Fixes: cc1e127bfa95 ("random: remove ratelimiting for in-kernel unseeded randomness") Acked-by: Jason Donenfeld <Jason@zx2c4.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-23module: Fix the modversions and signing submenusPetr Pavlu-10/+13
The module Kconfig file contains a set of options related to "Module versioning support" (depends on MODVERSIONS) and "Module signature verification" (depends on MODULE_SIG). The Kconfig tool automatically creates submenus when an entry for a symbol is followed by consecutive items that all depend on the symbol. However, this functionality doesn't work for the mentioned module options. The MODVERSIONS options are interleaved with ASM_MODVERSIONS, which has no 'depends on MODVERSIONS' but instead uses 'default HAVE_ASM_MODVERSIONS && MODVERSIONS'. Similarly, the MODULE_SIG options are interleaved by a comment warning not to forget signing modules with scripts/sign-file, which uses the condition 'depends on MODULE_SIG_FORCE && !MODULE_SIG_ALL'. The result is that the options are confusingly shown when using a menuconfig tool, as follows: [*] Module versioning support Module versioning implementation (genksyms (from source code)) ---> [ ] Extended Module Versioning Support [*] Basic Module Versioning Support [*] Source checksum for all modules [*] Module signature verification [ ] Require modules to be validly signed [ ] Automatically sign all modules Hash algorithm to sign modules (SHA-256) ---> Fix the issue by using if/endif to group related options together in kernel/module/Kconfig, similarly to how the MODULE_DEBUG options are already grouped. Note that the signing-related options depend on 'MODULE_SIG || IMA_APPRAISE_MODSIG', with the exception of MODULE_SIG_FORCE, which is valid only for MODULE_SIG and is therefore kept separately. For consistency, do the same for the MODULE_COMPRESS entries. The options are then properly placed into submenus, as follows: [*] Module versioning support Module versioning implementation (genksyms (from source code)) ---> [ ] Extended Module Versioning Support [*] Basic Module Versioning Support [*] Source checksum for all modules [*] Module signature verification [ ] Require modules to be validly signed [ ] Automatically sign all modules Hash algorithm to sign modules (SHA-256) ---> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
2026-02-23module: Remove duplicate freeing of lockdep classesPetr Pavlu-6/+0
In the error path of load_module(), under the free_module label, the code calls lockdep_free_key_range() to release lock classes associated with the MOD_DATA, MOD_RODATA and MOD_RO_AFTER_INIT module regions, and subsequently invokes module_deallocate(). Since commit ac3b43283923 ("module: replace module_layout with module_memory"), the module_deallocate() function calls free_mod_mem(), which releases the lock classes as well and considers all module regions. Attempting to free these classes twice is unnecessary. Remove the redundant code in load_module(). Fixes: ac3b43283923 ("module: replace module_layout with module_memory") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Aaron Tomlin <atomlin@atomlin.com> Acked-by: Song Liu <song@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
2026-02-23perf/core: Fix refcount bug and potential UAF in perf_mmapHaocheng Yu-19/+19
Syzkaller reported a refcount_t: addition on 0; use-after-free warning in perf_mmap. The issue is caused by a race condition between a failing mmap() setup and a concurrent mmap() on a dependent event (e.g., using output redirection). In perf_mmap(), the ring_buffer (rb) is allocated and assigned to event->rb with the mmap_mutex held. The mutex is then released to perform map_range(). If map_range() fails, perf_mmap_close() is called to clean up. However, since the mutex was dropped, another thread attaching to this event (via inherited events or output redirection) can acquire the mutex, observe the valid event->rb pointer, and attempt to increment its reference count. If the cleanup path has already dropped the reference count to zero, this results in a use-after-free or refcount saturation warning. Fix this by extending the scope of mmap_mutex to cover the map_range() call. This ensures that the ring buffer initialization and mapping (or cleanup on failure) happens atomically effectively, preventing other threads from accessing a half-initialized or dying ring buffer. Closes: https://lore.kernel.org/oe-kbuild-all/202602020208.m7KIjdzW-lkp@intel.com/ Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Haocheng Yu <yuhaocheng035@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260202162057.7237-1-yuhaocheng035@gmail.com
2026-02-23perf/core: Fix invalid wait context in ctx_sched_in()Namhyung Kim-1/+2
Lockdep found a bug in the event scheduling when a pinned event was failed and wakes up the threads in the ring buffer like below. It seems it should not grab a wait-queue lock under perf-context lock. Let's do it with irq_work. [ 39.913691] ============================= [ 39.914157] [ BUG: Invalid wait context ] [ 39.914623] 6.15.0-next-20250530-next-2025053 #1 Not tainted [ 39.915271] ----------------------------- [ 39.915731] repro/837 is trying to lock: [ 39.916191] ffff88801acfabd8 (&event->waitq){....}-{3:3}, at: __wake_up+0x26/0x60 [ 39.917182] other info that might help us debug this: [ 39.917761] context-{5:5} [ 39.918079] 4 locks held by repro/837: [ 39.918530] #0: ffffffff8725cd00 (rcu_read_lock){....}-{1:3}, at: __perf_event_task_sched_in+0xd1/0xbc0 [ 39.919612] #1: ffff88806ca3c6f8 (&cpuctx_lock){....}-{2:2}, at: __perf_event_task_sched_in+0x1a7/0xbc0 [ 39.920748] #2: ffff88800d91fc18 (&ctx->lock){....}-{2:2}, at: __perf_event_task_sched_in+0x1f9/0xbc0 [ 39.921819] #3: ffffffff8725cd00 (rcu_read_lock){....}-{1:3}, at: perf_event_wakeup+0x6c/0x470 Fixes: f4b07fd62d4d ("perf/core: Use POLLHUP for a pinned event in error") Closes: https://lore.kernel.org/lkml/aD2w50VDvGIH95Pf@ly-workstation Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com> Link: https://patch.msgid.link/20250603045105.1731451-1-namhyung@kernel.org
2026-02-23rseq: slice ext: Ensure rseq feature size differs from original rseq sizeMathieu Desnoyers-1/+2
Before rseq became extensible, its original size was 32 bytes even though the active rseq area was only 20 bytes. This had the following impact in terms of userspace ecosystem evolution: * The GNU libc between 2.35 and 2.39 expose a __rseq_size symbol set to 32, even though the size of the active rseq area is really 20. * The GNU libc 2.40 changes this __rseq_size to 20, thus making it express the active rseq area. * Starting from glibc 2.41, __rseq_size corresponds to the AT_RSEQ_FEATURE_SIZE from getauxval(3). This means that users of __rseq_size can always expect it to correspond to the active rseq area, except for the value 32, for which the active rseq area is 20 bytes. Exposing a 32 bytes feature size would make life needlessly painful for userspace. Therefore, add a reserved field at the end of the rseq area to bump the feature size to 33 bytes. This reserved field is expected to be replaced with whatever field will come next, expecting that this field will be larger than 1 byte. The effect of this change is to increase the size from 32 to 64 bytes before we actually have fields using that memory. Clarify the allocation size and alignment requirements in the struct rseq uapi comment. Change the value returned by getauxval(AT_RSEQ_ALIGN) to return the value of the active rseq area size rounded up to next power of 2, which guarantees that the rseq structure will always be aligned on the nearest power of two large enough to contain it, even as it grows. Change the alignment check in the rseq registration accordingly. This will minimize the amount of ABI corner-cases we need to document and require userspace to play games with. The rule stays simple when __rseq_size != 32: #define rseq_field_available(field) (__rseq_size >= offsetofend(struct rseq_abi, field)) Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260220200642.1317826-3-mathieu.desnoyers@efficios.com
2026-02-23rseq: Clarify rseq registration rseq_size bound check commentMathieu Desnoyers-2/+3
The rseq registration validates that the rseq_size argument is greater or equal to 32 (the original rseq size), but the comment associated with this check does not clearly state this. Clarify the comment to that effect. Fixes: ee3e3ac05c26 ("rseq: Introduce extensible rseq ABI") Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260220200642.1317826-2-mathieu.desnoyers@efficios.com
2026-02-23sched/core: Fix wakeup_preempt's next_class trackingPeter Zijlstra-4/+16
Kernel test robot reported that tools/testing/selftests/kvm/hardware_disable_test was failing due to commit 704069649b5b ("sched/core: Rework sched_class::wakeup_preempt() and rq_modified_*()") It turns out there were two related problems that could lead to a missed preemption: - when hitting newidle balance from the idle thread, it would elevate rb->next_class from &idle_sched_class to &fair_sched_class, causing later wakeup_preempt() calls to not hit the sched_class_above() case, and not issue resched_curr(). Notably, this modification pattern should only lower the next_class, and never raise it. Create two new helper functions to wrap this. - when doing schedule_idle(), it was possible to miss (re)setting rq->next_class to &idle_sched_class, leading to the very same problem. Cc: Sean Christopherson <seanjc@google.com> Fixes: 704069649b5b ("sched/core: Rework sched_class::wakeup_preempt() and rq_modified_*()") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202602122157.4e861298-lkp@intel.com Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260218163329.GQ1395416@noisy.programming.kicks-ass.net
2026-02-23sched/fair: Fix lag clampPeter Zijlstra-4/+35
Vincent reported that he was seeing undue lag clamping in a mixed slice workload. Implement the max_slice tracking as per the todo comment. Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy") Reported-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com> Link: https://patch.msgid.link/20250422101628.GA33555@noisy.programming.kicks-ass.net
2026-02-23sched/eevdf: Update se->vprot in reweight_entity()Wang Tao-0/+12
In the EEVDF framework with Run-to-Parity protection, `se->vprot` is an independent variable defining the virtual protection timestamp. When `reweight_entity()` is called (e.g., via nice/renice), it performs the following actions to preserve Lag consistency: 1. Scales `se->vlag` based on the new weight. 2. Calls `place_entity()`, which recalculates `se->vruntime` based on the new weight and scaled lag. However, the current implementation fails to update `se->vprot`, leading to mismatches between the task's actual runtime and its expected duration. Fixes: 63304558ba5d ("sched/eevdf: Curb wakeup-preemption") Suggested-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Wang Tao <wangtao554@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com> Link: https://patch.msgid.link/20260120123113.3518950-1-wangtao554@huawei.com
2026-02-23sched/fair: Only set slice protection at pick timePeter Zijlstra-5/+6
We should not (re)set slice protection in the sched_change pattern which calls put_prev_task() / set_next_task(). Fixes: 63304558ba5d ("sched/eevdf: Curb wakeup-preemption") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com> Link: https://patch.msgid.link/20260219080624.561421378%40infradead.org
2026-02-23sched/fair: Fix zero_vruntime trackingPeter Zijlstra-27/+57
It turns out that zero_vruntime tracking is broken when there is but a single task running. Current update paths are through __{en,de}queue_entity(), and when there is but a single task, pick_next_task() will always return that one task, and put_prev_set_next_task() will end up in neither function. This can cause entity_key() to grow indefinitely large and cause overflows, leading to much pain and suffering. Furtermore, doing update_zero_vruntime() from __{de,en}queue_entity(), which are called from {set_next,put_prev}_entity() has problems because: - set_next_entity() calls __dequeue_entity() before it does cfs_rq->curr = se. This means the avg_vruntime() will see the removal but not current, missing the entity for accounting. - put_prev_entity() calls __enqueue_entity() before it does cfs_rq->curr = NULL. This means the avg_vruntime() will see the addition *and* current, leading to double accounting. Both cases are incorrect/inconsistent. Noting that avg_vruntime is already called on each {en,de}queue, remove the explicit avg_vruntime() calls (which removes an extra 64bit division for each {en,de}queue) and have avg_vruntime() update zero_vruntime itself. Additionally, have the tick call avg_vruntime() -- discarding the result, but for the side-effect of updating zero_vruntime. While there, optimize avg_vruntime() by noting that the average of one value is rather trivial to compute. Test case: # taskset -c -p 1 $$ # taskset -c 2 bash -c 'while :; do :; done&' # cat /sys/kernel/debug/sched/debug | awk '/^cpu#/ {P=0} /^cpu#2,/ {P=1} {if (P) print $0}' | grep -e zero_vruntime -e "^>" PRE: .zero_vruntime : 31316.407903 >R bash 487 50787.345112 E 50789.145972 2.800000 50780.298364 16 120 0.000000 0.000000 0.000000 / .zero_vruntime : 382548.253179 >R bash 487 427275.204288 E 427276.003584 2.800000 427268.157540 23 120 0.000000 0.000000 0.000000 / POST: .zero_vruntime : 17259.709467 >R bash 526 17259.709467 E 17262.509467 2.800000 16915.031624 9 120 0.000000 0.000000 0.000000 / .zero_vruntime : 18702.723356 >R bash 526 18702.723356 E 18705.523356 2.800000 18358.045513 9 120 0.000000 0.000000 0.000000 / Fixes: 79f3f9bedd14 ("sched/eevdf: Fix min_vruntime vs avg_vruntime") Reported-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com> Link: https://patch.msgid.link/20260219080624.438854780%40infradead.org
2026-02-23dma-mapping: avoid random addr value print out on error pathJiri Pirko-1/+1
dma_addr is unitialized in dma_direct_map_phys() when swiotlb is forced and DMA_ATTR_MMIO is set which leads to random value print out in warning. Fix that by just returning DMA_MAPPING_ERROR. Fixes: e53d29f957b3 ("dma-mapping: convert dma_direct_*map_page to be phys_addr_t based") Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260209153809.250835-2-jiri@resnulli.us
2026-02-22Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL usesKees Cook-21/+18
Conversion performed via this Coccinelle script: // SPDX-License-Identifier: GPL-2.0-only // Options: --include-headers-for-types --all-includes --include-headers --keep-comments virtual patch @gfp depends on patch && !(file in "tools") && !(file in "samples")@ identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex, kzalloc_obj,kzalloc_objs,kzalloc_flex, kvmalloc_obj,kvmalloc_objs,kvmalloc_flex, kvzalloc_obj,kvzalloc_objs,kvzalloc_flex}; @@ ALLOC(... - , GFP_KERNEL ) $ make coccicheck MODE=patch COCCI=gfp.cocci Build and boot tested x86_64 with Fedora 42's GCC and Clang: Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21Convert more 'alloc_obj' cases to default GFP_KERNEL argumentsLinus Torvalds-32/+16
This converts some of the visually simpler cases that have been split over multiple lines. I only did the ones that are easy to verify the resulting diff by having just that final GFP_KERNEL argument on the next line. Somebody should probably do a proper coccinelle script for this, but for me the trivial script actually resulted in an assertion failure in the middle of the script. I probably had made it a bit _too_ trivial. So after fighting that far a while I decided to just do some of the syntactically simpler cases with variations of the previous 'sed' scripts. The more syntactically complex multi-line cases would mostly really want whitespace cleanup anyway. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21Convert 'alloc_flex' family to use the new default GFP_KERNEL argumentLinus Torvalds-20/+20
This is the exact same thing as the 'alloc_obj()' version, only much smaller because there are a lot fewer users of the *alloc_flex() interface. As with alloc_obj() version, this was done entirely with mindless brute force, using the same script, except using 'flex' in the pattern rather than 'objs*'. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21Convert 'alloc_obj' family to use the new default GFP_KERNEL argumentLinus Torvalds-337/+337
This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21treewide: Replace kmalloc with kmalloc_obj for non-scalar typesKees Cook-551/+543
This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-20Merge tag 'trace-v7.0-2' of ↵Linus Torvalds-5/+23
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix possible dereference of uninitialized pointer When validating the persistent ring buffer on boot up, if the first validation fails, a reference to "head_page" is performed in the error path, but it skips over the initialization of that variable. Move the initialization before the first validation check. - Fix use of event length in validation of persistent ring buffer On boot up, the persistent ring buffer is checked to see if it is valid by several methods. One being to walk all the events in the memory location to make sure they are all valid. The length of the event is used to move to the next event. This length is determined by the data in the buffer. If that length is corrupted, it could possibly make the next event to check located at a bad memory location. Validate the length field of the event when doing the event walk. - Fix function graph on archs that do not support use of ftrace_ops When an architecture defines HAVE_DYNAMIC_FTRACE_WITH_ARGS, it means that its function graph tracer uses the ftrace_ops of the function tracer to call its callbacks. This allows a single registered callback to be called directly instead of checking the callback's meta data's hash entries against the function being traced. For architectures that do not support this feature, it must always call the loop function that tests each registered callback (even if there's only one). The loop function tests each callback's meta data against its hash of functions and will call its callback if the function being traced is in its hash map. The issue was that there was no check against this and the direct function was being called even if the architecture didn't support it. This meant that if function tracing was enabled at the same time as a callback was registered with the function graph tracer, its callback would be called for every function that the function tracer also traced, even if the callback's meta data only wanted to be called back for a small subset of functions. Prevent the direct calling for those architectures that do not support it. - Fix references to trace_event_file for hist files The hist files used event_file_data() to get a reference to the associated trace_event_file the histogram was attached to. This would return a pointer even if the trace_event_file is about to be freed (via RCU). Instead it should use the event_file_file() helper that returns NULL if the trace_event_file is marked to be freed so that no new references are added to it. - Wake up hist poll readers when an event is being freed When polling on a hist file, the task is only awoken when a hist trigger is triggered. This means that if an event is being freed while there's a task waiting on its hist file, it will need to wait until the hist trigger occurs to wake it up and allow the freeing to happen. Note, the event will not be completely freed until all references are removed, and a hist poller keeps a reference. But it should still be woken when the event is being freed. * tag 'trace-v7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Wake up poll waiters for hist files when removing an event tracing: Fix checking of freed trace_event_file for hist files fgraph: Do not call handlers direct when not using ftrace_ops tracing: ring-buffer: Fix to check event length before using ring-buffer: Fix possible dereference of uninitialized pointer
2026-02-19tracing: Wake up poll waiters for hist files when removing an eventPetr Pavlu-0/+3
The event_hist_poll() function attempts to verify whether an event file is being removed, but this check may not occur or could be unnecessarily delayed. This happens because hist_poll_wakeup() is currently invoked only from event_hist_trigger() when a hist command is triggered. If the event file is being removed, no associated hist command will be triggered and a waiter will be woken up only after an unrelated hist command is triggered. Fix the issue by adding a call to hist_poll_wakeup() in remove_event_file_dir() after setting the EVENT_FILE_FL_FREED flag. This ensures that a task polling on a hist file is woken up and receives EPOLLERR. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tom Zanussi <zanussi@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://patch.msgid.link/20260219162737.314231-3-petr.pavlu@suse.com Fixes: 1bd13edbbed6 ("tracing/hist: Add poll(POLLIN) support on hist file") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-19tracing: Fix checking of freed trace_event_file for hist filesPetr Pavlu-2/+2
The event_hist_open() and event_hist_poll() functions currently retrieve a trace_event_file pointer from a file struct by invoking event_file_data(), which simply returns file->f_inode->i_private. The functions then check if the pointer is NULL to determine whether the event is still valid. This approach is flawed because i_private is assigned when an eventfs inode is allocated and remains set throughout its lifetime. Instead, the code should call event_file_file(), which checks for EVENT_FILE_FL_FREED. Using the incorrect access function may result in the code potentially opening a hist file for an event that is being removed or becoming stuck while polling on this file. Correct the access method to event_file_file() in both functions. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tom Zanussi <zanussi@kernel.org> Link: https://patch.msgid.link/20260219162737.314231-2-petr.pavlu@suse.com Fixes: 1bd13edbbed6 ("tracing/hist: Add poll(POLLIN) support on hist file") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-19fgraph: Do not call handlers direct when not using ftrace_opsSteven Rostedt-1/+11
The function graph tracer was modified to us the ftrace_ops of the function tracer. This simplified the code as well as allowed more features of the function graph tracer. Not all architectures were converted over as it required the implementation of HAVE_DYNAMIC_FTRACE_WITH_ARGS to implement. For those architectures, it still did it the old way where the function graph tracer handle was called by the function tracer trampoline. The handler then had to check the hash to see if the registered handlers wanted to be called by that function or not. In order to speed up the function graph tracer that used ftrace_ops, if only one callback was registered with function graph, it would call its function directly via a static call. Now, if the architecture does not support the use of using ftrace_ops and still has the ftrace function trampoline calling the function graph handler, then by doing a direct call it removes the check against the handler's hash (list of functions it wants callbacks to), and it may call that handler for functions that the handler did not request calls for. On 32bit x86, which does not support the ftrace_ops use with function graph tracer, it shows the issue: ~# trace-cmd start -p function -l schedule ~# trace-cmd show # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 2) * 11898.94 us | schedule(); 3) # 1783.041 us | schedule(); 1) | schedule() { ------------------------------------------ 1) bash-8369 => kworker-7669 ------------------------------------------ 1) | schedule() { ------------------------------------------ 1) kworker-7669 => bash-8369 ------------------------------------------ 1) + 97.004 us | } 1) | schedule() { [..] Now by starting the function tracer is another instance: ~# trace-cmd start -B foo -p function This causes the function graph tracer to trace all functions (because the function trace calls the function graph tracer for each on, and the function graph trace is doing a direct call): ~# trace-cmd show # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 1) 1.669 us | } /* preempt_count_sub */ 1) + 10.443 us | } /* _raw_spin_unlock_irqrestore */ 1) | tick_program_event() { 1) | clockevents_program_event() { 1) 1.044 us | ktime_get(); 1) 6.481 us | lapic_next_event(); 1) + 10.114 us | } 1) + 11.790 us | } 1) ! 181.223 us | } /* hrtimer_interrupt */ 1) ! 184.624 us | } /* __sysvec_apic_timer_interrupt */ 1) | irq_exit_rcu() { 1) 0.678 us | preempt_count_sub(); When it should still only be tracing the schedule() function. To fix this, add a macro FGRAPH_NO_DIRECT to be set to 0 when the architecture does not support function graph use of ftrace_ops, and set to 1 otherwise. Then use this macro to know to allow function graph tracer to call the handlers directly or not. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mark Rutland <mark.rutland@arm.com> Link: https://patch.msgid.link/20260218104244.5f14dade@gandalf.local.home Fixes: cc60ee813b503 ("function_graph: Use static_call and branch to optimize entry function") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-19tracing: ring-buffer: Fix to check event length before usingMasami Hiramatsu (Google)-1/+5
Check the event length before adding it for accessing next index in rb_read_data_buffer(). Since this function is used for validating possibly broken ring buffers, the length of the event could be broken. In that case, the new event (e + len) can point a wrong address. To avoid invalid memory access at boot, check whether the length of each event is in the possible range before using it. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: 5f3b6e839f3c ("ring-buffer: Validate boot range memory events") Link: https://patch.msgid.link/177123421541.142205.9414352170164678966.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-19ring-buffer: Fix possible dereference of uninitialized pointerDaniil Dulov-1/+2
There is a pointer head_page in rb_meta_validate_events() which is not initialized at the beginning of a function. This pointer can be dereferenced if there is a failure during reader page validation. In this case the control is passed to "invalid" label where the pointer is dereferenced in a loop. To fix the issue initialize orig_head and head_page before calling rb_validate_buffer. Found by Linux Verification Center (linuxtesting.org) with SVACE. Cc: stable@vger.kernel.org Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://patch.msgid.link/20260213100130.2013839-1-d.dulov@aladdin.ru Closes: https://lore.kernel.org/r/202406130130.JtTGRf7W-lkp@intel.com/ Fixes: 5f3b6e839f3c ("ring-buffer: Validate boot range memory events") Signed-off-by: Daniil Dulov <d.dulov@aladdin.ru> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-19Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds-23/+25
Pull bpf fixes from Alexei Starovoitov: - Fix invalid write loop logic in libbpf's bpf_linker__add_buf() (Amery Hung) - Fix a potential use-after-free of BTF object (Anton Protopopov) - Add feature detection to libbpf and avoid moving arena global variables on older kernels (Emil Tsalapatis) - Remove extern declaration of bpf_stream_vprintk() from libbpf headers (Ihor Solodrai) - Fix truncated netlink dumps in bpftool (Jakub Kicinski) - Fix map_kptr grace period wait in bpf selftests (Kumar Kartikeya Dwivedi) - Remove hexdump dependency while building bpf selftests (Matthieu Baerts) - Complete fsession support in BPF trampolines on riscv (Menglong Dong) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Remove hexdump dependency libbpf: Remove extern declaration of bpf_stream_vprintk() selftests/bpf: Use vmlinux.h in test_xdp_meta bpftool: Fix truncated netlink dumps libbpf: Delay feature gate check until object prepare time libbpf: Do not use PROG_TYPE_TRACEPOINT program for feature gating bpf: Add a map/btf from a fd array more consistently selftests/bpf: Fix map_kptr grace period wait selftests/bpf: enable fsession_test on riscv64 selftests/bpf: Adjust selftest due to function rename bpf, riscv: add fsession support for trampolines bpf: Fix a potential use-after-free of BTF object bpf, riscv: introduce emit_store_stack_imm64() for trampoline libbpf: Fix invalid write loop logic in bpf_linker__add_buf() libbpf: Add gating for arena globals relocation feature
2026-02-18Merge tag 'mm-nonmm-stable-2026-02-18-19-56' of ↵Linus Torvalds-13/+15
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more non-MM updates from Andrew Morton: - "two fixes in kho_populate()" fixes a couple of not-major issues in the kexec handover code (Ran Xiaokai) - misc singletons * tag 'mm-nonmm-stable-2026-02-18-19-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: lib/group_cpus: handle const qualifier from clusters allocation type kho: remove unnecessary WARN_ON(err) in kho_populate() kho: fix missing early_memunmap() call in kho_populate() scripts/gdb: implement x86_page_ops in mm.py objpool: fix the overestimation of object pooling metadata size selftests/memfd: use IPC semaphore instead of SIGSTOP/SIGCONT delayacct: fix build regression on accounting tool
2026-02-18Merge tag 'mm-stable-2026-02-18-19-48' of ↵Linus Torvalds-19/+37
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - "mm/vmscan: fix demotion targets checks in reclaim/demotion" fixes a couple of issues in the demotion code - pages were failed demotion and were finding themselves demoted into disallowed nodes (Bing Jiao) - "Remove XA_ZERO from error recovery of dup_mmap()" fixes a rare mapledtree race and performs a number of cleanups (Liam Howlett) - "mm: add bitmap VMA flag helpers and convert all mmap_prepare to use them" implements a lot of cleanups following on from the conversion of the VMA flags into a bitmap (Lorenzo Stoakes) - "support batch checking of references and unmapping for large folios" implements batching to greatly improve the performance of reclaiming clean file-backed large folios (Baolin Wang) - "selftests/mm: add memory failure selftests" does as claimed (Miaohe Lin) * tag 'mm-stable-2026-02-18-19-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (36 commits) mm/page_alloc: clear page->private in free_pages_prepare() selftests/mm: add memory failure dirty pagecache test selftests/mm: add memory failure clean pagecache test selftests/mm: add memory failure anonymous page test mm: rmap: support batched unmapping for file large folios arm64: mm: implement the architecture-specific clear_flush_young_ptes() arm64: mm: support batch clearing of the young flag for large folios arm64: mm: factor out the address and ptep alignment into a new helper mm: rmap: support batched checks of the references for large folios tools/testing/vma: add VMA userland tests for VMA flag functions tools/testing/vma: separate out vma_internal.h into logical headers tools/testing/vma: separate VMA userland tests into separate files mm: make vm_area_desc utilise vma_flags_t only mm: update all remaining mmap_prepare users to use vma_flags_t mm: update shmem_[kernel]_file_*() functions to use vma_flags_t mm: update secretmem to use VMA flags on mmap_prepare mm: update hugetlbfs to use VMA flags on mmap_prepare mm: add basic VMA flag operation helper functions tools: bitmap: add missing bitmap_[subset(), andnot()] mm: add mk_vma_flags() bitmap flag macro helper ...
2026-02-18Merge tag 'sysctl-7.00-rc1' of ↵Linus Torvalds-42/+385
git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl Pull sysctl updates from Joel Granados: - Remove macros from proc handler converters Replace the proc converter macros with "regular" functions. Though it is more verbose than the macro version, it helps when debugging and better aligns with coding-style.rst. - General cleanup Remove superfluous ctl_table forward declarations. Const qualify the memory_allocation_profiling_sysctl and loadpin_sysctl_table arrays. Add missing kernel doc to proc_dointvec_conv. - Testing This series was run through sysctl selftests/kunit test suite in x86_64. And went into linux-next after rc4, giving it a good 3 weeks of testing * tag 'sysctl-7.00-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl: sysctl: replace SYSCTL_INT_CONV_CUSTOM macro with functions sysctl: Replace unidirectional INT converter macros with functions sysctl: Add kernel doc to proc_douintvec_conv sysctl: Replace UINT converter macros with functions sysctl: Add CONFIG_PROC_SYSCTL guards for converter macros sysctl: clarify proc_douintvec_minmax doc sysctl: Return -ENOSYS from proc_douintvec_conv when CONFIG_PROC_SYSCTL=n sysctl: Remove unused ctl_table forward declarations loadpin: Implement custom proc_handler for enforce alloc_tag: move memory_allocation_profiling_sysctls into .rodata sysctl: Add missing kernel-doc for proc_dointvec_conv
2026-02-18unshare: fix unshare_fs() handlingAl Viro-1/+1
There's an unpleasant corner case in unshare(2), when we have a CLONE_NEWNS in flags and current->fs hadn't been shared at all; in that case copy_mnt_ns() gets passed current->fs instead of a private copy, which causes interesting warts in proof of correctness] > I guess if private means fs->users == 1, the condition could still be true. Unfortunately, it's worse than just a convoluted proof of correctness. Consider the case when we have CLONE_NEWCGROUP in addition to CLONE_NEWNS (and current->fs->users == 1). We pass current->fs to copy_mnt_ns(), all right. Suppose it succeeds and flips current->fs->{pwd,root} to corresponding locations in the new namespace. Now we proceed to copy_cgroup_ns(), which fails (e.g. with -ENOMEM). We call put_mnt_ns() on the namespace created by copy_mnt_ns(), it's destroyed and its mount tree is dissolved, but... current->fs->root and current->fs->pwd are both left pointing to now detached mounts. They are pinning those, so it's not a UAF, but it leaves the calling process with unshare(2) failing with -ENOMEM _and_ leaving it with pwd and root on detached isolated mounts. The last part is clearly a bug. There is other fun related to that mess (races with pivot_root(), including the one between pivot_root() and fork(), of all things), but this one is easy to isolate and fix - treat CLONE_NEWNS as "allocate a new fs_struct even if it hadn't been shared in the first place". Sure, we could go for something like "if both CLONE_NEWNS *and* one of the things that might end up failing after copy_mnt_ns() call in create_new_namespaces() are set, force allocation of new fs_struct", but let's keep it simple - the cost of copy_fs_struct() is trivial. Another benefit is that copy_mnt_ns() with CLONE_NEWNS *always* gets a freshly allocated fs_struct, yet to be attached to anything. That seriously simplifies the analysis... FWIW, that bug had been there since the introduction of unshare(2) ;-/ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://patch.msgid.link/20260207082524.GE3183987@ZenIV Tested-by: Waiman Long <longman@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>