linux - Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/

Age	Commit message (Collapse)	Author	Lines
2026-02-03	Merge branch 'v6.19-rc8'	Peter Zijlstra	-372/+644
	Update to avoid conflicts with /urgent patches. Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2026-02-02	liveupdate: luo_file: do not clear serialized_data on unfreeze	Pratyush Yadav (Google)	-2/+0
	Patch series "liveupdate: fixes in error handling". This series contains some fixes in LUO's error handling paths. The first patch deals with failed freeze() attempts. The cleanup path calls unfreeze, and that clears some data needed by later unpreserve calls. The second patch is a bit more involved. It deals with failed retrieve() attempts. To do so properly, it reworks some of the error handling logic in luo_file core. Both these fixes are "theoretical" -- in the sense that I have not been able to reproduce either of them in normal operation. The only supported file type right now is memfd, and there is nothing userspace can do right now to make it fail its retrieve or freeze. I need to make the retrieve or freeze fail by artificially injecting errors. The injected errors trigger a use-after-free and a double-free. That said, once more complex file handlers are added or memfd preservation is used in ways not currently expected or covered by the tests, we will be able to see them on real systems. This patch (of 2): The unfreeze operation is supposed to undo the effects of the freeze operation. serialized_data is not set by freeze, but by preserve. Consequently, the unpreserve operation needs to access serialized_data to undo the effects of the preserve operation. This includes freeing the serialized data structures for example. If a freeze callback fails, unfreeze is called for all frozen files. This would clear serialized_data for them. Since live update has failed, it can be expected that userspace aborts, releasing all sessions. When the sessions are released, unpreserve will be called for all files. The unfrozen files will see 0 in their serialized_data. This is not expected by file handlers, and they might either fail, leaking data and state, or might even crash or cause invalid memory access. Do not clear serialized_data on unfreeze so it gets passed on to unpreserve. There is no need to clear it on unpreserve since luo_file will be freed immediately after. Link: https://lkml.kernel.org/r/20260126230302.2936817-1-pratyush@kernel.org Link: https://lkml.kernel.org/r/20260126230302.2936817-2-pratyush@kernel.org Fixes: 7c722a7f44e0 ("liveupdate: luo_file: implement file systems callbacks") Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-02	bpf: Replace snprintf("%s") with strscpy	Thorsten Blum	-2/+3
	Replace snprintf("%s") with the faster and more direct strscpy(). Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Link: https://lore.kernel.org/r/20260201215247.677121-2-thorsten.blum@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-02	Merge tag 'cgroup-for-6.19-rc8-fixes' of ↵	Linus Torvalds	-7/+63
	git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "Three dmem fixes from Chen Ridong addressing use-after-free, RCU warning, and NULL pointer dereference issues introduced with the dmem controller. All changes are confined to kernel/cgroup/dmem.c and can only affect dmem controller users" * tag 'cgroup-for-6.19-rc8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/dmem: avoid pool UAF cgroup/dmem: avoid rcu warning when unregister region cgroup/dmem: fix NULL pointer dereference when setting max
2026-02-02	uprobes: Fix incorrect lockdep condition in filter_chain()	Breno Leitao	-1/+1
	The list_for_each_entry_rcu() in filter_chain() uses rcu_read_lock_trace_held() as the lockdep condition, but the function holds consumer_rwsem, not the RCU trace lock. This gives me the following output when running with some locking debug option enabled: kernel/events/uprobes.c:1141 RCU-list traversed in non-reader section!! filter_chain register_for_each_vma uprobe_unregister_nosync __probe_event_disable Remove the incorrect lockdep condition since the rwsem provides sufficient protection for the list traversal. Fixes: cc01bd044e6a ("uprobes: travers uprobe's consumer list locklessly under SRCU protection") Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260128-uprobe_rcu-v2-1-994ea6d32730@debian.org
2026-02-02	cgroup/dmem: avoid pool UAF	Chen Ridong	-2/+58
	An UAF issue was observed: BUG: KASAN: slab-use-after-free in page_counter_uncharge+0x65/0x150 Write of size 8 at addr ffff888106715440 by task insmod/527 CPU: 4 UID: 0 PID: 527 Comm: insmod 6.19.0-rc7-next-20260129+ #11 Tainted: [O]=OOT_MODULE Call Trace: <TASK> dump_stack_lvl+0x82/0xd0 kasan_report+0xca/0x100 kasan_check_range+0x39/0x1c0 page_counter_uncharge+0x65/0x150 dmem_cgroup_uncharge+0x1f/0x260 Allocated by task 527: Freed by task 0: The buggy address belongs to the object at ffff888106715400 which belongs to the cache kmalloc-512 of size 512 The buggy address is located 64 bytes inside of freed 512-byte region [ffff888106715400, ffff888106715600) The buggy address belongs to the physical page: Memory state around the buggy address: ffff888106715300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff888106715380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff888106715400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff888106715480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888106715500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb The issue occurs because a pool can still be held by a caller after its associated memory region is unregistered. The current implementation frees the pool even if users still hold references to it (e.g., before uncharge operations complete). This patch adds a reference counter to each pool, ensuring that a pool is only freed when its reference count drops to zero. Fixes: b168ed458dde ("kernel/cgroup: Add "dmem" memory accounting cgroup") Cc: stable@vger.kernel.org # v6.14+ Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-02	cgroup/dmem: avoid rcu warning when unregister region	Chen Ridong	-5/+2
	A warnning was detected: WARNING: suspicious RCU usage 6.19.0-rc7-next-20260129+ #1101 Tainted: G O kernel/cgroup/dmem.c:456 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by insmod/532: #0: ffffffff85e78b38 (dmemcg_lock){+.+.}-dmem_cgroup_unregister_region+ stack backtrace: CPU: 2 UID: 0 PID: 532 Comm: insmod Tainted: 6.19.0-rc7-next- Tainted: [O]=OOT_MODULE Call Trace: <TASK> dump_stack_lvl+0xb0/0xd0 lockdep_rcu_suspicious+0x151/0x1c0 dmem_cgroup_unregister_region+0x1e2/0x380 ? __pfx_dmem_test_init+0x10/0x10 [dmem_uaf] dmem_test_init+0x65/0xff0 [dmem_uaf] do_one_initcall+0xbb/0x3a0 The macro list_for_each_rcu() must be used within an RCU read-side critical section (between rcu_read_lock() and rcu_read_unlock()). Using it outside that context, as seen in dmem_cgroup_unregister_region(), triggers the lockdep warning because the RCU protection is not guaranteed. Replace list_for_each_rcu() with list_for_each_entry_safe(), which is appropriate for traversal under spinlock protection where nodes may be deleted. Fixes: b168ed458dde ("kernel/cgroup: Add "dmem" memory accounting cgroup") Cc: stable@vger.kernel.org # v6.14+ Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-02	cgroup/dmem: fix NULL pointer dereference when setting max	Chen Ridong	-0/+3
	An issue was triggered: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP NOPTI CPU: 15 UID: 0 PID: 658 Comm: bash Tainted: 6.19.0-rc6-next-2026012 Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), RIP: 0010:strcmp+0x10/0x30 RSP: 0018:ffffc900017f7dc0 EFLAGS: 00000246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff888107cd4358 RDX: 0000000019f73907 RSI: ffffffff82cc381a RDI: 0000000000000000 RBP: ffff8881016bef0d R08: 000000006c0e7145 R09: 0000000056c0e714 R10: 0000000000000001 R11: ffff888107cd4358 R12: 0007ffffffffffff R13: ffff888101399200 R14: ffff888100fcb360 R15: 0007ffffffffffff CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000000105c79000 CR4: 00000000000006f0 Call Trace: <TASK> dmemcg_limit_write.constprop.0+0x16d/0x390 ? __pfx_set_resource_max+0x10/0x10 kernfs_fop_write_iter+0x14e/0x200 vfs_write+0x367/0x510 ksys_write+0x66/0xe0 do_syscall_64+0x6b/0x390 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f42697e1887 It was trriggered setting max without limitation, the command is like: "echo test/region0 > dmem.max". To fix this issue, add check whether options is valid after parsing the region_name. Fixes: b168ed458dde ("kernel/cgroup: Add "dmem" memory accounting cgroup") Cc: stable@vger.kernel.org # v6.14+ Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-02	ftrace: Fix direct_functions leak in update_ftrace_direct_del	Jiri Olsa	-0/+1
	Alexei reported memory leak in update_ftrace_direct_del. We miss cleanup of the replaced direct_functions in the success path in update_ftrace_direct_del, adding that. Fixes: 8d2c1233f371 ("ftrace: Add update_ftrace_direct_del function") Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Closes: https://lore.kernel.org/bpf/aX_BxG5EJTJdCMT9@krava/T/#m7c13f5a95f862ed7ab78e905fbb678d635306a0c Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20260202075849.1684369-1-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-02	time/kunit: Document handling of negative years of is_leap()	Mark Brown	-1/+3
	The code local is_leap() helper was tried to be replaced by the RTC is_leap_year() function. Unfortunately the two aren't exactly equivalent, as the kunit variant uses a signed value for the year and the RTC an unsigned one. Since the KUnit tests cover a 16000 year range around the epoch they use year values that are very comfortably negative and hence get mishandled when passed into is_leap_year(). The change was reverted, so add a comment which prevents further attempts to do so. [ tglx: Adapted to the revert ] Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260130-kunit-fix-leap-year-v1-1-92ddf55dffd7@kernel.org
2026-02-02	dma: contiguous: Check return value of dma_contiguous_reserve_area()	Shanker Donthineni	-4/+6
	Commit 8f1fc1bf1a3d ("dma: contiguous: Reserve default CMA heap") introduced a bug where dma_heap_cma_register_heap() is called with a NULL pointer when dma_contiguous_reserve_area() fails to reserve the CMA area. When dma_contiguous_reserve_area() fails, dma_contiguous_default_area remains NULL (initialized as a global variable), but the code doesn't check the return value and proceeds to call dma_heap_cma_register_heap() with this NULL pointer. Later during boot, add_cma_heaps() iterates through the dma_areas[] array and attempts to register heaps. When it encounters the NULL pointer stored by the earlier call, it crashes in __add_cma_heap() -> dma_heap_add() when trying to dereference the NULL CMA pointer. The crash manifests as: Unable to handle kernel NULL pointer dereference at virtual address 0000000000000038 ... Call trace: dma_heap_add+0x40/0x2b0 __add_cma_heap+0x80/0xe0 add_cma_heaps+0x64/0xb0 do_one_initcall+0x60/0x318 kernel_init_freeable+0x260/0x2f0 kernel_init+0x2c/0x168 ret_from_fork+0x10/0x20 Fix this by checking the return value of dma_contiguous_reserve_area() and only calling dma_heap_cma_register_heap() when the reservation succeeds. Fixes: 8f1fc1bf1a3d ("dma: contiguous: Reserve default CMA heap") Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com> Reviewed-by: T.J. Mercier <tjmercier@google.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260129181317.2429196-1-sdonthineni@nvidia.com
2026-02-01	Merge tag 'perf-urgent-2026-02-01' of ↵	Linus Torvalds	-4/+4
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf events fix from Ingo Molnar: "Fix a race in the user-callchains code" * tag 'perf-urgent-2026-02-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: sched: Fix perf crash with new is_user_task() helper
2026-02-01	Merge tag 'sched-urgent-2026-02-01' of ↵	Linus Torvalds	-0/+12
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Fix a regression in the deferrable dl_server code that can cause the dl_server to be stuck" * tag 'sched-urgent-2026-02-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/deadline: Fix 'stuck' dl_server
2026-02-01	cpuset: fix overlap of partition effective CPUs	Chen Ridong	-13/+6
	A warning was detect: WARNING: kernel/cgroup/cpuset.c:825 at rebuild_sched_domains_locked Modules linked in: CPU: 12 UID: 0 PID: 681 Comm: rmdir 6.19.0-rc6-next-20260121+ RIP: 0010:rebuild_sched_domains_locked+0x309/0x4b0 RSP: 0018:ffffc900019bbd28 EFLAGS: 00000202 RAX: ffff888104413508 RBX: 0000000000000008 RCX: ffff888104413510 RDX: ffff888109b5f400 RSI: 000000000000ffcf RDI: 0000000000000001 RBP: 0000000000000002 R08: ffff888104413508 R09: 0000000000000002 R10: ffff888104413508 R11: 0000000000000001 R12: ffff888104413500 R13: 0000000000000002 R14: ffffc900019bbd78 R15: 0000000000000000 FS: 00007fe274b8d740(0000) GS:ffff8881b6b3c000(0000) knlGS: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe274c98b50 CR3: 00000001047a9000 CR4: 00000000000006f0 Call Trace: <TASK> update_prstate+0x1c7/0x580 cpuset_css_killed+0x2f/0x50 kill_css+0x32/0x180 cgroup_destroy_locked+0xa7/0x200 cgroup_rmdir+0x28/0x100 kernfs_iop_rmdir+0x4c/0x80 vfs_rmdir+0x12c/0x280 filename_rmdir+0x19e/0x200 __x64_sys_rmdir+0x23/0x40 do_syscall_64+0x6b/0x390 It can be reproduced by steps: # cd /sys/fs/cgroup/ # mkdir A1 # mkdir B1 # mkdir C1 # echo 1-3 > A1/cpuset.cpus # echo root > A1/cpuset.cpus.partition # echo 3-5 > B1/cpuset.cpus # echo root > B1/cpuset.cpus.partition # echo 6 > C1/cpuset.cpus # echo root > C1/cpuset.cpus.partition # rmdir A1/ # rmdir C1/ Both A1 and B1 were initially configured with CPU 3, which was exclusively assigned to A1's partition. When A1 was removed, CPU 3 was returned to the root pool. However, B1 incorrectly regained access to CPU 3 when update_cpumasks_hier was triggered during C1's removal, which also updated sibling configurations. The update_sibling_cpumasks function was called to synchronize siblings' effective CPUs due to changes in their parent's effective CPUs. However, parent effective CPU changes should not affect partition-effective CPUs. To fix this issue, update_cpumasks_hier should only be invoked when the sibling is not a valid partition in the update_sibling_cpumasks. Fixes: 2a3602030d80 ("cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus conflict") Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-01	cgroup: increase maximum subsystem count from 16 to 32	Chen Ridong	-34/+34
	The current cgroup subsystem limit of 16 is insufficient, as the number of existing subsystems has already reached this limit. When adding a new subsystem that is not yet in the mainline kernel, building with `make allmodconfig` requires first bypassing the `BUILD_BUG_ON(CGROUP_SUBSYS_COUNT > 16)` restriction to allow compilation to succeed. However, the kernel still fails to boot afterward. This patch increases the maximum number of supported cgroup subsystems from 16 to 32, providing enough room for future subsystem additions. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Tested-by: JP Kobryn <inwardvessel@gmail.com> Acked-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-01-31	kho: skip memoryless NUMA nodes when reserving scratch areas	Evangelos Petrongonas	-2/+6
	kho_reserve_scratch() iterates over all online NUMA nodes to allocate per-node scratch memory. On systems with memoryless NUMA nodes (nodes that have CPUs but no memory), memblock_alloc_range_nid() fails because there is no memory available on that node. This causes KHO initialization to fail and kho_enable to be set to false. Some ARM64 systems have NUMA topologies where certain nodes contain only CPUs without any associated memory. These configurations are valid and should not prevent KHO from functioning. Fix this by only counting nodes that have memory (N_MEMORY state) and skip memoryless nodes in the per-node scratch allocation loop. Link: https://lkml.kernel.org/r/20260120175913.34368-1-epetron@amazon.de Fixes: 3dc92c311498 ("kexec: add Kexec HandOver (KHO) generation helpers"). Signed-off-by: Evangelos Petrongonas <epetron@amazon.de> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Alexander Graf <graf@amazon.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31	crash_dump: fix dm_crypt keys locking and ref leak	Vasily Gorbik	-4/+13
	crash_load_dm_crypt_keys() reads dm-crypt volume keys from the user keyring. It uses user_key_payload_locked() without holding key->sem, which makes lockdep complain when kexec_file_load() assembles the crash image: ============================= WARNING: suspicious RCU usage ----------------------------- ./include/keys/user-type.h:53 suspicious rcu_dereference_protected() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 no locks held by kexec/4875. stack backtrace: Call Trace: <TASK> dump_stack_lvl+0x5d/0x80 lockdep_rcu_suspicious.cold+0x4e/0x96 crash_load_dm_crypt_keys+0x314/0x390 bzImage64_load+0x116/0x9a0 ? __lock_acquire+0x464/0x1ba0 __do_sys_kexec_file_load+0x26a/0x4f0 do_syscall_64+0xbd/0x430 entry_SYSCALL_64_after_hwframe+0x77/0x7f In addition, the key returned by request_key() is never key_put()'d, leaking a key reference on each load attempt. Take key->sem while copying the payload and drop the key reference afterwards. Link: https://lkml.kernel.org/r/patch.git-2d4d76083a5c.your-ad-here.call-01769426386-ext-2560@work.hours Fixes: 479e58549b0f ("crash_dump: store dm crypt keys in kdump reserved memory") Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Coiby Xu <coxu@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31	kho: cleanup error handling in kho_populate()	Mike Rapoport (Microsoft)	-22/+17
	* use dedicated labels for error handling instead of checking if a pointer is not null to decide if it should be unmapped * drop assignment of values to err that are only used to print a numeric error code, there are pr_warn()s for each failure already so printing a numeric error code in the next line does not add anything useful Link: https://lkml.kernel.org/r/20260122121757.575987-1-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31	ucount: check for CAP_SYS_RESOURCE using ns_capable_noaudit()	Ondrej Mosnacek	-1/+1
	The user.* sysctls implement the ctl_table_root::permissions hook and they override the file access mode based on the CAP_SYS_RESOURCE capability (at most rwx if capable, at most r-- if not). The capability is being checked unconditionally, so if an LSM denies the capability, an audit record may be logged even when access is in fact granted. Given the logic in the set_permissions() function in kernel/ucount.c and the unfortunate way the permission checking is implemented, it doesn't seem viable to avoid false positive denials by deferring the capability check. Thus, do the same as in net_ctl_permissions() (net/sysctl_net.c) - switch from ns_capable() to ns_capable_noaudit(), so that the check never logs an audit record. Link: https://lkml.kernel.org/r/20260122140745.239428-1-omosnace@redhat.com Fixes: dbec28460a89 ("userns: Add per user namespace sysctls.") Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Reviewed-by: Paul Moore <paul@paul-moore.com> Acked-by: Serge Hallyn <serge@hallyn.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Alexey Gladkov <legion@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31	kexec: derive purgatory entry from symbol	Li Chen	-57/+74
	kexec_load_purgatory() derives image->start by locating e_entry inside an SHF_EXECINSTR section. If the purgatory object contains multiple executable sections with overlapping sh_addr, the entrypoint check can match more than once and trigger a WARN. Derive the entry section from the purgatory_start symbol when present and compute image->start from its final placement. Keep the existing e_entry fallback for purgatories that do not expose the symbol. WARNING: kernel/kexec_file.c:1009 at kexec_load_purgatory+0x395/0x3c0, CPU#10: kexec/1784 Call Trace: <TASK> bzImage64_load+0x133/0xa00 __do_sys_kexec_file_load+0x2b3/0x5c0 do_syscall_64+0x81/0x610 entry_SYSCALL_64_after_hwframe+0x76/0x7e [me@linux.beauty: move helper to avoid forward declaration, per Baoquan] Link: https://lkml.kernel.org/r/20260128043511.316860-1-me@linux.beauty Link: https://lkml.kernel.org/r/20260120124005.148381-1-me@linux.beauty Fixes: 8652d44f466a ("kexec: support purgatories with .text.hot sections") Signed-off-by: Li Chen <me@linux.beauty> Acked-by: Baoquan He <bhe@redhat.com> Cc: Alexander Graf <graf@amazon.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Li Chen <me@linux.beauty> Cc: Philipp Rudo <prudo@redhat.com> Cc: Ricardo Ribalda Delgado <ribalda@chromium.org> Cc: Ross Zwisler <zwisler@google.com> Cc: Sourabh Jain <sourabhjain@linux.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31	delayacct: add timestamp of delay max	Wang Yaxin	-11/+28
	Problem ======= Commit 658eb5ab916d ("delayacct: add delay max to record delay peak") introduced the delay max for getdelays, which records abnormal latency peaks and helps us understand the magnitude of such delays. However, the peak latency value alone is insufficient for effective root cause analysis. Without the precise timestamp of when the peak occurred, we still lack the critical context needed to correlate it with other system events. Solution ======== To address this, we need to additionally record a precise timestamp when the maximum latency occurs. By correlating this timestamp with system logs and monitoring metrics, we can identify processes with abnormal resource usage at the same moment, which can help us to pinpoint root causes. Use Case ======== bash-4.4# ./getdelays -d -t 227 print delayacct stats ON TGID 227 CPU count real total virtual total delay total delay average delay max delay min delay max timestamp 46 188000000 192348334 4098012 0.089ms 0.429260ms 0.051205ms 2026-01-15T15:06:58 IO count delay total delay average delay max delay min delay max timestamp 0 0 0.000ms 0.000000ms 0.000000ms N/A SWAP count delay total delay average delay max delay min delay max timestamp 0 0 0.000ms 0.000000ms 0.000000ms N/A RECLAIM count delay total delay average delay max delay min delay max timestamp 0 0 0.000ms 0.000000ms 0.000000ms N/A THRAS HING count delay total delay average delay max delay min delay max timestamp 0 0 0.000ms 0.000000ms 0.000000ms N/A COMPACT count delay total delay average delay max delay min delay max timestamp 0 0 0.000ms 0.000000ms 0.000000ms N/A WPCOPY count delay total delay average delay max delay min delay max timestamp 182 19413338 0.107ms 0.547353ms 0.022462ms 2026-01-15T15:05:24 IRQ count delay total delay average delay max delay min delay max timestamp 0 0 0.000ms 0.000000ms 0.000000ms N/A Link: https://lkml.kernel.org/r/20260119100241520gWubW8-5QfhSf9gjqcc_E@zte.com.cn Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn> Cc: Fan Yu <fan.yu9@zte.com.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Yang Yang <yang.yang29@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31	tracing: remove size parameter in __trace_puts()	Steven Rostedt	-5/+4
	The __trace_puts() function takes a string pointer and the size of the string itself. All users currently simply pass in the strlen() of the string it is also passing in. There's no reason to pass in the size. Instead have the __trace_puts() function do the strlen() within the function itself. This fixes a header recursion issue where using strlen() in the macro calling __trace_puts() requires adding #include <linux/string.h> in order to use strlen(). Removing the use of strlen() from the header fixes the recursion issue. Link: https://lore.kernel.org/all/aUN8Hm377C5A0ILX@yury/ Link: https://lkml.kernel.org/r/20260116042510.241009-6-ynorov@nvidia.com Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Yury Norov <ynorov@nvidia.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Andi Shyti <andi.shyti@linux.intel.com> Cc: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jani Nikula <jani.nikula@intel.com> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31	kho: simplify page initialization in kho_restore_page()	Pratyush Yadav	-14/+26
	When restoring a page (from kho_restore_pages()) or folio (from kho_restore_folio()), KHO must initialize the struct page. The initialization differs slightly depending on if a folio is requested or a set of 0-order pages is requested. Conceptually, it is quite simple to understand. When restoring 0-order pages, each page gets a refcount of 1 and that's it. When restoring a folio, head page gets a refcount of 1 and tail pages get 0. kho_restore_page() tries to combine the two separate initialization flow into one piece of code. While it works fine, it is more complicated to read than it needs to be. Make the code simpler by splitting the two initalization paths into two separate functions. This improves readability by clearly showing how each type must be initialized. Link: https://lkml.kernel.org/r/20260116112217.915803-3-pratyush@kernel.org Signed-off-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Alexander Graf <graf@amazon.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31	kho: use unsigned long for nr_pages	Pratyush Yadav	-5/+6
	Patch series "kho: clean up page initialization logic", v2. This series simplifies the page initialization logic in kho_restore_page(). It was originally only a single patch [0], but on Pasha's suggestion, I added another patch to use unsigned long for nr_pages. Technically speaking, the patches aren't related and can be applied independently, but bundling them together since patch 2 relies on 1 and it is easier to manage them this way. This patch (of 2): With 4k pages, a 32-bit nr_pages can span up to 16 TiB. While it is a lot, there exist systems with terabytes of RAM. gup is also moving to using long for nr_pages. Use unsigned long and make KHO future-proof. Link: https://lkml.kernel.org/r/20260116112217.915803-1-pratyush@kernel.org Link: https://lkml.kernel.org/r/20260116112217.915803-2-pratyush@kernel.org Signed-off-by: Pratyush Yadav <pratyush@kernel.org> Suggested-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Alexander Graf <graf@amazon.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31	Merge branch 'mm-hotfixes-stable' into mm-nonmm-stable to pick up changes	Andrew Morton	-2/+16
	required to merge "kho: use unsigned long for nr_pages".
2026-01-31	mm, swap: cleanup swap entry management workflow	Kairui Song	-4/+6
	The current swap entry allocation/freeing workflow has never had a clear definition. This makes it hard to debug or add new optimizations. This commit introduces a proper definition of how swap entries would be allocated and freed. Now, most operations are folio based, so they will never exceed one swap cluster, and we now have a cleaner border between swap and the rest of mm, making it much easier to follow and debug, especially with new added sanity checks. Also making more optimization possible. Swap entry will be mostly freed and free with a folio bound. The folio lock will be useful for resolving many swap related races. Now swap allocation (except hibernation) always starts with a folio in the swap cache, and gets duped/freed protected by the folio lock: - folio_alloc_swap() - The only allocation entry point now. Context: The folio must be locked. This allocates one or a set of continuous swap slots for a folio and binds them to the folio by adding the folio to the swap cache. The swap slots' swap count start with zero value. - folio_dup_swap() - Increase the swap count of one or more entries. Context: The folio must be locked and in the swap cache. For now, the caller still has to lock the new swap entry owner (e.g., PTL). This increases the ref count of swap entries allocated to a folio. Newly allocated swap slots' count has to be increased by this helper as the folio got unmapped (and swap entries got installed). - folio_put_swap() - Decrease the swap count of one or more entries. Context: The folio must be locked and in the swap cache. For now, the caller still has to lock the new swap entry owner (e.g., PTL). This decreases the ref count of swap entries allocated to a folio. Typically, swapin will decrease the swap count as the folio got installed back and the swap entry got uninstalled This won't remove the folio from the swap cache and free the slot. Lazy freeing of swap cache is helpful for reducing IO. There is already a folio_free_swap() for immediate cache reclaim. This part could be further optimized later. The above locking constraints could be further relaxed when the swap table is fully implemented. Currently dup still needs the caller to lock the swap entry container (e.g. PTL), or a concurrent zap may underflow the swap count. Some swap users need to interact with swap count without involving folio (e.g. forking/zapping the page table or mapping truncate without swapin). In such cases, the caller has to ensure there is no race condition on whatever owns the swap count and use the below helpers: - swap_put_entries_direct() - Decrease the swap count directly. Context: The caller must lock whatever is referencing the slots to avoid a race. Typically the page table zapping or shmem mapping truncate will need to free swap slots directly. If a slot is cached (has a folio bound), this will also try to release the swap cache. - swap_dup_entry_direct() - Increase the swap count directly. Context: The caller must lock whatever is referencing the entries to avoid race, and the entries must already have a swap count > 1. Typically, forking will need to copy the page table and hence needs to increase the swap count of the entries in the table. The page table is locked while referencing the swap entries, so the entries all have a swap count > 1 and can't be freed. Hibernation subsystem is a bit different, so two special wrappers are here: - swap_alloc_hibernation_slot() - Allocate one entry from one device. - swap_free_hibernation_slot() - Free one entry allocated by the above helper. All hibernation entries are exclusive to the hibernation subsystem and should not interact with ordinary swap routines. By separating the workflows, it will be possible to bind folio more tightly with swap cache and get rid of the SWAP_HAS_CACHE as a temporary pin. This commit should not introduce any behavior change [kasong@tencent.com: fix leak, per Chris Mason. Remove WARN_ON, per Lai Yi] Link: https://lkml.kernel.org/r/CAMgjq7AUz10uETVm8ozDWcB3XohkOqf0i33KGrAquvEVvfp5cg@mail.gmail.com [ryncsn@gmail.com: fix KSM copy pages for swapoff, per Chris] Link: https://lkml.kernel.org/r/aXxkANcET3l2Xu6J@KASONG-MC4 Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-14-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Signed-off-by: Kairui Song <ryncsn@gmail.com> Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Cc: Chris Mason <clm@fb.com> Cc: Chris Mason <clm@meta.com> Cc: Lai Yi <yi1.lai@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31	Merge branch 'mm-hotfixes-stable' into mm-stable to pick up "mm/shmem,	Andrew Morton	-2/+16
	swap: fix race of truncate and swap entry split", needed for merging "mm, swap: cleanup swap entry management workflow".
2026-01-31	bpf: Add bpf_jit_supports_fsession()	Leon Hwang	-0/+10
	The added fsession does not prevent running on those architectures, that haven't added fsession support. For example, try to run fsession tests on arm64: test_fsession_basic:PASS:fsession_test__open_and_load 0 nsec test_fsession_basic:PASS:fsession_attach 0 nsec check_result:FAIL:test_run_opts err unexpected error: -14 (errno 14) In order to prevent such errors, add bpf_jit_supports_fsession() to guard those architectures. Fixes: 2d419c44658f ("bpf: add fsession support") Acked-by: Puranjay Mohan <puranjay@kernel.org> Tested-by: Puranjay Mohan <puranjay@kernel.org> Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20260131144950.16294-2-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-30	bpf: Consolidate special map field validation in verifier	Mykyta Yatsenko	-59/+11
	Consolidate all logic for verifying special map fields in the single function check_map_field_pointer(). Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20260130-verif_special_fields-v2-2-2c59e637da7d@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-30	bpf: Introduce struct bpf_map_desc in verifier	Mykyta Yatsenko	-39/+40
	Introduce struct bpf_map_desc to hold bpf_map pointer and map uid. Use this struct in both bpf_call_arg_meta and bpf_kfunc_call_arg_meta instead of having different representations: - bpf_call_arg_meta had separate map_ptr and map_uid fields - bpf_kfunc_call_arg_meta had an anonymous inline struct This unifies the map fields layout across both metadata structures, making the code more consistent and preparing for further refactoring of map field pointer validation. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20260130-verif_special_fields-v2-1-2c59e637da7d@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-30	perf: sched: Fix perf crash with new is_user_task() helper	Steven Rostedt	-4/+4
	In order to do a user space stacktrace the current task needs to be a user task that has executed in user space. It use to be possible to test if a task is a user task or not by simply checking the task_struct mm field. If it was non NULL, it was a user task and if not it was a kernel task. But things have changed over time, and some kernel tasks now have their own mm field. An idea was made to instead test PF_KTHREAD and two functions were used to wrap this check in case it became more complex to test if a task was a user task or not[1]. But this was rejected and the C code simply checked the PF_KTHREAD directly. It was later found that not all kernel threads set PF_KTHREAD. The io-uring helpers instead set PF_USER_WORKER and this needed to be added as well. But checking the flags is still not enough. There's a very small window when a task exits that it frees its mm field and it is set back to NULL. If perf were to trigger at this moment, the flags test would say its a user space task but when perf would read the mm field it would crash with at NULL pointer dereference. Now there are flags that can be used to test if a task is exiting, but they are set in areas that perf may still want to profile the user space task (to see where it exited). The only real test is to check both the flags and the mm field. Instead of making this modification in every location, create a new is_user_task() helper function that does all the tests needed to know if it is safe to read the user space memory or not. [1] https://lore.kernel.org/all/20250425204120.639530125@goodmis.org/ Fixes: 90942f9fac05 ("perf: Use current->flags & PF_KTHREAD\|PF_USER_WORKER instead of current->mm == NULL") Closes: https://lore.kernel.org/all/0d877e6f-41a7-4724-875d-0b0a27b8a545@roeck-us.net/ Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Guenter Roeck <linux@roeck-us.net> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260129102821.46484722@gandalf.local.home
2026-01-30	sched/deadline: Fix 'stuck' dl_server	Peter Zijlstra	-0/+12
	Andrea reported the dl_server getting stuck for him. He tracked it down to a state where dl_server_start() saw dl_defer_running==1, but the dl_server's job is no longer valid at the time of dl_server_start(). In the state diagram this corresponds to [4] D->A (or dl_server_stop() due to no more runnable tasks) followed by [1], which in case of a lapsed deadline must then be A->B. Now our A has dl_defer_running==1, while B demands dl_defer_running==0, therefore it must get cleared when the CBS wakeup rules demand a replenish. Fixes: a110a81c52a9 ("sched/deadline: Deferrable dl server") Reported-by: Andrea Righi arighi@nvidia.com Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Tested-by: Andrea Righi arighi@nvidia.com Link: https://lkml.kernel.org/r/20260123161645.2181752-1-arighi@nvidia.com Link: https://patch.msgid.link/20260130124100.GC1079264@noisy.programming.kicks-ass.net
2026-01-30	Merge tag 'dma-mapping-6.19-2026-01-30' of ↵	Linus Torvalds	-7/+16
	git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping fixes from Marek Szyprowski: - important fix for ARM 32-bit based systems using cma= kernel parameter (Oreoluwa Babatunde) - a fix for the corner case of the DMA atomic pool based allocations (Sai Sree Kartheek Adivi) * tag 'dma-mapping-6.19-2026-01-30' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma/pool: distinguish between missing and exhausted atomic pools of: reserved_mem: Allow reserved_mem framework detect "cma=" kernel param
2026-01-30	tick/nohz: Optimize check_tick_dependency() with early return	Ionut Nechita (Sunlight Linux)	-0/+3
	There is no point in iterating through individual tick dependency bits when the tick_stop tracepoint is disabled, which is the common case. When the trace point is disabled, return immediately based on the atomic value being zero or non-zero, skipping the per-bit evaluation. This optimization improves the hot path performance of tick dependency checks across all contexts (idle and non-idle), not just nohz_full CPUs. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ionut Nechita (Sunlight Linux) <sunlightlinux@gmail.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260128074558.15433-3-sunlightlinux@gmail.com
2026-01-30	bpf: Allow sleepable programs to use tail calls	Jiri Olsa	-1/+4
	Allowing sleepable programs to use tail calls. Making sure we can't mix sleepable and non-sleepable bpf programs in tail call map (BPF_MAP_TYPE_PROG_ARRAY) and allowing it to be used in sleepable programs. Sleepable programs can be preempted and sleep which might bring new source of race conditions, but both direct and indirect tail calls should not be affected. Direct tail calls work by patching direct jump to callee into bpf caller program, so no problem there. We atomically switch from nop to jump instruction. Indirect tail call reads the callee from the map and then jumps to it. The callee bpf program can't disappear (be released) from the caller, because it is executed under rcu lock (rcu_read_lock_trace). Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20260130081208.1130204-2-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-30	tracing: Add kerneldoc to trace_event_buffer_reserve()	Steven Rostedt	-0/+16
	Add a appropriate kerneldoc to trace_event_buffer_reserve() to make it easier to understand how that function is used. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260130103745.1126e4af@gandalf.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-30	tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast	Steven Rostedt	-4/+14
	The current use of guard(preempt_notrace)() within __DECLARE_TRACE() to protect invocation of __DO_TRACE_CALL() means that BPF programs attached to tracepoints are non-preemptible. This is unhelpful in real-time systems, whose users apparently wish to use BPF while also achieving low latencies. (Who knew?) One option would be to use preemptible RCU, but this introduces many opportunities for infinite recursion, which many consider to be counterproductive, especially given the relatively small stacks provided by the Linux kernel. These opportunities could be shut down by sufficiently energetic duplication of code, but this sort of thing is considered impolite in some circles. Therefore, use the shiny new SRCU-fast API, which provides somewhat faster readers than those of preemptible RCU, at least on Paul E. McKenney's laptop, where task_struct access is more expensive than access to per-CPU variables. And SRCU-fast provides way faster readers than does SRCU, courtesy of being able to avoid the read-side use of smp_mb(). Also, it is quite straightforward to create srcu_read_{,un}lock_fast_notrace() functions. Link: https://lore.kernel.org/all/20250613152218.1924093-1-bigeasy@linutronix.de/ Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Alexei Starovoitov <ast@kernel.org> Link: https://patch.msgid.link/20260126231256.499701982@kernel.org Co-developed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-30	srcu: Fix warning to permit SRCU-fast readers in NMI handlers	Paul E. McKenney	-1/+2
	SRCU-fast is designed to be used in NMI handlers, even going so far as to use atomic operations for architectures supporting NMIs but not providing NMI-safe per-CPU atomic operations. However, the WARN_ON_ONCE() in __srcu_check_read_flavor() complains if SRCU-fast is used in an NMI handler. This commit therefore modifies that WARN_ON_ONCE() to avoid such complaints. Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Steven Rostedt <rostedt@goodmis.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Boqun Feng <boqun@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: bpf@vger.kernel.org Link: https://patch.msgid.link/8232efe8-a7a3-446c-af0b-19f9b523b4f7@paulmck-laptop Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-30	bpf: Have __bpf_trace_run() use rcu_read_lock_dont_migrate()	Steven Rostedt	-3/+2
	In order to switch the protection of tracepoint callbacks from preempt_disable() to srcu_read_lock_fast() the BPF callback from tracepoints needs to have migration prevention as the BPF programs expect to stay on the same CPU as they execute. Put together the RCU protection with migration prevention and use rcu_read_lock_dont_migrate() in __bpf_trace_run(). This will allow tracepoints callbacks to be preemptible. Link: https://lore.kernel.org/all/CAADnVQKvY026HSFGOsavJppm3-Ajm-VsLzY-OeFUe+BaKMRnDg@mail.gmail.com/ Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Alexei Starovoitov <ast@kernel.org> Link: https://patch.msgid.link/20260126231256.335034877@kernel.org Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-30	Merge branch 'core/entry' into sched/core	Thomas Gleixner	-105/+10
	Pull the entry update to avoid merge conflicts with the time slice extension changes. Signed-off-by: Thomas Gleixner <tglx@kernel.org>
2026-01-30	entry: Inline syscall_exit_work() and syscall_trace_enter()	Jinjie Ruan	-97/+10
	After switching ARM64 to the generic entry code, a syscall_exit_work() appeared as a profiling hotspot because it is not inlined. Inlining both syscall_trace_enter() and syscall_exit_work() provides a performance gain when any of the work items is enabled. With audit enabled this results in a ~4% performance gain for perf bench basic syscall on a kunpeng920 system: \| Metric \| Baseline \| Inlined \| Change \| \| ---------- \| ----------- \| ----------- \| ------ \| \| Total time \| 2.353 [sec] \| 2.264 [sec] \| ↓3.8% \| \| usecs/op \| 0.235374 \| 0.226472 \| ↓3.8% \| \| ops/sec \| 4,248,588 \| 4,415,554 \| ↑3.9% \| Small gains can be observed on x86 as well, though the generated code optimizes for the work case, which is counterproductive for high performance scenarios where such entry/exit work is usually avoided. Avoid this by marking the work check in syscall_enter_from_user_mode_work() unlikely, which is what the corresponding check in the exit path does already. [ tglx: Massage changelog and add the unlikely() ] Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260128031934.3906955-14-ruanjinjie@huawei.com
2026-01-30	entry: Add arch_ptrace_report_syscall_entry/exit()	Jinjie Ruan	-2/+2
	ARM64 requires a architecture specific ptrace wrapper as it needs to save and restore scratch registers. Provide arch_ptrace_report_syscall_entry/exit() wrappers which fall back to ptrace_report_syscall_entry/exit() if the architecture does not provide them. No functional change intended. [ tglx: Massaged changelog and comments ] Suggested-by: Mark Rutland <mark.rutland@arm.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Link: https://patch.msgid.link/20260128031934.3906955-11-ruanjinjie@huawei.com
2026-01-30	entry: Remove unused syscall argument from syscall_trace_enter()	Jinjie Ruan	-3/+2
	The 'syscall' argument of syscall_trace_enter() is immediately overwritten before any real use and serves only as a local variable, so drop the parameter. No functional change intended. Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260128031934.3906955-2-ruanjinjie@huawei.com
2026-01-30	genirq/proc: Replace snprintf with strscpy in register_handler_proc	Thorsten Blum	-1/+2
	Replace snprintf("%s", ...) with the faster and more direct strscpy(). Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260127224949.441391-2-thorsten.blum@linux.dev
2026-01-30	kprobes: Use dedicated kthread for kprobe optimizer	Masami Hiramatsu (Google)	-20/+86
	Instead of using generic workqueue, use a dedicated kthread for optimizing kprobes, because it can wait (sleep) for a long time inside the process by synchronize_rcu_task(). This means other works can be stopped until it finishes. Link: https://lore.kernel.org/all/176970170302.114949.5175231591310436910.stgit@devnote2/ Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2026-01-29	genirq/redirect: Prevent writing MSI message on affinity change	Thomas Gleixner	-1/+1
	The interrupts which are handled by the redirection infrastructure provide a irq_set_affinity() callback, which solely determines the target CPU for redirection via irq_work and und updates the effective affinity mask. Contrary to regular MSI interrupts this affinity setting does not change the underlying interrupt message as the message is only created at setup time to deliver to the demultiplexing interrupt. Therefore the message write in msi_domain_set_affinity() is a pointless exercise. In principle the write is harmless, but a Tegra system exposes a full system hang during suspend due to that write. It's unclear why the check for the PCI device state PCI_D0 in pci_msi_domain_write_msg(), which prevents the actual hardware access if a device is in powered down state, fails on this particular system, but that's a different problem which needs to be investigated by the Tegra experts. The irq_set_affinity() callback can advise msi_domain_set_affinity() not to write the MSI message by returning IRQ_SET_MASK_OK_DONE instead of IRQ_SET_MASK_OK. Do exactly that. Just to make it clear again: This is not a correctness issue of the redirection code as returning IRQ_SET_MASK_OK in that context is completely correct. From the core code point of view this is solely a optimization to avoid an redundant hardware write. As a byproduct it papers over the underlying problem on the Tegra platform, which fails to put the PCIe device[s] out of PCI_D0 despite the fact that the devices and busses have been shut down. The redirect infrastructure just unearthed the underlying issue, which is prone to happen in quite some other code paths which use the PCI_D0 check to prevent hardware access to powered down devices. This therefore has neither a 'Fixes:' nor a 'Closes:' tag associated as the underlying problem, which is outside the scope of the interrupt code, is still unresolved. Reported-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Jon Hunter <jonathanh@nvidia.com> Link: https://lore.kernel.org/all/4e5b349c-6599-4871-9e3b-e10352ae0ca0@nvidia.com Link: https://patch.msgid.link/87tsw6aglz.ffs@tglx
2026-01-29	Merge tag 'mm-hotfixes-stable-2026-01-29-09-41' of ↵	Linus Torvalds	-2/+16
	git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "16 hotfixes. 9 are cc:stable, 12 are for MM. There's a patch series from Pratyush Yadav which fixes a few things in the new-in-6.19 LUO memfd code. Plus the usual shower of singletons - please see the changelogs for details" * tag 'mm-hotfixes-stable-2026-01-29-09-41' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: vmcoreinfo: make hwerr_data visible for debugging mm/zone_device: reinitialize large zone device private folios mm/mm_init: don't cond_resched() in deferred_init_memmap_chunk() if called from deferred_grow_zone() mm/kfence: randomize the freelist on initialization kho: kho_preserve_vmalloc(): don't return 0 when ENOMEM kho: init alloc tags when restoring pages from reserved memory mm: memfd_luo: restore and free memfd_luo_ser on failure mm: memfd_luo: use memfd_alloc_file() instead of shmem_file_setup() memfd: export alloc_file() flex_proportions: make fprop_new_period() hardirq safe mailmap: add entry for Viacheslav Bocharov mm/memory-failure: teach kill_accessing_process to accept hugetlb tail page pfn mm/memory-failure: fix missing ->mf_stats count in hugetlb poison mm, swap: restore swap_space attr aviod kernel panic mm/kasan: fix KASAN poisoning in vrealloc() mm/shmem, swap: fix race of truncate and swap entry split
2026-01-29	prctl: add arch-agnostic prctl()s for indirect branch tracking	Deepak Gupta	-0/+30
	Three architectures (x86, aarch64, riscv) have support for indirect branch tracking feature in a very similar fashion. On a very high level, indirect branch tracking is a CPU feature where CPU tracks branches which use a memory operand to transfer control. As part of this tracking, during an indirect branch, the CPU expects a landing pad instruction on the target PC, and if not found, the CPU raises some fault (architecture-dependent). x86 landing pad instr - 'ENDBRANCH' arch64 landing pad instr - 'BTI' riscv landing instr - 'lpad' Given that three major architectures have support for indirect branch tracking, this patch creates architecture-agnostic 'prctls' to allow userspace to control this feature. They are: - PR_GET_INDIR_BR_LP_STATUS: Get the current configured status for indirect branch tracking. - PR_SET_INDIR_BR_LP_STATUS: Set the configuration for indirect branch tracking. The following status options are allowed: - PR_INDIR_BR_LP_ENABLE: Enables indirect branch tracking on user thread. - PR_INDIR_BR_LP_DISABLE: Disables indirect branch tracking on user thread. - PR_LOCK_INDIR_BR_LP_STATUS: Locks configured status for indirect branch tracking for user thread. Reviewed-by: Mark Brown <broonie@kernel.org> Reviewed-by: Zong Li <zong.li@sifive.com> Signed-off-by: Deepak Gupta <debug@rivosinc.com> Tested-by: Andreas Korb <andreas.korb@aisec.fraunhofer.de> # QEMU, custom CVA6 Tested-by: Valentin Haudiquet <valentin.haudiquet@canonical.com> Link: https://patch.msgid.link/20251112-v5_user_cfi_series-v23-13-b55691eacf4f@rivosinc.com [pjw@kernel.org: cleaned up patch description, code comments] Signed-off-by: Paul Walmsley <pjw@kernel.org>
2026-01-29	dma/pool: distinguish between missing and exhausted atomic pools	Sai Sree Kartheek Adivi	-1/+6
	Currently, dma_alloc_from_pool() unconditionally warns and dumps a stack trace when an allocation fails, with the message "Failed to get suitable pool". This conflates two distinct failure modes: 1. Configuration error: No atomic pool is available for the requested DMA mask (a fundamental system setup issue) 2. Resource Exhaustion: A suitable pool exists but is currently full (a recoverable runtime state) This lack of distinction prevents drivers from using __GFP_NOWARN to suppress error messages during temporary pressure spikes, such as when awaiting synchronous reclaim of descriptors. Refactor the error handling to distinguish these cases: - If no suitable pool is found, keep the unconditional WARN regarding the missing pool. - If a pool was found but is exhausted, respect __GFP_NOWARN and update the warning message to explicitly state "DMA pool exhausted". Fixes: 9420139f516d ("dma-pool: fix coherent pool allocations for IOMMU mappings") Signed-off-by: Sai Sree Kartheek Adivi <s-adivi@ti.com> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260128133554.3056582-1-s-adivi@ti.com
2026-01-28	bpf: Fix verifier_bug_if to account for BPF_CALL	Luis Gerhorst	-6/+8
	The BPF verifier assumes `insn_aux->nospec_result` is only set for direct memory writes (e.g., `(u32)(r1+off) = r2`). However, the assertion fails to account for helper calls (e.g., `bpf_skb_load_bytes_relative`) that perform writes to stack memory. Make the check more precise to resolve this. The problem is that `BPF_CALL` instructions have `BPF_CLASS(insn->code) == BPF_JMP`, which triggers the warning check: - Helpers like `bpf_skb_load_bytes_relative` write to stack memory - `check_helper_call()` loops through `meta.access_size`, calling `check_mem_access(..., BPF_WRITE)` - `check_stack_write()` sets `insn_aux->nospec_result = 1` - Since `BPF_CALL` is encoded as `BPF_JMP \| BPF_CALL`, the warning fires Execution flow: ``` 1. Drop capabilities → Enable Spectre mitigation 2. Load BPF program └─> do_check() ├─> check_cond_jmp_op() → Marks dead branch as speculative │ └─> push_stack(..., speculative=true) ├─> pop_stack() → state->speculative = 1 ├─> check_helper_call() → Processes helper in dead branch │ └─> check_mem_access(..., BPF_WRITE) │ └─> insn_aux->nospec_result = 1 └─> Checks: state->speculative && insn_aux->nospec_result └─> BPF_CLASS(insn->code) == BPF_JMP → WARNING ``` To fix the assert, it would be nice to be able to reuse bpf_insn_successors() here, but bpf_insn_successors()->cnt is not exactly what we want as it may also be 1 for BPF_JA. Instead, we could check opcode_info.can_jump, but then we would have to share the table between the functions. This would mean moving the table out of the function and adding bpf_opcode_info(). As the verifier_bug_if() only runs for insns with nospec_result set, the impact on verification time would likely still be negligible. However, I assume sharing bpf_opcode_info() between liveness.c and verifier.c will not be worth it. It seems as only adjust_jmp_off() could also be simplified using it, and there imm/off is touched. Thus it is maybe better to rely on exact opcode/class matching there. Therefore, to avoid this sharing only for a verifier_bug_if(), just check the opcode. This should now cover all opcodes for which can_jump in bpf_insn_successors() is true. Parts of the description and example are taken from the bug report. Fixes: dadb59104c64 ("bpf: Fix aux usage after do_check_insn()") Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de> Reported-by: Yinhao Hu <dddddd@hust.edu.cn> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn> Reported-by: Dongliang Mu <dzm91@hust.edu.cn> Closes: https://lore.kernel.org/bpf/7678017d-b760-4053-a2d8-a6879b0dbeeb@hust.edu.cn/ Link: https://lore.kernel.org/r/20260127115912.3026761-2-luis.gerhorst@fau.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>