path: root/kernel
2025-08-24  genirq/devres: Add error handling in devm_request_*_irq()  (Pan Chuang; 1 file, -40/+87)
devm_request_threaded_irq() and devm_request_any_context_irq() currently don't print any error message when interrupt registration fails. This forces each driver to implement redundant error logging - over 2,000 lines of error messages exist across drivers. Additionally, when upper-layer functions propagate these errors without logging, critical debugging information is lost. Add a devm_request_result() helper to unify error reporting via dev_err_probe(), and use it in devm_request_threaded_irq() and devm_request_any_context_irq() so that the device name, IRQ number, handler functions, and error code are printed automatically on failure. Co-developed-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250805092922.135500-2-panchuang@vivo.com
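As an illustration only, such a helper could look roughly like this (the helper name comes from the changelog; the exact signature and message format are assumptions, not the merged code):

  /*
   * Hypothetical sketch of the unified error-reporting helper described
   * above; signature and message format are assumptions.
   */
  static int devm_request_result(struct device *dev, unsigned int irq,
                                 irq_handler_t handler, irq_handler_t thread_fn,
                                 int ret)
  {
          if (ret < 0)
                  return dev_err_probe(dev, ret,
                                       "Failed to request IRQ %u (handler %ps/%ps)\n",
                                       irq, handler, thread_fn);
          return ret;
  }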
2025-08-23  genirq: Add irq_chip_(startup/shutdown)_parent()  (Inochi Amaoto; 1 file, -0/+37)
As the MSI controller on SG2044 uses PLIC as the underlying interrupt controller, it needs to call irq_enable() and irq_disable() to start up/shut down interrupts. Otherwise, the MSI interrupt cannot be started up correctly and will not respond to any incoming interrupt. Introduce irq_chip_startup_parent() and irq_chip_shutdown_parent() to allow the interrupt controller to call the irq_startup()/irq_shutdown() callbacks of the parent interrupt chip. In case the irq_startup()/irq_shutdown() callbacks are not implemented for the parent interrupt chip, this will fall back to irq_chip_enable_parent() or irq_chip_disable_parent(). Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Inochi Amaoto <inochiama@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Chen Wang <unicorn_wang@outlook.com> # Pioneerbox Reviewed-by: Chen Wang <unicorn_wang@outlook.com> Link: https://lore.kernel.org/all/20250813232835.43458-2-inochiama@gmail.com Link: https://lore.kernel.org/lkml/20250722224513.22125-1-inochiama@gmail.com/
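A sketch of the described fallback semantics (assumed shape; see the patch link above for the actual implementation):

  unsigned int irq_chip_startup_parent(struct irq_data *data)
  {
          struct irq_data *parent = data->parent_data;

          if (parent->chip->irq_startup)
                  return parent->chip->irq_startup(parent);

          /* No irq_startup() in the parent chip: fall back to plain enable */
          irq_chip_enable_parent(data);
          return 0;
  }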
2025-08-23  genirq: Remove GENERIC_IRQ_LEGACY  (Sebastian Andrzej Siewior; 2 files, -11/+0)
IA64 is gone and with it the last GENERIC_IRQ_LEGACY user. Remove GENERIC_IRQ_LEGACY. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250814165949.hvtP03r4@linutronix.de
2025-08-23  Merge tag 'trace-v6.17-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace  (Linus Torvalds; 6 files, -25/+49)
Pull tracing fixes from Steven Rostedt:

 - Fix rtla and latency tooling pkg-config errors

   If libtraceevent and libtracefs are installed, but their corresponding '.pc' files are not installed, it reports that the libraries are missing and confuses the developer. Instead, report that the pkg-config files are missing and should be installed.

 - Fix overflow bug of the parser in trace_get_user()

   trace_get_user() uses the parsing functions to parse the user space strings. If the parser fails due to incorrect processing, it doesn't terminate the buffer with a nul byte. Add a "failed" flag to the parser that gets set when parsing fails and is used to know if the buffer is fine to use or not.

 - Remove a semicolon that was at the end of a comment line

 - Fix register_ftrace_graph() to unregister the pm notifier on error

   register_ftrace_graph() registers a pm notifier, but there's an error path that can exit the function without unregistering it. Since the function returns an error, it will never be unregistered.

 - Allocate and copy ftrace hash for reader of ftrace filter files

   When the set_ftrace_filter or set_ftrace_notrace files are open for read, an iterator is created and sets its hash pointer to the associated hash that represents filtering or notrace filtering. The issue is that the hash it points to can change while the iteration is happening: all the locking used to access the tracer's hashes is released, which means those hashes can change or even be freed. Using the hash pointed to by the iterator can cause UAF bugs or similar. Have the read of these files allocate and copy the corresponding hashes, as that will keep them the same while the iterator is open. This also simplifies the code, as opening for write already does an allocate and copy; now that the read does the same, there's no need to check which way the file was opened on release, and the iterator hash can always be freed.

 - Fix function graph to copy args into temp storage

   The output of the function graph tracer shows both the entry and the exit of a function. When the exit is right after the entry, it combines the two events into one with the output of "function();", instead of showing:

       function() {
       }

   In order to do this, the iterator descriptor that reads the events includes storage that saves the entry event while it peeks at the next event in the ring buffer. The peek can free the entry event, so the iterator must store the information to use it after the peek. With the addition of the function graph tracer recording the args, where the args are a dynamic array in the entry event, the temp storage does not save them. This causes the args to be corrupted or even causes a read of unsafe memory. Add space to save the args in the temp storage of the iterator.

 - Fix race between ftrace_dump and reading trace_pipe

   ftrace_dump() is used when a crash occurs, where the ftrace buffer will be printed to the console. But it can also be triggered by sysrq-z. If sysrq-z is triggered while a task is reading trace_pipe, it can cause a race in ftrace_dump(): it checks if the buffer has content, then checks if the next event is available, and then prints the output (regardless of whether the next event was available or not). Reading trace_pipe at the same time can cause the event to not be available, and this triggers a WARN_ON in the print.
   Move the printing into the check for whether the next event exists or not.

* tag 'trace-v6.17-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ftrace: Also allocate and copy hash for reading of filter files
  ftrace: Fix potential warning in trace_printk_seq during ftrace_dump
  fgraph: Copy args in intermediate storage with entry
  trace/fgraph: Fix the warning caused by missing unregister notifier
  ring-buffer: Remove redundant semicolons
  tracing: Limit access to parser->buffer when trace_get_user failed
  rtla: Check pkg-config install
  tools/latency-collector: Check pkg-config install
2025-08-22  ftrace: Also allocate and copy hash for reading of filter files  (Steven Rostedt; 1 file, -9/+10)
Currently the reader of set_ftrace_filter and set_ftrace_notrace just adds the pointer to the global tracer hash to its iterator. Unlike the writer, which allocates a copy of the hash, the reader keeps the pointer to the filter hashes. This is problematic because the pointer is held across function calls that release the locks protecting the global tracer hashes, so those hashes can be updated or even freed while the iterator still points at them. This can cause UAF and similar bugs. Allocate and copy the hash for reading the filter files like it is done for the writers. This not only fixes UAF bugs, but also makes the code a bit simpler as it doesn't have to differentiate when to free the iterator's hash between writers and readers. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/20250822183606.12962cc3@batman.local.home Fixes: c20489dad156 ("ftrace: Assign iter->hash to filter or notrace hashes on seq read") Closes: https://lore.kernel.org/all/20250813023044.2121943-1-wutengda@huaweicloud.com/ Closes: https://lore.kernel.org/all/20250822192437.GA458494@ax162/ Reported-by: Tengda Wu <wutengda@huaweicloud.com> Tested-by: Tengda Wu <wutengda@huaweicloud.com> Tested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-08-22  ftrace: Fix potential warning in trace_printk_seq during ftrace_dump  (Tengda Wu; 1 file, -2/+2)
When calling ftrace_dump_one() concurrently with reading trace_pipe, a WARN_ON_ONCE() in trace_printk_seq() can be triggered due to a race condition:

  CPU0 (ftrace_dump)                            CPU1 (reader)

  echo z > /proc/sysrq-trigger
  !trace_empty(&iter)
  trace_iterator_reset(&iter) <- len = size = 0
                                                cat /sys/kernel/tracing/trace_pipe
  trace_find_next_entry_inc(&iter)
    __find_next_entry
      ring_buffer_empty_cpu <- all empty
    return NULL
  trace_printk_seq(&iter.seq)
    WARN_ON_ONCE(s->seq.len >= s->seq.size)

In the window between trace_empty() and trace_find_next_entry_inc() during ftrace_dump, the ring buffer data was consumed by other readers. This caused trace_find_next_entry_inc() to return NULL, failing to populate `iter.seq`. At this point, due to the prior trace_iterator_reset(), both `iter.seq.len` and `iter.seq.size` were set to 0. Since they are equal, the WARN_ON_ONCE condition is triggered.

Move the trace_printk_seq() into the if block that checks that the return value of trace_find_next_entry_inc() is non-NULL in ftrace_dump_one(), ensuring `iter.seq` is properly populated before subsequent operations.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ingo Molnar <mingo@elte.hu>
Link: https://lore.kernel.org/20250822033343.3000289-1-wutengda@huaweicloud.com
Fixes: d769041f8653 ("ring_buffer: implement new locking")
Signed-off-by: Tengda Wu <wutengda@huaweicloud.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
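The shape of the fix, per the description above (abridged from the ftrace_dump_one() loop in kernel/trace/trace.c; surrounding details may differ):

  while (!trace_empty(&iter)) {
          if (trace_find_next_entry_inc(&iter) != NULL) {
                  int ret = print_trace_line(&iter);

                  if (ret != TRACE_TYPE_NO_CONSUME)
                          trace_consume(&iter);

                  trace_printk_seq(&iter.seq);    /* moved inside the check */
          }
          touch_nmi_watchdog();
  }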
2025-08-22  fgraph: Copy args in intermediate storage with entry  (Steven Rostedt; 1 file, -6/+16)
The output of the function graph tracer has two ways to display its entries: one for leaf functions with no events recorded within them, and the other for functions with events recorded inside them. As the function graph tracer has an entry and an exit event, to simplify the output of leaf functions it combines the two, whereas non-leaf functions stay separate:

  2)               |  invoke_rcu_core() {
  2)               |    raise_softirq() {
  2)   0.391 us    |      __raise_softirq_irqoff();
  2)   1.191 us    |    }
  2)   2.086 us    |  }

The __raise_softirq_irqoff() line above is really two events that were merged into one. Otherwise it would have looked like:

  2)               |  invoke_rcu_core() {
  2)               |    raise_softirq() {
  2)               |      __raise_softirq_irqoff() {
  2)   0.391 us    |      }
  2)   1.191 us    |    }
  2)   2.086 us    |  }

In order to do this merge, the reading of the trace output file needs to look at the next event before printing. But since the pointer to the event is in the ring buffer, it needs to save the entry event before it looks at the next event, as the entry goes out of focus as soon as a new event is read from the ring buffer. After it reads the next event, it prints the entry event with either the '{' (non-leaf) or the ';' and timestamps (leaf). The iterator used to read the trace file has storage for this event.

The problem happens when the function graph tracer has arguments attached to the entry event, as the entry now has a variable-length "args" field. This field only gets set when the funcargs option is used. But the args are not recorded in this temp data, and garbage could be printed. The entry field is copied via:

  data->ent = *curr;

where "curr" is the entry field. This method only saves the fixed-size fields of the structure.

Add a helper structure to the iterator data that adds the max args size to the data storage in the iterator. Then simply copy the entire entry into this storage (with size protection).

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/20250820195522.51d4a268@gandalf.local.home
Reported-by: Sasha Levin <sashal@kernel.org>
Tested-by: Sasha Levin <sashal@kernel.org>
Closes: https://lore.kernel.org/all/aJaxRVKverIjF4a6@lappy/
Fixes: ff5c9c576e75 ("ftrace: Add support for function argument to graph tracer")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-08-22  bpf: Remove preempt_disable in bpf_try_get_buffers  (Tao Chen; 1 file, -3/+0)
BPF programs now run with migration disabled, so it is safe to access this_cpu_inc_return(bpf_bprintf_nest_level) without an explicit preempt_disable()/preempt_enable() pair. Fixes: d9c9e4db186a ("bpf: Factorize bpf_trace_printk and bpf_seq_printf") Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250819125638.2544715-1-chen.dylane@linux.dev
2025-08-22  bpf: Use sha1() instead of sha1_transform() in bpf_prog_calc_tag()  (Eric Biggers; 1 file, -41/+9)
Now that there's a proper SHA-1 library API, just use that instead of the low-level SHA-1 compression function. This eliminates the need for bpf_prog_calc_tag() to implement the SHA-1 padding itself. No functional change; the computed tags remain the same. Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20250811201615.564461-1-ebiggers@kernel.org
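Illustratively, the tag computation collapses to a one-shot library call (a sketch only; the bpf-specific details such as insn filtering and tag truncation are omitted here):

  #include <crypto/sha1.h>

  /* The library API handles SHA-1 padding internally, replacing the
   * manual sha1_transform() block loop. */
  static void prog_digest(const u8 *data, u32 size, u8 out[SHA1_DIGEST_SIZE])
  {
          sha1(data, size, out);
  }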
2025-08-22  cgroup: cgroup.stat.local time accounting  (Tiffany Yang; 2 files, -4/+40)
There isn't yet a clear way to identify a set of "lost" time that everyone (or at least a wider group of users) cares about. However, users can perform some delay accounting by iterating over components of interest. This patch allows cgroup v2 freezing time to be one of those components. Track the cumulative time that each v2 cgroup spends freezing and expose it to userland via a new local stat file in cgroupfs. Thank you to Michal, who provided the ASCII art in the updated documentation.

To access this value:

  $ mkdir /sys/fs/cgroup/test
  $ cat /sys/fs/cgroup/test/cgroup.stat.local
  freeze_time_total 0

Ensure consistent freeze time reads with freeze_seq, a per-cgroup sequence counter. Writes are serialized using the css_set_lock.

Signed-off-by: Tiffany Yang <ynaffit@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
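A minimal sketch of a consistent read using such a per-cgroup sequence counter (field names here are assumptions, not the merged code):

  /* Retry the read until no freeze-state write raced with it. */
  static u64 cgroup_read_freeze_time(struct cgroup *cgrp)
  {
          unsigned int seq;
          u64 t;

          do {
                  seq = read_seqcount_begin(&cgrp->freezer.freeze_seq);
                  t = cgrp->freezer.freeze_time_total;    /* assumed field */
          } while (read_seqcount_retry(&cgrp->freezer.freeze_seq, seq));

          return t;
  }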
2025-08-22  cgroup/psi: Set of->priv to NULL upon file release  (Chen Ridong; 1 file, -0/+1)
Setting of->priv to NULL when the file is released enables earlier bug detection. This allows potential bugs to manifest as NULL pointer dereferences rather than use-after-free errors[1], which are generally more difficult to diagnose. [1] https://lore.kernel.org/cgroups/38ef3ff9-b380-44f0-9315-8b3714b0948d@huaweicloud.com/T/#m8a3b3f88f0ff3da5925d342e90043394f8b2091b Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-22  cgroup: split cgroup_destroy_wq into 3 workqueues  (Chen Ridong; 1 file, -7/+36)
A hung task can occur during LTP cgroup testing [1] when repeatedly mounting/unmounting perf_event and net_prio controllers with systemd.unified_cgroup_hierarchy=1. The hang manifests in cgroup_lock_and_drain_offline() during root destruction.

Related cases:
  cgroup_fj_function_perf_event  cgroup_fj_function.sh perf_event
  cgroup_fj_function_net_prio    cgroup_fj_function.sh net_prio

Call Trace:
  cgroup_lock_and_drain_offline+0x14c/0x1e8
  cgroup_destroy_root+0x3c/0x2c0
  css_free_rwork_fn+0x248/0x338
  process_one_work+0x16c/0x3b8
  worker_thread+0x22c/0x3b0
  kthread+0xec/0x100
  ret_from_fork+0x10/0x20

Root Cause:

  CPU0                                  CPU1
  mount perf_event                      umount net_prio
  cgroup1_get_tree                      cgroup_kill_sb
  rebind_subsystems                     // root destruction enqueues
                                        // cgroup_destroy_wq
  // kill all perf_event css
  // one perf_event css A is dying
  // css A offline enqueues cgroup_destroy_wq
  // root destruction will be executed first
                                        css_free_rwork_fn
                                          cgroup_destroy_root
                                            cgroup_lock_and_drain_offline
                                            // some perf descendants are dying
                                            // cgroup_destroy_wq max_active = 1
                                            // waiting for css A to die

Problem scenario:
  1. CPU0 mounts perf_event (rebind_subsystems)
  2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
  3. A dying perf_event CSS gets queued for offline after root destruction
  4. Root destruction waits for offline completion, but offline work is blocked behind root destruction in cgroup_destroy_wq (max_active=1)

Solution: split cgroup_destroy_wq into three dedicated workqueues:
  cgroup_offline_wq - handles CSS offline operations
  cgroup_release_wq - manages resource release
  cgroup_free_wq    - performs final memory deallocation

This separation eliminates blocking in the CSS free path while waiting for offline operations to complete.

[1] https://github.com/linux-test-project/ltp/blob/master/runtest/controllers

Fixes: 334c3679ec4b ("cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends")
Reported-by: Gao Yingjie <gaoyingjie@uniontech.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
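A sketch of the split described above (workqueue names per the changelog; the allocation flags are assumptions):

  static struct workqueue_struct *cgroup_offline_wq;
  static struct workqueue_struct *cgroup_release_wq;
  static struct workqueue_struct *cgroup_free_wq;

  /* Three ordered queues so the free/release path can no longer queue
   * behind a blocked offline work item on a single-threaded queue. */
  static int __init cgroup_wq_init(void)
  {
          cgroup_offline_wq = alloc_ordered_workqueue("cgroup_offline", 0);
          cgroup_release_wq = alloc_ordered_workqueue("cgroup_release", 0);
          cgroup_free_wq = alloc_ordered_workqueue("cgroup_free", 0);
          BUG_ON(!cgroup_offline_wq || !cgroup_release_wq || !cgroup_free_wq);
          return 0;
  }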
2025-08-22  bpf: Use tnums for JEQ/JNE is_branch_taken logic  (Paul Chaignon; 2 files, -0/+12)
In the following toy program (reg states minimized for readability), R0 and R1 always have different values at instruction 6. This is obvious when reading the program but cannot be guessed from ranges alone, as they overlap (R0 in [0; 0xc0000000], R1 in [1024; 0xc0000400]).

  0: call bpf_get_prandom_u32#7  ; R0_w=scalar()
  1: w0 = w0                     ; R0_w=scalar(var_off=(0x0; 0xffffffff))
  2: r0 >>= 30                   ; R0_w=scalar(var_off=(0x0; 0x3))
  3: r0 <<= 30                   ; R0_w=scalar(var_off=(0x0; 0xc0000000))
  4: r1 = r0                     ; R1_w=scalar(var_off=(0x0; 0xc0000000))
  5: r1 += 1024                  ; R1_w=scalar(var_off=(0x400; 0xc0000000))
  6: if r1 != r0 goto pc+1

Looking at tnums, however, we can deduce that R1 is always different from R0 because their tnums don't agree on known bits. This patch uses this logic to improve is_scalar_branch_taken() in the case of BPF_JEQ and BPF_JNE.

This change has a tiny impact on complexity, which was measured with the Cilium complexity CI test. That test covers 72 programs with various build- and load-time configurations, for a total of 970 test cases. For 80% of test cases, the patch has no impact. On the other test cases, the patch decreases complexity by only 0.08% on average. In the best case, the verifier needs to walk 3% fewer instructions and, in the worst case, 1.5% more. Overall, the patch has a small positive impact, especially for our largest programs.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/be3ee70b6e489c49881cb1646114b1d861b5c334.1755694147.git.paul.chaignon@gmail.com
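The tnum reasoning above boils down to a simple bit check; as an illustration (the helper name is ours, not the verifier's):

  #include <linux/tnum.h>

  /* If two scalars disagree on any bit that is known in both tnums,
   * they can never be equal: JEQ is never taken, JNE always is. */
  static bool tnums_definitely_unequal(struct tnum a, struct tnum b)
  {
          u64 known_in_both = ~a.mask & ~b.mask;  /* mask bit set = unknown */

          return (a.value ^ b.value) & known_in_both;
  }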
2025-08-22  refscale: Use kcalloc() instead of kzalloc()  (Qianfeng Rong; 1 file, -1/+1)
Use kcalloc() in main_func() to gain built-in overflow protection, making memory allocation safer when calculating allocation size compared to explicit multiplication. Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2025-08-22  rcutorture: Use kcalloc() instead of kzalloc()  (Qianfeng Rong; 1 file, -2/+2)
Use kcalloc() in rcu_torture_writer() to gain built-in overflow protection, making memory allocation safer when calculating allocation size compared to explicit multiplication. Change sizeof(ulo[0]) and sizeof(rgo[0]) to sizeof(*ulo) and sizeof(*rgo), as this is more consistent with coding conventions. Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
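Both kcalloc() conversions above follow the same pattern; a generic before/after (illustrative names, not the exact call sites):

  /* Before: open-coded multiplication can overflow silently. */
  buf = kzalloc(n * sizeof(*buf), GFP_KERNEL);

  /* After: kcalloc() returns NULL if n * sizeof(*buf) would overflow. */
  buf = kcalloc(n, sizeof(*buf), GFP_KERNEL);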
2025-08-22  Merge tag 'mm-hotfixes-stable-2025-08-21-18-17' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm  (Linus Torvalds; 2 files, -4/+26)
Pull misc fixes from Andrew Morton:

 "20 hotfixes. 10 are cc:stable and the remainder address post-6.16 issues or aren't considered necessary for -stable kernels. 17 of these fixes are for MM.

  As usual, singletons all over the place, apart from a three-patch series of KHO followup work from Pasha which is actually also a bunch of singletons"

* tag 'mm-hotfixes-stable-2025-08-21-18-17' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm/mremap: fix WARN with uffd that has remap events disabled
  mm/damon/sysfs-schemes: put damos dests dir after removing its files
  mm/migrate: fix NULL movable_ops if CONFIG_ZSMALLOC=m
  mm/damon/core: fix damos_commit_filter not changing allow
  mm/memory-failure: fix infinite UCE for VM_PFNMAP pfn
  MAINTAINERS: mark MGLRU as maintained
  mm: rust: add page.rs to MEMORY MANAGEMENT - RUST
  iov_iter: iterate_folioq: fix handling of offset >= folio size
  selftests/damon: fix selftests by installing drgn related script
  .mailmap: add entry for Easwar Hariharan
  selftests/mm: add test for invalid multi VMA operations
  mm/mremap: catch invalid multi VMA moves earlier
  mm/mremap: allow multi-VMA move when filesystem uses thp_get_unmapped_area
  mm/damon/core: fix commit_ops_filters by using correct nth function
  tools/testing: add linux/args.h header and fix radix, VMA tests
  mm/debug_vm_pgtable: clear page table entries at destroy_args()
  squashfs: fix memory leak in squashfs_fill_super
  kho: warn if KHO is disabled due to an error
  kho: mm: don't allow deferred struct page with KHO
  kho: init new_physxa->phys_bits to fix lockdep
2025-08-22  padata: Reset next CPU when reorder sequence wraps around  (Xiao Liang; 1 file, -1/+5)
When seq_nr wraps around, the next reorder job with seq 0 is hashed to the first CPU in padata_do_serial(). Correspondingly, pd->cpu needs to be reset to the first CPU when pd->processed wraps around. Otherwise, if the number of used CPUs is not a power of 2, padata_find_next() will be checking a wrong list, hence a deadlock. Fixes: 6fc4dbcf0276 ("padata: Replace delayed timer with immediate workqueue in padata_reorder") Cc: <stable@vger.kernel.org> Signed-off-by: Xiao Liang <shaw.leon@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
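The shape of the fix, as a sketch (exact placement in kernel/padata.c is assumed):

  pd->processed++;
  if (pd->processed == 0) {
          /* seq_nr wrapped: the job with seq 0 was hashed to the first
           * CPU in padata_do_serial(), so restart the reorder scan there
           * as well instead of continuing round-robin. */
          pd->cpu = cpumask_first(pd->cpumask.pcpu);
  }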
2025-08-21  Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  (Jakub Kicinski; 3 files, -2/+26)
Martin KaFai Lau says:

====================
pull-request: bpf-next 2025-08-21

We've added 9 non-merge commits during the last 3 day(s) which contain a total of 13 files changed, 1027 insertions(+), 27 deletions(-).

The main changes are:

1) Added bpf dynptr support for accessing the metadata of a skb, from Jakub Sitnicki. The patches are merged from a stable branch bpf-next/skb-meta-dynptr. The same patches have also been merged into bpf-next/master.
====================

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  selftests/bpf: Cover metadata access from a modified skb clone
  selftests/bpf: Cover read/write to skb metadata at an offset
  selftests/bpf: Cover write access to skb metadata via dynptr
  selftests/bpf: Cover read access to skb metadata via dynptr
  selftests/bpf: Parametrize test_xdp_context_tuntap
  selftests/bpf: Pass just bpf_map to xdp_context_test helper
  selftests/bpf: Cover verifier checks for skb_meta dynptr type
  bpf: Enable read/write access to skb metadata through a dynptr
  bpf: Add dynptr type for skb metadata

Link: https://patch.msgid.link/20250821191827.2099022-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21  Merge tag 'cgroup-for-6.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup  (Linus Torvalds; 2 files, -6/+8)
Pull cgroup fixes from Tejun Heo:

 - Fix NULL de-ref in css_rstat_exit() which could happen after allocation failure

 - Fix a cpuset partition handling bug and a couple other misc issues

 - Doc spelling fix

* tag 'cgroup-for-6.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  docs: cgroup: fixed spelling mistakes in documentation
  cgroup: avoid null de-ref in css_rstat_exit()
  cgroup/cpuset: Remove the unnecessary css_get/put() in cpuset_partition_write()
  cgroup/cpuset: Fix a partition error with CPU hotplug
  cgroup/cpuset: Use static_branch_enable_cpuslocked() on cpusets_insane_config_key
2025-08-21  Merge tag 'sched_ext-for-6.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext  (Linus Torvalds; 1 file, -0/+4)
Pull sched_ext fixes from Tejun Heo:

 - Fix a subtle bug during SCX enabling where a dead task skips init but doesn't skip sched class switch, leading to an invalid task state transition warning

 - Cosmetic fix in selftests

* tag 'sched_ext-for-6.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  selftests/sched_ext: Remove duplicate sched.h header
  sched/ext: Fix invalid task state transitions on class switch
2025-08-21  uprobes: Remove redundant __GFP_NOWARN  (Qianfeng Rong; 1 file, -1/+1)
Commit 16f5dfbc851b ("gfp: include __GFP_NOWARN in GFP_NOWAIT") made GFP_NOWAIT implicitly include __GFP_NOWARN. Therefore, explicit __GFP_NOWARN combined with GFP_NOWAIT (e.g., `GFP_NOWAIT | __GFP_NOWARN`) is now redundant. Let's clean up these redundant flags across subsystems. No functional changes. Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250805025000.346647-1-rongqianfeng@vivo.com
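The pattern, generically (not the exact uprobes call site):

  /* Before: __GFP_NOWARN is spelled out redundantly. */
  ptr = kzalloc(size, GFP_NOWAIT | __GFP_NOWARN);

  /* After: equivalent, since GFP_NOWAIT already includes __GFP_NOWARN
   * as of commit 16f5dfbc851b. */
  ptr = kzalloc(size, GFP_NOWAIT);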
2025-08-21  seccomp: passthrough uprobe systemcall without filtering  (Jiri Olsa; 1 file, -7/+25)
Adding uprobe as another exception to the seccomp filter, alongside the uretprobe syscall. Same as uretprobe, the uprobe syscall is installed by the kernel as a replacement for the breakpoint exception; it is limited to the x86_64 arch and isn't expected to ever be supported in i386. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kees Cook <kees@kernel.org> Link: https://lore.kernel.org/r/20250720112133.244369-21-jolsa@kernel.org
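A sketch of the filter bypass, mirroring the existing uretprobe exception (the helper name is an assumption; __NR_uprobe is the syscall number added by this series):

  #ifdef CONFIG_X86_64
  /* These syscalls are kernel-installed breakpoint replacements, not
   * application-chosen syscalls, so they bypass seccomp filtering. */
  static inline bool seccomp_uprobe_exception(const struct seccomp_data *sd)
  {
          return sd->nr == __NR_uretprobe || sd->nr == __NR_uprobe;
  }
  #endif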
2025-08-21  uprobes/x86: Add support to optimize uprobes  (Jiri Olsa; 1 file, -4/+12)
Putting together all the previously added pieces to support optimized uprobes on top of the 5-byte nop instruction.

The current uprobe execution goes through the following:
  - install the breakpoint instruction over the original instruction
  - exception handler hit, calling the related uprobe consumers
  - either simulate the original instruction or do out-of-line single-step execution of it
  - return to user space

The optimized uprobe path does the following:
  - check that the original instruction is a 5-byte nop (plus other checks)
  - add (or use an existing) user space trampoline with the uprobe syscall
  - overwrite the original instruction (5-byte nop) with a call to the user space trampoline
  - the user space trampoline executes the uprobe syscall that calls the related uprobe consumers
  - the trampoline returns back to the next instruction

This approach won't speed up all uprobes, as it's limited to using nop5 as the original instruction, but we plan to use nop5 as the USDT probe instruction (which currently uses a single-byte nop) and speed up the USDT probes.

The arch_uprobe_optimize function triggers the uprobe optimization and is called after the first uprobe hit. I originally had it called on uprobe installation, but then it clashed with the elf loader, because the user space trampoline was added in a place where the loader might need to put elf segments, so I decided to do it after the first uprobe hit when loading is done. The uprobe is un-optimized in the arch-specific set_orig_insn call.

The instruction overwrite is x86 arch specific and needs to go through 3 updates (on top of the nop5 instruction):
  - write int3 into the 1st byte
  - write the last 4 bytes of the call instruction
  - update the call instruction opcode

And cleanup goes through similar reverse stages:
  - overwrite the call opcode with the breakpoint (int3)
  - write the last 4 bytes of the nop5 instruction
  - write the nop5 first instruction byte

We do not unmap and release the uprobe trampoline when it's no longer needed, because there's no easy way to make sure none of the threads is still inside the trampoline. But we do not waste memory, because there's just a single page for all the uprobe trampoline mappings. We do waste a frame on the page mapping for every 4GB by keeping the uprobe trampoline page mapped, but that seems ok.

We take benefit from the fact that set_swbp and set_orig_insn are called under mmap_write_lock(mm), so we can use the current instruction as the state the uprobe is in - nop5/breakpoint/call trampoline - and decide the needed action (optimize/un-optimize) based on that.

Attaching the speed up from the benchs/run_bench_uprobes.sh script:

  current:
      usermode-count :  152.604 ± 0.044M/s
      syscall-count  :   13.359 ± 0.042M/s
  --> uprobe-nop     :    3.229 ± 0.002M/s
      uprobe-push    :    3.086 ± 0.004M/s
      uprobe-ret     :    1.114 ± 0.004M/s
      uprobe-nop5    :    1.121 ± 0.005M/s
      uretprobe-nop  :    2.145 ± 0.002M/s
      uretprobe-push :    2.070 ± 0.001M/s
      uretprobe-ret  :    0.931 ± 0.001M/s
      uretprobe-nop5 :    0.957 ± 0.001M/s

  after the change:
      usermode-count :  152.448 ± 0.244M/s
      syscall-count  :   14.321 ± 0.059M/s
      uprobe-nop     :    3.148 ± 0.007M/s
      uprobe-push    :    2.976 ± 0.004M/s
      uprobe-ret     :    1.068 ± 0.003M/s
  --> uprobe-nop5    :    7.038 ± 0.007M/s
      uretprobe-nop  :    2.109 ± 0.004M/s
      uretprobe-push :    2.035 ± 0.001M/s
      uretprobe-ret  :    0.908 ± 0.001M/s
      uretprobe-nop5 :    3.377 ± 0.009M/s

I see a bit more speed up on Intel (above) compared to AMD. The big nop5 speed up is partly due to emulating nop5 and partly due to the optimization.
The key speed up we do this for is the USDT switch from nop to nop5:

  uprobe-nop  : 3.148 ± 0.007M/s
  uprobe-nop5 : 7.038 ± 0.007M/s

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20250720112133.244369-11-jolsa@kernel.org
2025-08-21  uprobes/x86: Add uprobe syscall to speed up uprobe  (Jiri Olsa; 2 files, -0/+18)
Adding new uprobe syscall that calls uprobe handlers for a given 'breakpoint' address. The idea is that the 'breakpoint' address calls the user space trampoline, which executes the uprobe syscall. The syscall handler reads the return address of the initial call to retrieve the original 'breakpoint' address. With this address we find the related uprobe object and call its consumers. Adding the arch_uprobe_trampoline_mapping function that provides the uprobe trampoline mapping. This mapping is backed with one global page initialized at __init time and shared by all the mapping instances. We do not allow the uprobe syscall to be executed if the caller is not from the uprobe trampoline mapping. The uprobe syscall ensures the consumer (bpf program) sees register values in the state before the trampoline was called. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20250720112133.244369-10-jolsa@kernel.org
2025-08-21  uprobes/x86: Add mapping for optimized uprobe trampolines  (Jiri Olsa; 2 files, -0/+11)
Adding support to add a special mapping for the user space trampoline with the following functions:

  uprobe_trampoline_get - find or add uprobe_trampoline
  uprobe_trampoline_put - remove or destroy uprobe_trampoline

The user space trampoline is exported as an arch-specific user space special mapping through tramp_mapping, which is initialized in following changes with the new uprobe syscall. The uprobe trampoline needs to be callable/reachable from the probed address, so while searching for an available address we use the is_reachable_by_call function to decide if the uprobe trampoline is callable from the probe address. All uprobe_trampoline objects are stored in the uprobes_state object and are cleaned up when the process mm_struct goes down. Adding new arch hooks for that, because this change is x86_64 specific. Locking is provided by callers in following changes.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20250720112133.244369-9-jolsa@kernel.org
2025-08-21  uprobes: Add do_ref_ctr argument to uprobe_write function  (Jiri Olsa; 1 file, -4/+4)
Making update_ref_ctr call in uprobe_write conditional based on do_ref_ctr argument. This way we can use uprobe_write for instruction update without doing ref_ctr_offset update. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250720112133.244369-8-jolsa@kernel.org
2025-08-21  uprobes: Add is_register argument to uprobe_write and uprobe_write_opcode  (Jiri Olsa; 1 file, -10/+11)
uprobe_write has a special path to restore the original page when we write the original instruction back. This happens when uprobe_write detects that we want to write anything other than the breakpoint instruction. Moving the detection away and passing it to uprobe_write as an argument, so it's possible to write different instructions (other than just the breakpoint and the original). Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250720112133.244369-7-jolsa@kernel.org
2025-08-21  uprobes: Add nbytes argument to uprobe_write  (Jiri Olsa; 1 file, -12/+14)
Adding nbytes argument to uprobe_write and related functions as preparation for writing whole instructions in following changes. Also renaming opcode arguments to insn, which seems to fit better. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250720112133.244369-6-jolsa@kernel.org
2025-08-21  uprobes: Add uprobe_write function  (Jiri Olsa; 1 file, -4/+10)
Adding uprobe_write function that does what uprobe_write_opcode did so far, but allows passing a verify callback function that checks the memory location before writing the opcode. It will be used in following changes to implement specific checking logic for instruction update. uprobe_write_opcode now calls uprobe_write with verify_opcode as the verify callback. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250720112133.244369-5-jolsa@kernel.org
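A sketch of the resulting delegation (parameter list abridged and partly assumed; verify_opcode is the existing check in kernel/events/uprobes.c):

  /* uprobe_write_opcode() keeps its old behavior by delegating to the
   * new uprobe_write() with verify_opcode() as the verify callback. */
  int uprobe_write_opcode(struct vm_area_struct *vma, unsigned long vaddr,
                          uprobe_opcode_t opcode)
  {
          return uprobe_write(vma, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE,
                              verify_opcode);
  }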
2025-08-21  uprobes: Make copy_from_page global  (Jiri Olsa; 1 file, -5/+5)
Making copy_from_page global and adding uprobe prefix. Adding the uprobe prefix to copy_to_page as well for symmetry. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250720112133.244369-4-jolsa@kernel.org
2025-08-21  uprobes: Rename arch_uretprobe_trampoline function  (Jiri Olsa; 1 file, -2/+2)
We are about to add uprobe trampoline, so cleaning up the namespace. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250720112133.244369-3-jolsa@kernel.org
2025-08-21  uprobes: Remove breakpoint in unapply_uprobe under mmap_write_lock  (Jiri Olsa; 1 file, -3/+3)
Currently unapply_uprobe takes mmap_read_lock, but it might call remove_breakpoint, which eventually changes user pages. The current code writes either the breakpoint or the original instruction, so it can get away with the read lock, as explained in [1]. But with the upcoming change that writes multiple instructions on the probed address, we need to ensure that any update to the mm's pages is exclusive. [1] https://lore.kernel.org/all/20240710140045.GA1084@redhat.com/ Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250720112133.244369-2-jolsa@kernel.org
2025-08-20  Merge tag 'probes-fixes-v6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace  (Linus Torvalds; 1 file, -1/+1)
Pull probes fix from Masami Hiramatsu:

 - Sanitize wildcard for fprobe event name

   Fprobe event accepts wildcards for the target functions, but unless the user specifies its event name, it makes an event with the wildcards. Replace the wildcard '*' with the underscore '_'.

* tag 'probes-fixes-v6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: fprobe-event: Sanitize wildcard for fprobe event name
2025-08-20  tracing: fprobe-event: Sanitize wildcard for fprobe event name  (Masami Hiramatsu (Google); 1 file, -1/+1)
Fprobe events accept wildcards for the target functions, but unless the user specifies an event name, an event is created with the wildcard in its name:

  /sys/kernel/tracing # echo 'f mutex*' >> dynamic_events
  /sys/kernel/tracing # cat dynamic_events
  f:fprobes/mutex*__entry mutex*
  /sys/kernel/tracing # ls events/fprobes/
  enable  filter  mutex*__entry

To fix this, replace the wildcard ('*') with an underscore.

Link: https://lore.kernel.org/all/175535345114.282990.12294108192847938710.stgit@devnote2/
Fixes: 334e5519c375 ("tracing/probes: Add fprobe events for tracing function entry and exit.")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
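The sanitization itself is a simple scan over the auto-generated name; a sketch (not necessarily the merged helper):

  /* Replace wildcard characters so "mutex*" yields an event named
   * "mutex___entry" instead of the invalid "mutex*__entry". */
  static void sanitize_event_name(char *name)
  {
          for (; *name; name++)
                  if (*name == '*')
                          *name = '_';
  }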
2025-08-20  trace/fgraph: Fix the warning caused by missing unregister notifier  (Ye Weihua; 1 file, -0/+1)
This warning was triggered during testing on v6.16:

  notifier callback ftrace_suspend_notifier_call already registered
  WARNING: CPU: 2 PID: 86 at kernel/notifier.c:23 notifier_chain_register+0x44/0xb0
  ...
  Call Trace:
   <TASK>
   blocking_notifier_chain_register+0x34/0x60
   register_ftrace_graph+0x330/0x410
   ftrace_profile_write+0x1e9/0x340
   vfs_write+0xf8/0x420
   ? filp_flush+0x8a/0xa0
   ? filp_close+0x1f/0x30
   ? do_dup2+0xaf/0x160
   ksys_write+0x65/0xe0
   do_syscall_64+0xa4/0x260
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

When writing to the function_profile_enabled interface, the notifier was not unregistered after start_graph_tracing failed, causing a warning the next time function_profile_enabled was written. Fix this by adding unregister_pm_notifier to the error path.

Link: https://lore.kernel.org/20250818073332.3890629-1-yeweihua4@huawei.com
Fixes: 4a2b8dda3f870 ("tracing/function-graph-tracer: fix a regression while suspend to disk")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Ye Weihua <yeweihua4@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
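The shape of the fix in register_ftrace_graph(), abridged:

  register_pm_notifier(&ftrace_suspend_notifier);

  ret = start_graph_tracing();
  if (ret) {
          /* undo the registration above, otherwise the next write to
           * function_profile_enabled re-registers the same notifier */
          unregister_pm_notifier(&ftrace_suspend_notifier);
          goto out;
  }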
2025-08-20  ring-buffer: Remove redundant semicolons  (Liao Yuanhong; 1 file, -1/+1)
Remove unnecessary semicolons. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250813095114.559530-1-liaoyuanhong@vivo.com Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-08-20  tracing: Limit access to parser->buffer when trace_get_user failed  (Pu Lehui; 2 files, -7/+19)
When the length of the string written to set_ftrace_filter exceeds FTRACE_BUFF_MAX, the following KASAN report is triggered:

  BUG: KASAN: slab-out-of-bounds in strsep+0x18c/0x1b0
  Read of size 1 at addr ffff0000d00bd5ba by task ash/165

  CPU: 1 UID: 0 PID: 165 Comm: ash Not tainted 6.16.0-g6bcdbd62bd56-dirty
  Hardware name: linux,dummy-virt (DT)
  Call trace:
   show_stack+0x34/0x50 (C)
   dump_stack_lvl+0xa0/0x158
   print_address_description.constprop.0+0x88/0x398
   print_report+0xb0/0x280
   kasan_report+0xa4/0xf0
   __asan_report_load1_noabort+0x20/0x30
   strsep+0x18c/0x1b0
   ftrace_process_regex.isra.0+0x100/0x2d8
   ftrace_regex_release+0x484/0x618
   __fput+0x364/0xa58
   ____fput+0x28/0x40
   task_work_run+0x154/0x278
   do_notify_resume+0x1f0/0x220
   el0_svc+0xec/0xf0
   el0t_64_sync_handler+0xa0/0xe8
   el0t_64_sync+0x1ac/0x1b0

The reason is that trace_get_user fails when processing a string longer than FTRACE_BUFF_MAX, but does not set the end of parser->buffer to 0. An OOB access is then triggered in ftrace_regex_release->ftrace_process_regex->strsep->strpbrk. We can solve this problem by limiting access to parser->buffer when trace_get_user has failed.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250813040232.1344527-1-pulehui@huaweicloud.com
Fixes: 8c9af478c06b ("ftrace: Handle commands when closing set_ftrace_filter file")
Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
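Combined with the "failed" flag mentioned in the pull request above, the gate can be as simple as this sketch (the flag's field name is an assumption):

  /* Refuse to expose a buffer that was never nul-terminated because
   * parsing failed partway through. */
  static inline bool trace_parser_loaded(struct trace_parser *parser)
  {
          return !parser->fail && parser->idx != 0;
  }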
2025-08-19  kho: warn if KHO is disabled due to an error  (Pasha Tatashin; 1 file, -0/+1)
During boot, the scratch area is allocated based on command line parameters or auto-calculated. However, the scratch area may fail to allocate, and in that case KHO is disabled. Currently, no warning is printed that KHO is disabled, which makes it confusing for the end user to figure out why KHO is not available. Add the missing warning message. Link: https://lkml.kernel.org/r/20250808201804.772010-4-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baoquan He <bhe@redhat.com> Cc: Changyuan Lyu <changyuanl@google.com> Cc: Coiby Xu <coxu@redhat.com> Cc: Dave Vasilevsky <dave@vasilevsky.ca> Cc: Eric Biggers <ebiggers@google.com> Cc: Kees Cook <kees@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-19  kho: mm: don't allow deferred struct page with KHO  (Pasha Tatashin; 1 file, -0/+1)
KHO uses struct pages for the preserved memory early in boot; however, with deferred struct page initialization, only a small portion of memory has properly initialized struct pages. The problem was detected when vmemmap was found poisoned and illegal flag combinations were reported. Don't allow the two features to be enabled together; later we will have to teach KHO to work properly with the deferred struct page init kernel feature. Link: https://lkml.kernel.org/r/20250808201804.772010-3-pasha.tatashin@soleen.com Fixes: 4e1d010e3bda ("kexec: add config option for KHO") Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baoquan He <bhe@redhat.com> Cc: Changyuan Lyu <changyuanl@google.com> Cc: Coiby Xu <coxu@redhat.com> Cc: Dave Vasilevsky <dave@vasilevsky.ca> Cc: Eric Biggers <ebiggers@google.com> Cc: Kees Cook <kees@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-19  kho: init new_physxa->phys_bits to fix lockdep  (Pasha Tatashin; 1 file, -4/+24)
Patch series "Several KHO Hotfixes". Three unrelated fixes for Kexec Handover. This patch (of 3): Lockdep shows the following warning: INFO: trying to register non-static key. The code is fine but needs lockdep annotation, or maybe you didn't initialize this object before use? turning off the locking correctness validator. [<ffffffff810133a6>] dump_stack_lvl+0x66/0xa0 [<ffffffff8136012c>] assign_lock_key+0x10c/0x120 [<ffffffff81358bb4>] register_lock_class+0xf4/0x2f0 [<ffffffff813597ff>] __lock_acquire+0x7f/0x2c40 [<ffffffff81360cb0>] ? __pfx_hlock_conflict+0x10/0x10 [<ffffffff811707be>] ? native_flush_tlb_global+0x8e/0xa0 [<ffffffff8117096e>] ? __flush_tlb_all+0x4e/0xa0 [<ffffffff81172fc2>] ? __kernel_map_pages+0x112/0x140 [<ffffffff813ec327>] ? xa_load_or_alloc+0x67/0xe0 [<ffffffff81359556>] lock_acquire+0xe6/0x280 [<ffffffff813ec327>] ? xa_load_or_alloc+0x67/0xe0 [<ffffffff8100b9e0>] _raw_spin_lock+0x30/0x40 [<ffffffff813ec327>] ? xa_load_or_alloc+0x67/0xe0 [<ffffffff813ec327>] xa_load_or_alloc+0x67/0xe0 [<ffffffff813eb4c0>] kho_preserve_folio+0x90/0x100 [<ffffffff813ebb7f>] __kho_finalize+0xcf/0x400 [<ffffffff813ebef4>] kho_finalize+0x34/0x70 This is becase xa has its own lock, that is not initialized in xa_load_or_alloc. Modifiy __kho_preserve_order(), to properly call xa_init(&new_physxa->phys_bits); Link: https://lkml.kernel.org/r/20250808201804.772010-2-pasha.tatashin@soleen.com Fixes: fc33e4b44b27 ("kexec: enable KHO support for memory preservation") Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baoquan He <bhe@redhat.com> Cc: Changyuan Lyu <changyuanl@google.com> Cc: Coiby Xu <coxu@redhat.com> Cc: Dave Vasilevsky <dave@vasilevsky.ca> Cc: Eric Biggers <ebiggers@google.com> Cc: Kees Cook <kees@kernel.org> Cc: Pratyush Yadav <pratyush@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-19  Merge tag 'vfs-6.17-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs  (Linus Torvalds; 1 file, -1/+5)
Pull vfs fixes from Christian Brauner:

 - Fix two memory leaks in pidfs
 - Prevent changing the idmapping of an already idmapped mount without OPEN_TREE_CLONE through open_tree_attr()
 - Don't fail listing extended attributes in kernfs when no extended attributes are set
 - Fix the return value in coredump_parse()
 - Fix the error handling for unbuffered writes in netfs
 - Fix broken data integrity guarantees for O_SYNC writes via iomap
 - Fix UAF in __mark_inode_dirty()
 - Keep inode->i_blkbits constant in fuse
 - Fix coredump selftests
 - Fix get_unused_fd_flags() usage in do_handle_open()
 - Rename EXPORT_SYMBOL_GPL_FOR_MODULES to EXPORT_SYMBOL_FOR_MODULES
 - Fix use-after-free in bh_read()
 - Fix incorrect lflags value in the move_mount() syscall

* tag 'vfs-6.17-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  signal: Fix memory leak for PIDFD_SELF* sentinels
  kernfs: don't fail listing extended attributes
  coredump: Fix return value in coredump_parse()
  fs/buffer: fix use-after-free when call bh_read() helper
  pidfs: Fix memory leak in pidfd_info()
  netfs: Fix unbuffered write error handling
  fhandle: do_handle_open() should get FD with user flags
  module: Rename EXPORT_SYMBOL_GPL_FOR_MODULES to EXPORT_SYMBOL_FOR_MODULES
  fs: fix incorrect lflags value in the move_mount syscall
  selftests/coredump: Remove the read() that fails the test
  fuse: keep inode->i_blkbits constant
  iomap: Fix broken data integrity guarantees for O_SYNC writes
  selftests/mount_setattr: add smoke tests for open_tree_attr(2) bug
  open_tree_attr: do not allow id-mapping changes without OPEN_TREE_CLONE
  fs: writeback: fix use-after-free in __mark_inode_dirty()
2025-08-19  signal: Fix memory leak for PIDFD_SELF* sentinels  (Adrian Huang (Lenovo); 1 file, -1/+5)
Commit f08d0c3a7111 ("pidfd: add PIDFD_SELF* sentinels to refer to own thread/process") introduced a leak by acquiring a pid reference through get_task_pid(), which increments pid->count but never drops it with put_pid(). As a result, kmemleak reports unreferenced pid objects after running tools/testing/selftests/pidfd/pidfd_test, for example:

  unreferenced object 0xff1100206757a940 (size 160):
    comm "pidfd_test", pid 16965, jiffies 4294853028
    hex dump (first 32 bytes):
      01 00 00 00 00 00 00 00 00 00 00 00 fd 57 50 04  .............WP.
      5e 44 00 00 00 00 00 00 18 de 34 17 01 00 11 ff  ^D........4.....
    backtrace (crc cd8844d4):
      kmem_cache_alloc_noprof+0x2f4/0x3f0
      alloc_pid+0x54/0x3d0
      copy_process+0xd58/0x1740
      kernel_clone+0x99/0x3b0
      __do_sys_clone3+0xbe/0x100
      do_syscall_64+0x7b/0x2c0
      entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fix this by calling put_pid() after do_pidfd_send_signal() returns.

Fixes: f08d0c3a7111 ("pidfd: add PIDFD_SELF* sentinels to refer to own thread/process")
Signed-off-by: Adrian Huang (Lenovo) <adrianhuang0701@gmail.com>
Link: https://lore.kernel.org/20250818134310.12273-1-adrianhuang0701@gmail.com
Tested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
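The shape of the fix, as a sketch (abridged; do_pidfd_send_signal and the pid type handling are per the changelog, not reproduced exactly):

  /* Balance the reference taken by get_task_pid() once the signal
   * has been sent. */
  pid = get_task_pid(current, PIDTYPE_TGID);
  ret = do_pidfd_send_signal(pid, sig, info, flags);
  put_pid(pid);
  return ret;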
2025-08-19  pid: change bacct_add_tsk() to use task_ppid_nr_ns()  (Oleg Nesterov; 1 file, -2/+1)
to simplify the code. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/20250810173615.GA20000@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-19  pid: make __task_pid_nr_ns(ns => NULL) safe for zombie callers  (Oleg Nesterov; 1 file, -1/+2)
task_pid_vnr(another_task) will crash if the caller was already reaped. The pid_alive(current) check can't really help, the parent/debugger can call release_task() right after this check. This also means that even task_ppid_nr_ns(current, NULL) is not safe, pid_alive() only ensures that it is safe to dereference ->real_parent. Change __task_pid_nr_ns() to ensure ns != NULL. Originally-by: 高翔 <gaoxiang17@xiaomi.com> Link: https://lore.kernel.org/all/20250802022123.3536934-1-gxxa03070307@gmail.com/ Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/20250810173604.GA19991@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-19  pid: Add a judgment for ns null in pid_nr_ns  (gaoxiang17; 1 file, -1/+1)
In __task_pid_nr_ns():

  ns = task_active_pid_ns(current);
  pid_nr_ns(rcu_dereference(*task_pid_ptr(task, type)), ns);

and in pid_nr_ns():

  if (pid && ns->level <= pid->level) {

Sometimes task_active_pid_ns() returns NULL, which then triggers a kernel panic in pid_nr_ns(). For example:

  Unable to handle kernel NULL pointer dereference at virtual address 0000000000000058
  Mem abort info:
    ESR = 0x0000000096000007
    EC = 0x25: DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
    FSC = 0x07: level 3 translation fault
  Data abort info:
    ISV = 0, ISS = 0x00000007, ISS2 = 0x00000000
    CM = 0, WnR = 0, TnD = 0, TagAccess = 0
    GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
  user pgtable: 4k pages, 39-bit VAs, pgdp=00000002175aa000
  [0000000000000058] pgd=08000002175ab003, p4d=08000002175ab003, pud=08000002175ab003, pmd=08000002175be003, pte=0000000000000000
  pstate: 834000c5 (Nzcv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
  pc : __task_pid_nr_ns+0x74/0xd0
  lr : __task_pid_nr_ns+0x24/0xd0
  sp : ffffffc08001bd10
  x29: ffffffc08001bd10 x28: ffffffd4422b2000 x27: 0000000000000001
  x26: ffffffd442821168 x25: ffffffd442821000 x24: 00000f89492eab31
  x23: 00000000000000c0 x22: ffffff806f5693c0 x21: ffffff806f5693c0
  x20: 0000000000000001 x19: 0000000000000000 x18: 0000000000000000
  x17: 00000000529c6ef0 x16: 00000000529c6ef0 x15: 00000000023a1adc
  x14: 0000000000000003 x13: 00000000007ef6d8 x12: 001167c391c78800
  x11: 00ffffffffffffff x10: 0000000000000000 x9 : 0000000000000001
  x8 : ffffff80816fa3c0 x7 : 0000000000000000 x6 : 49534d702d535449
  x5 : ffffffc080c4c2c0 x4 : ffffffd43ee128c8 x3 : ffffffd43ee124dc
  x2 : 0000000000000000 x1 : 0000000000000001 x0 : ffffff806f5693c0
  Call trace:
   __task_pid_nr_ns+0x74/0xd0
   ...
   __handle_irq_event_percpu+0xd4/0x284
   handle_irq_event+0x48/0xb0
   handle_fasteoi_irq+0x160/0x2d8
   generic_handle_domain_irq+0x44/0x60
   gic_handle_irq+0x4c/0x114
   call_on_irq_stack+0x3c/0x74
   do_interrupt_handler+0x4c/0x84
   el1_interrupt+0x34/0x58
   el1h_64_irq_handler+0x18/0x24
   el1h_64_irq+0x68/0x6c
   account_kernel_stack+0x60/0x144
   exit_task_stack_account+0x1c/0x80
   do_exit+0x7e4/0xaf8
   ...
   get_signal+0x7bc/0x8d8
   do_notify_resume+0x128/0x828
   el0_svc+0x6c/0x70
   el0t_64_sync_handler+0x68/0xbc
   el0t_64_sync+0x1a8/0x1ac
  Code: 35fffe54 911a02a8 f9400108 b4000128 (b9405a69)
  ---[ end trace 0000000000000000 ]---
  Kernel panic - not syncing: Oops: Fatal exception in interrupt

Signed-off-by: gaoxiang17 <gaoxiang17@xiaomi.com>
Link: https://lore.kernel.org/20250802022123.3536934-1-gxxa03070307@gmail.com
Reviewed-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
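The added NULL check, sketched against the upstream pid_nr_ns() (modulo surrounding context):

  pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
  {
          struct upid *upid;
          pid_t nr = 0;

          /* the "ns &&" guard is the fix */
          if (pid && ns && ns->level <= pid->level) {
                  upid = &pid->numbers[ns->level];
                  if (upid->ns == ns)
                          nr = upid->nr;
          }
          return nr;
  }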
2025-08-19  kcsan: test: Replace deprecated strcpy() with strscpy()  (Thorsten Blum; 1 file, -2/+2)
strcpy() is deprecated; use strscpy() instead. Link: https://github.com/KSPP/linux/issues/88 Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Marco Elver <elver@google.com>
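A generic before/after for this class of cleanup (not the exact kcsan test call sites):

  char dst[32];

  strcpy(dst, src);                /* before: no bounds check */
  strscpy(dst, src, sizeof(dst));  /* after: bounded, always NUL-terminated */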
2025-08-18  Merge branch 'bpf-next/skb-meta-dynptr' into 'bpf-next/master'  (Martin KaFai Lau; 3 files, -2/+26)
Merge 'skb-meta-dynptr' branch into 'master' branch. No conflict. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2025-08-18  Merge branch 'bpf-next/skb-meta-dynptr' into 'bpf-next/net'  (Martin KaFai Lau; 3 files, -2/+26)
Merge 'skb-meta-dynptr' branch into 'net' branch. No conflict. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2025-08-18  bpf: Enable read/write access to skb metadata through a dynptr  (Jakub Sitnicki; 1 file, -3/+7)
Now that we can create a dynptr to skb metadata, make reads to the metadata area possible with bpf_dynptr_read() or through a bpf_dynptr_slice(), and make writes to the metadata area possible with bpf_dynptr_write() or through a bpf_dynptr_slice_rdwr(). Note that for cloned skbs which share data with the original, we limit the skb metadata dynptr to be read-only since we don't unclone on a bpf_dynptr_write to metadata. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20250814-skb-metadata-thru-dynptr-v7-2-8a39e636e0fb@cloudflare.com
2025-08-18  bpf: Add dynptr type for skb metadata  (Jakub Sitnicki; 3 files, -2/+22)
Add a dynptr type, similar to skb dynptr, but for the skb metadata access. The dynptr provides an alternative to __sk_buff->data_meta for accessing the custom metadata area allocated using the bpf_xdp_adjust_meta() helper. More importantly, it abstracts away the fact of where the storage for the custom metadata lives, which opens up the way to persist the metadata by relocating it as the skb travels through the network stack layers. Writes to skb metadata invalidate any existing skb payload and metadata slices. While this is more restrictive than needed at the moment, it leaves the door open to reallocating the metadata on writes, and should be only a minor inconvenience to the users. Only the program types which can access __sk_buff->data_meta today are allowed to create a dynptr for skb metadata at the moment. We need to modify the network stack to persist the metadata across layers before opening up access to other BPF hooks. Once more BPF hooks gain access to skb_meta dynptr, we will also need to add a read-only variant of the helper similar to bpf_dynptr_from_skb_rdonly. skb_meta dynptr ops are stubbed out and implemented by subsequent changes. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com> Link: https://patch.msgid.link/20250814-skb-metadata-thru-dynptr-v7-1-8a39e636e0fb@cloudflare.com
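For orientation, a hypothetical BPF-program-side use of the new dynptr (a skeleton only; it assumes the bpf_dynptr_from_skb_meta kfunc from this series plus the usual vmlinux.h/bpf_helpers.h includes, and that metadata was set earlier at XDP ingress via bpf_xdp_adjust_meta()):

  SEC("tc")
  int use_skb_meta(struct __sk_buff *skb)
  {
          struct bpf_dynptr meta;
          __u32 mark = 0;

          bpf_dynptr_from_skb_meta(skb, 0, &meta);

          /* Read 4 bytes of custom metadata; fails if the metadata
           * area is missing or shorter than sizeof(mark). */
          if (bpf_dynptr_read(&mark, sizeof(mark), &meta, 0, 0))
                  return TC_ACT_OK;

          skb->mark = mark;
          return TC_ACT_OK;
  }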