| Age | Commit message (Collapse) | Author | Files | Lines |
|
Use kcalloc() in main_func() to gain built-in overflow protection, making
memory allocation safer when calculating allocation size compared to
explicit multiplication.
Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
Use kcalloc() in rcu_torture_writer() to gain built-in overflow protection,
making memory allocation safer when calculating allocation size compared to
explicit multiplication.
Change sizeof(ulo[0]) and sizeof(rgo[0]) to sizeof(*ulo) and sizeof(*rgo),
as this is more consistent with coding conventions.
Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"20 hotfixes. 10 are cc:stable and the remainder address post-6.16
issues or aren't considered necessary for -stable kernels. 17 of these
fixes are for MM.
As usual, singletons all over the place, apart from a three-patch
series of KHO followup work from Pasha which is actually also a bunch
of singletons"
* tag 'mm-hotfixes-stable-2025-08-21-18-17' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/mremap: fix WARN with uffd that has remap events disabled
mm/damon/sysfs-schemes: put damos dests dir after removing its files
mm/migrate: fix NULL movable_ops if CONFIG_ZSMALLOC=m
mm/damon/core: fix damos_commit_filter not changing allow
mm/memory-failure: fix infinite UCE for VM_PFNMAP pfn
MAINTAINERS: mark MGLRU as maintained
mm: rust: add page.rs to MEMORY MANAGEMENT - RUST
iov_iter: iterate_folioq: fix handling of offset >= folio size
selftests/damon: fix selftests by installing drgn related script
.mailmap: add entry for Easwar Hariharan
selftests/mm: add test for invalid multi VMA operations
mm/mremap: catch invalid multi VMA moves earlier
mm/mremap: allow multi-VMA move when filesystem uses thp_get_unmapped_area
mm/damon/core: fix commit_ops_filters by using correct nth function
tools/testing: add linux/args.h header and fix radix, VMA tests
mm/debug_vm_pgtable: clear page table entries at destroy_args()
squashfs: fix memory leak in squashfs_fill_super
kho: warn if KHO is disabled due to an error
kho: mm: don't allow deferred struct page with KHO
kho: init new_physxa->phys_bits to fix lockdep
|
|
When seq_nr wraps around, the next reorder job with seq 0 is hashed to
the first CPU in padata_do_serial(). Correspondingly, need reset pd->cpu
to the first one when pd->processed wraps around. Otherwise, if the
number of used CPUs is not a power of 2, padata_find_next() will be
checking a wrong list, hence deadlock.
Fixes: 6fc4dbcf0276 ("padata: Replace delayed timer with immediate workqueue in padata_reorder")
Cc: <stable@vger.kernel.org>
Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Martin KaFai Lau says:
====================
pull-request: bpf-next 2025-08-21
We've added 9 non-merge commits during the last 3 day(s) which contain
a total of 13 files changed, 1027 insertions(+), 27 deletions(-).
The main changes are:
1) Added bpf dynptr support for accessing the metadata of a skb,
from Jakub Sitnicki.
The patches are merged from a stable branch bpf-next/skb-meta-dynptr.
The same patches have also been merged into bpf-next/master.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
selftests/bpf: Cover metadata access from a modified skb clone
selftests/bpf: Cover read/write to skb metadata at an offset
selftests/bpf: Cover write access to skb metadata via dynptr
selftests/bpf: Cover read access to skb metadata via dynptr
selftests/bpf: Parametrize test_xdp_context_tuntap
selftests/bpf: Pass just bpf_map to xdp_context_test helper
selftests/bpf: Cover verifier checks for skb_meta dynptr type
bpf: Enable read/write access to skb metadata through a dynptr
bpf: Add dynptr type for skb metadata
====================
Link: https://patch.msgid.link/20250821191827.2099022-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
- Fix NULL de-ref in css_rstat_exit() which could happen after
allocation failure
- Fix a cpuset partition handling bug and a couple other misc issues
- Doc spelling fix
* tag 'cgroup-for-6.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
docs: cgroup: fixed spelling mistakes in documentation
cgroup: avoid null de-ref in css_rstat_exit()
cgroup/cpuset: Remove the unnecessary css_get/put() in cpuset_partition_write()
cgroup/cpuset: Fix a partition error with CPU hotplug
cgroup/cpuset: Use static_branch_enable_cpuslocked() on cpusets_insane_config_key
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Fix a subtle bug during SCX enabling where a dead task skips init
but doesn't skip sched class switch leading to invalid task state
transition warning
- Cosmetic fix in selftests
* tag 'sched_ext-for-6.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
selftests/sched_ext: Remove duplicate sched.h header
sched/ext: Fix invalid task state transitions on class switch
|
|
Commit 16f5dfbc851b ("gfp: include __GFP_NOWARN in GFP_NOWAIT")
made GFP_NOWAIT implicitly include __GFP_NOWARN.
Therefore, explicit __GFP_NOWARN combined with GFP_NOWAIT
(e.g., `GFP_NOWAIT | __GFP_NOWARN`) is now redundant. Let's clean
up these redundant flags across subsystems.
No functional changes.
Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250805025000.346647-1-rongqianfeng@vivo.com
|
|
Adding uprobe as another exception to the seccomp filter alongside
with the uretprobe syscall.
Same as the uretprobe the uprobe syscall is installed by kernel as
replacement for the breakpoint exception and is limited to x86_64
arch and isn't expected to ever be supported in i386.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/r/20250720112133.244369-21-jolsa@kernel.org
|
|
Putting together all the previously added pieces to support optimized
uprobes on top of 5-byte nop instruction.
The current uprobe execution goes through following:
- installs breakpoint instruction over original instruction
- exception handler hit and calls related uprobe consumers
- and either simulates original instruction or does out of line single step
execution of it
- returns to user space
The optimized uprobe path does following:
- checks the original instruction is 5-byte nop (plus other checks)
- adds (or uses existing) user space trampoline with uprobe syscall
- overwrites original instruction (5-byte nop) with call to user space
trampoline
- the user space trampoline executes uprobe syscall that calls related uprobe
consumers
- trampoline returns back to next instruction
This approach won't speed up all uprobes as it's limited to using nop5 as
original instruction, but we plan to use nop5 as USDT probe instruction
(which currently uses single byte nop) and speed up the USDT probes.
The arch_uprobe_optimize triggers the uprobe optimization and is called after
first uprobe hit. I originally had it called on uprobe installation but then
it clashed with elf loader, because the user space trampoline was added in a
place where loader might need to put elf segments, so I decided to do it after
first uprobe hit when loading is done.
The uprobe is un-optimized in arch specific set_orig_insn call.
The instruction overwrite is x86 arch specific and needs to go through 3 updates:
(on top of nop5 instruction)
- write int3 into 1st byte
- write last 4 bytes of the call instruction
- update the call instruction opcode
And cleanup goes though similar reverse stages:
- overwrite call opcode with breakpoint (int3)
- write last 4 bytes of the nop5 instruction
- write the nop5 first instruction byte
We do not unmap and release uprobe trampoline when it's no longer needed,
because there's no easy way to make sure none of the threads is still
inside the trampoline. But we do not waste memory, because there's just
single page for all the uprobe trampoline mappings.
We do waste frame on page mapping for every 4GB by keeping the uprobe
trampoline page mapped, but that seems ok.
We take the benefit from the fact that set_swbp and set_orig_insn are
called under mmap_write_lock(mm), so we can use the current instruction
as the state the uprobe is in - nop5/breakpoint/call trampoline -
and decide the needed action (optimize/un-optimize) based on that.
Attaching the speed up from benchs/run_bench_uprobes.sh script:
current:
usermode-count : 152.604 ± 0.044M/s
syscall-count : 13.359 ± 0.042M/s
--> uprobe-nop : 3.229 ± 0.002M/s
uprobe-push : 3.086 ± 0.004M/s
uprobe-ret : 1.114 ± 0.004M/s
uprobe-nop5 : 1.121 ± 0.005M/s
uretprobe-nop : 2.145 ± 0.002M/s
uretprobe-push : 2.070 ± 0.001M/s
uretprobe-ret : 0.931 ± 0.001M/s
uretprobe-nop5 : 0.957 ± 0.001M/s
after the change:
usermode-count : 152.448 ± 0.244M/s
syscall-count : 14.321 ± 0.059M/s
uprobe-nop : 3.148 ± 0.007M/s
uprobe-push : 2.976 ± 0.004M/s
uprobe-ret : 1.068 ± 0.003M/s
--> uprobe-nop5 : 7.038 ± 0.007M/s
uretprobe-nop : 2.109 ± 0.004M/s
uretprobe-push : 2.035 ± 0.001M/s
uretprobe-ret : 0.908 ± 0.001M/s
uretprobe-nop5 : 3.377 ± 0.009M/s
I see bit more speed up on Intel (above) compared to AMD. The big nop5
speed up is partly due to emulating nop5 and partly due to optimization.
The key speed up we do this for is the USDT switch from nop to nop5:
uprobe-nop : 3.148 ± 0.007M/s
uprobe-nop5 : 7.038 ± 0.007M/s
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20250720112133.244369-11-jolsa@kernel.org
|
|
Adding new uprobe syscall that calls uprobe handlers for given
'breakpoint' address.
The idea is that the 'breakpoint' address calls the user space
trampoline which executes the uprobe syscall.
The syscall handler reads the return address of the initial call
to retrieve the original 'breakpoint' address. With this address
we find the related uprobe object and call its consumers.
Adding the arch_uprobe_trampoline_mapping function that provides
uprobe trampoline mapping. This mapping is backed with one global
page initialized at __init time and shared by the all the mapping
instances.
We do not allow to execute uprobe syscall if the caller is not
from uprobe trampoline mapping.
The uprobe syscall ensures the consumer (bpf program) sees registers
values in the state before the trampoline was called.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20250720112133.244369-10-jolsa@kernel.org
|
|
Adding support to add special mapping for user space trampoline with
following functions:
uprobe_trampoline_get - find or add uprobe_trampoline
uprobe_trampoline_put - remove or destroy uprobe_trampoline
The user space trampoline is exported as arch specific user space special
mapping through tramp_mapping, which is initialized in following changes
with new uprobe syscall.
The uprobe trampoline needs to be callable/reachable from the probed address,
so while searching for available address we use is_reachable_by_call function
to decide if the uprobe trampoline is callable from the probe address.
All uprobe_trampoline objects are stored in uprobes_state object and are
cleaned up when the process mm_struct goes down. Adding new arch hooks
for that, because this change is x86_64 specific.
Locking is provided by callers in following changes.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20250720112133.244369-9-jolsa@kernel.org
|
|
Making update_ref_ctr call in uprobe_write conditional based
on do_ref_ctr argument. This way we can use uprobe_write for
instruction update without doing ref_ctr_offset update.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250720112133.244369-8-jolsa@kernel.org
|
|
The uprobe_write has special path to restore the original page when we
write original instruction back. This happens when uprobe_write detects
that we want to write anything else but breakpoint instruction.
Moving the detection away and passing it to uprobe_write as argument,
so it's possible to write different instructions (other than just
breakpoint and rest).
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250720112133.244369-7-jolsa@kernel.org
|
|
Adding nbytes argument to uprobe_write and related functions as
preparation for writing whole instructions in following changes.
Also renaming opcode arguments to insn, which seems to fit better.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250720112133.244369-6-jolsa@kernel.org
|
|
Adding uprobe_write function that does what uprobe_write_opcode did
so far, but allows to pass verify callback function that checks the
memory location before writing the opcode.
It will be used in following changes to implement specific checking
logic for instruction update.
The uprobe_write_opcode now calls uprobe_write with verify_opcode as
the verify callback.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250720112133.244369-5-jolsa@kernel.org
|
|
Making copy_from_page global and adding uprobe prefix.
Adding the uprobe prefix to copy_to_page as well for symmetry.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250720112133.244369-4-jolsa@kernel.org
|
|
We are about to add uprobe trampoline, so cleaning up the namespace.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250720112133.244369-3-jolsa@kernel.org
|
|
Currently unapply_uprobe takes mmap_read_lock, but it might call
remove_breakpoint which eventually changes user pages.
Current code writes either breakpoint or original instruction, so it can
go away with read lock as explained in here [1]. But with the upcoming
change that writes multiple instructions on the probed address we need
to ensure that any update to mm's pages is exclusive.
[1] https://lore.kernel.org/all/20240710140045.GA1084@redhat.com/
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250720112133.244369-2-jolsa@kernel.org
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fix from Masami Hiramatsu:
"Sanitize wildcard for fprobe event name
Fprobe event accepts wildcards for the target functions, but unless
the user specifies its event name, it makes an event with the
wildcards. Replace the wildcard '*' with the underscore '_'"
* tag 'probes-fixes-v6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: fprobe-event: Sanitize wildcard for fprobe event name
|
|
Fprobe event accepts wildcards for the target functions, but unless user
specifies its event name, it makes an event with the wildcards.
/sys/kernel/tracing # echo 'f mutex*' >> dynamic_events
/sys/kernel/tracing # cat dynamic_events
f:fprobes/mutex*__entry mutex*
/sys/kernel/tracing # ls events/fprobes/
enable filter mutex*__entry
To fix this, replace the wildcard ('*') with an underscore.
Link: https://lore.kernel.org/all/175535345114.282990.12294108192847938710.stgit@devnote2/
Fixes: 334e5519c375 ("tracing/probes: Add fprobe events for tracing function entry and exit.")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
|
|
This warning was triggered during testing on v6.16:
notifier callback ftrace_suspend_notifier_call already registered
WARNING: CPU: 2 PID: 86 at kernel/notifier.c:23 notifier_chain_register+0x44/0xb0
...
Call Trace:
<TASK>
blocking_notifier_chain_register+0x34/0x60
register_ftrace_graph+0x330/0x410
ftrace_profile_write+0x1e9/0x340
vfs_write+0xf8/0x420
? filp_flush+0x8a/0xa0
? filp_close+0x1f/0x30
? do_dup2+0xaf/0x160
ksys_write+0x65/0xe0
do_syscall_64+0xa4/0x260
entry_SYSCALL_64_after_hwframe+0x77/0x7f
When writing to the function_profile_enabled interface, the notifier was
not unregistered after start_graph_tracing failed, causing a warning the
next time function_profile_enabled was written.
Fixed by adding unregister_pm_notifier in the exception path.
Link: https://lore.kernel.org/20250818073332.3890629-1-yeweihua4@huawei.com
Fixes: 4a2b8dda3f870 ("tracing/function-graph-tracer: fix a regression while suspend to disk")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Ye Weihua <yeweihua4@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Remove unnecessary semicolons.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250813095114.559530-1-liaoyuanhong@vivo.com
Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
When the length of the string written to set_ftrace_filter exceeds
FTRACE_BUFF_MAX, the following KASAN alarm will be triggered:
BUG: KASAN: slab-out-of-bounds in strsep+0x18c/0x1b0
Read of size 1 at addr ffff0000d00bd5ba by task ash/165
CPU: 1 UID: 0 PID: 165 Comm: ash Not tainted 6.16.0-g6bcdbd62bd56-dirty
Hardware name: linux,dummy-virt (DT)
Call trace:
show_stack+0x34/0x50 (C)
dump_stack_lvl+0xa0/0x158
print_address_description.constprop.0+0x88/0x398
print_report+0xb0/0x280
kasan_report+0xa4/0xf0
__asan_report_load1_noabort+0x20/0x30
strsep+0x18c/0x1b0
ftrace_process_regex.isra.0+0x100/0x2d8
ftrace_regex_release+0x484/0x618
__fput+0x364/0xa58
____fput+0x28/0x40
task_work_run+0x154/0x278
do_notify_resume+0x1f0/0x220
el0_svc+0xec/0xf0
el0t_64_sync_handler+0xa0/0xe8
el0t_64_sync+0x1ac/0x1b0
The reason is that trace_get_user will fail when processing a string
longer than FTRACE_BUFF_MAX, but not set the end of parser->buffer to 0.
Then an OOB access will be triggered in ftrace_regex_release->
ftrace_process_regex->strsep->strpbrk. We can solve this problem by
limiting access to parser->buffer when trace_get_user failed.
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250813040232.1344527-1-pulehui@huaweicloud.com
Fixes: 8c9af478c06b ("ftrace: Handle commands when closing set_ftrace_filter file")
Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
During boot scratch area is allocated based on command line parameters or
auto calculated. However, scratch area may fail to allocate, and in that
case KHO is disabled. Currently, no warning is printed that KHO is
disabled, which makes it confusing for the end user to figure out why KHO
is not available. Add the missing warning message.
Link: https://lkml.kernel.org/r/20250808201804.772010-4-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Coiby Xu <coxu@redhat.com>
Cc: Dave Vasilevsky <dave@vasilevsky.ca>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
KHO uses struct pages for the preserved memory early in boot, however,
with deferred struct page initialization, only a small portion of memory
has properly initialized struct pages.
This problem was detected where vmemmap is poisoned, and illegal flag
combinations are detected.
Don't allow them to be enabled together, and later we will have to teach
KHO to work properly with deferred struct page init kernel feature.
Link: https://lkml.kernel.org/r/20250808201804.772010-3-pasha.tatashin@soleen.com
Fixes: 4e1d010e3bda ("kexec: add config option for KHO")
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Coiby Xu <coxu@redhat.com>
Cc: Dave Vasilevsky <dave@vasilevsky.ca>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Several KHO Hotfixes".
Three unrelated fixes for Kexec Handover.
This patch (of 3):
Lockdep shows the following warning:
INFO: trying to register non-static key. The code is fine but needs
lockdep annotation, or maybe you didn't initialize this object before use?
turning off the locking correctness validator.
[<ffffffff810133a6>] dump_stack_lvl+0x66/0xa0
[<ffffffff8136012c>] assign_lock_key+0x10c/0x120
[<ffffffff81358bb4>] register_lock_class+0xf4/0x2f0
[<ffffffff813597ff>] __lock_acquire+0x7f/0x2c40
[<ffffffff81360cb0>] ? __pfx_hlock_conflict+0x10/0x10
[<ffffffff811707be>] ? native_flush_tlb_global+0x8e/0xa0
[<ffffffff8117096e>] ? __flush_tlb_all+0x4e/0xa0
[<ffffffff81172fc2>] ? __kernel_map_pages+0x112/0x140
[<ffffffff813ec327>] ? xa_load_or_alloc+0x67/0xe0
[<ffffffff81359556>] lock_acquire+0xe6/0x280
[<ffffffff813ec327>] ? xa_load_or_alloc+0x67/0xe0
[<ffffffff8100b9e0>] _raw_spin_lock+0x30/0x40
[<ffffffff813ec327>] ? xa_load_or_alloc+0x67/0xe0
[<ffffffff813ec327>] xa_load_or_alloc+0x67/0xe0
[<ffffffff813eb4c0>] kho_preserve_folio+0x90/0x100
[<ffffffff813ebb7f>] __kho_finalize+0xcf/0x400
[<ffffffff813ebef4>] kho_finalize+0x34/0x70
This is becase xa has its own lock, that is not initialized in
xa_load_or_alloc.
Modifiy __kho_preserve_order(), to properly call
xa_init(&new_physxa->phys_bits);
Link: https://lkml.kernel.org/r/20250808201804.772010-2-pasha.tatashin@soleen.com
Fixes: fc33e4b44b27 ("kexec: enable KHO support for memory preservation")
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Coiby Xu <coxu@redhat.com>
Cc: Dave Vasilevsky <dave@vasilevsky.ca>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Pratyush Yadav <pratyush@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Fix two memory leaks in pidfs
- Prevent changing the idmapping of an already idmapped mount without
OPEN_TREE_CLONE through open_tree_attr()
- Don't fail listing extended attributes in kernfs when no extended
attributes are set
- Fix the return value in coredump_parse()
- Fix the error handling for unbuffered writes in netfs
- Fix broken data integrity guarantees for O_SYNC writes via iomap
- Fix UAF in __mark_inode_dirty()
- Keep inode->i_blkbits constant in fuse
- Fix coredump selftests
- Fix get_unused_fd_flags() usage in do_handle_open()
- Rename EXPORT_SYMBOL_GPL_FOR_MODULES to EXPORT_SYMBOL_FOR_MODULES
- Fix use-after-free in bh_read()
- Fix incorrect lflags value in the move_mount() syscall
* tag 'vfs-6.17-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
signal: Fix memory leak for PIDFD_SELF* sentinels
kernfs: don't fail listing extended attributes
coredump: Fix return value in coredump_parse()
fs/buffer: fix use-after-free when call bh_read() helper
pidfs: Fix memory leak in pidfd_info()
netfs: Fix unbuffered write error handling
fhandle: do_handle_open() should get FD with user flags
module: Rename EXPORT_SYMBOL_GPL_FOR_MODULES to EXPORT_SYMBOL_FOR_MODULES
fs: fix incorrect lflags value in the move_mount syscall
selftests/coredump: Remove the read() that fails the test
fuse: keep inode->i_blkbits constant
iomap: Fix broken data integrity guarantees for O_SYNC writes
selftests/mount_setattr: add smoke tests for open_tree_attr(2) bug
open_tree_attr: do not allow id-mapping changes without OPEN_TREE_CLONE
fs: writeback: fix use-after-free in __mark_inode_dirty()
|
|
Commit f08d0c3a7111 ("pidfd: add PIDFD_SELF* sentinels to refer to own
thread/process") introduced a leak by acquiring a pid reference through
get_task_pid(), which increments pid->count but never drops it with
put_pid().
As a result, kmemleak reports unreferenced pid objects after running
tools/testing/selftests/pidfd/pidfd_test, for example:
unreferenced object 0xff1100206757a940 (size 160):
comm "pidfd_test", pid 16965, jiffies 4294853028
hex dump (first 32 bytes):
01 00 00 00 00 00 00 00 00 00 00 00 fd 57 50 04 .............WP.
5e 44 00 00 00 00 00 00 18 de 34 17 01 00 11 ff ^D........4.....
backtrace (crc cd8844d4):
kmem_cache_alloc_noprof+0x2f4/0x3f0
alloc_pid+0x54/0x3d0
copy_process+0xd58/0x1740
kernel_clone+0x99/0x3b0
__do_sys_clone3+0xbe/0x100
do_syscall_64+0x7b/0x2c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fix this by calling put_pid() after do_pidfd_send_signal() returns.
Fixes: f08d0c3a7111 ("pidfd: add PIDFD_SELF* sentinels to refer to own thread/process")
Signed-off-by: Adrian Huang (Lenovo) <adrianhuang0701@gmail.com>
Link: https://lore.kernel.org/20250818134310.12273-1-adrianhuang0701@gmail.com
Tested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
to simplify the code.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/20250810173615.GA20000@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
task_pid_vnr(another_task) will crash if the caller was already reaped.
The pid_alive(current) check can't really help, the parent/debugger can
call release_task() right after this check.
This also means that even task_ppid_nr_ns(current, NULL) is not safe,
pid_alive() only ensures that it is safe to dereference ->real_parent.
Change __task_pid_nr_ns() to ensure ns != NULL.
Originally-by: 高翔 <gaoxiang17@xiaomi.com>
Link: https://lore.kernel.org/all/20250802022123.3536934-1-gxxa03070307@gmail.com/
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/20250810173604.GA19991@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
__task_pid_nr_ns
ns = task_active_pid_ns(current);
pid_nr_ns(rcu_dereference(*task_pid_ptr(task, type)), ns);
if (pid && ns->level <= pid->level) {
Sometimes null is returned for task_active_pid_ns. Then it will trigger kernel panic in pid_nr_ns.
For example:
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000058
Mem abort info:
ESR = 0x0000000096000007
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
FSC = 0x07: level 3 translation fault
Data abort info:
ISV = 0, ISS = 0x00000007, ISS2 = 0x00000000
CM = 0, WnR = 0, TnD = 0, TagAccess = 0
GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
user pgtable: 4k pages, 39-bit VAs, pgdp=00000002175aa000
[0000000000000058] pgd=08000002175ab003, p4d=08000002175ab003, pud=08000002175ab003, pmd=08000002175be003, pte=0000000000000000
pstate: 834000c5 (Nzcv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
pc : __task_pid_nr_ns+0x74/0xd0
lr : __task_pid_nr_ns+0x24/0xd0
sp : ffffffc08001bd10
x29: ffffffc08001bd10 x28: ffffffd4422b2000 x27: 0000000000000001
x26: ffffffd442821168 x25: ffffffd442821000 x24: 00000f89492eab31
x23: 00000000000000c0 x22: ffffff806f5693c0 x21: ffffff806f5693c0
x20: 0000000000000001 x19: 0000000000000000 x18: 0000000000000000
x17: 00000000529c6ef0 x16: 00000000529c6ef0 x15: 00000000023a1adc
x14: 0000000000000003 x13: 00000000007ef6d8 x12: 001167c391c78800
x11: 00ffffffffffffff x10: 0000000000000000 x9 : 0000000000000001
x8 : ffffff80816fa3c0 x7 : 0000000000000000 x6 : 49534d702d535449
x5 : ffffffc080c4c2c0 x4 : ffffffd43ee128c8 x3 : ffffffd43ee124dc
x2 : 0000000000000000 x1 : 0000000000000001 x0 : ffffff806f5693c0
Call trace:
__task_pid_nr_ns+0x74/0xd0
...
__handle_irq_event_percpu+0xd4/0x284
handle_irq_event+0x48/0xb0
handle_fasteoi_irq+0x160/0x2d8
generic_handle_domain_irq+0x44/0x60
gic_handle_irq+0x4c/0x114
call_on_irq_stack+0x3c/0x74
do_interrupt_handler+0x4c/0x84
el1_interrupt+0x34/0x58
el1h_64_irq_handler+0x18/0x24
el1h_64_irq+0x68/0x6c
account_kernel_stack+0x60/0x144
exit_task_stack_account+0x1c/0x80
do_exit+0x7e4/0xaf8
...
get_signal+0x7bc/0x8d8
do_notify_resume+0x128/0x828
el0_svc+0x6c/0x70
el0t_64_sync_handler+0x68/0xbc
el0t_64_sync+0x1a8/0x1ac
Code: 35fffe54 911a02a8 f9400108 b4000128 (b9405a69)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Oops: Fatal exception in interrupt
Signed-off-by: gaoxiang17 <gaoxiang17@xiaomi.com>
Link: https://lore.kernel.org/20250802022123.3536934-1-gxxa03070307@gmail.com
Reviewed-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
strcpy() is deprecated; use strscpy() instead.
Link: https://github.com/KSPP/linux/issues/88
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Marco Elver <elver@google.com>
|
|
Merge 'skb-meta-dynptr' branch into 'master' branch. No conflict.
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
Merge 'skb-meta-dynptr' branch into 'net' branch. No conflict.
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
Now that we can create a dynptr to skb metadata, make reads to the metadata
area possible with bpf_dynptr_read() or through a bpf_dynptr_slice(), and
make writes to the metadata area possible with bpf_dynptr_write() or
through a bpf_dynptr_slice_rdwr().
Note that for cloned skbs which share data with the original, we limit the
skb metadata dynptr to be read-only since we don't unclone on a
bpf_dynptr_write to metadata.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250814-skb-metadata-thru-dynptr-v7-2-8a39e636e0fb@cloudflare.com
|
|
Add a dynptr type, similar to skb dynptr, but for the skb metadata access.
The dynptr provides an alternative to __sk_buff->data_meta for accessing
the custom metadata area allocated using the bpf_xdp_adjust_meta() helper.
More importantly, it abstracts away the fact where the storage for the
custom metadata lives, which opens up the way to persist the metadata by
relocating it as the skb travels through the network stack layers.
Writes to skb metadata invalidate any existing skb payload and metadata
slices. While this is more restrictive that needed at the moment, it leaves
the door open to reallocating the metadata on writes, and should be only a
minor inconvenience to the users.
Only the program types which can access __sk_buff->data_meta today are
allowed to create a dynptr for skb metadata at the moment. We need to
modify the network stack to persist the metadata across layers before
opening up access to other BPF hooks.
Once more BPF hooks gain access to skb_meta dynptr, we will also need to
add a read-only variant of the helper similar to
bpf_dynptr_from_skb_rdonly.
skb_meta dynptr ops are stubbed out and implemented by subsequent changes.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20250814-skb-metadata-thru-dynptr-v7-1-8a39e636e0fb@cloudflare.com
|
|
When a BPF program which is being loaded reaches the map limit
(MAX_USED_MAPS) or the BTF limit (MAX_USED_BTFS) the -E2BIG is
returned. However, in the former case there is an accompanying
verifier verbose message, and in the latter case there is not.
Add a verbose message to make the behaviour symmetrical.
Reported-by: Kevin Sheldrake <kevin.sheldrake@isovalent.com>
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20250816151554.902995-1-a.s.protopopov@gmail.com
|
|
The get_next_cpu() function was only used in one place to find
the next possible CPU, which can be replaced by cpumask_next_wrap().
Signed-off-by: Fushuai Wang <wangfushuai@baidu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20250818032344.23229-1-wangfushuai@baidu.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Borislav Petkov:
- Make sure sanity checks down in the mutex lock path happen on the
correct type of task so that they don't trigger falsely
- Use the write unsafe user access pairs when writing a futex value to
prevent an error on PowerPC which does user read and write accesses
differently
* tag 'locking_urgent_for_v6.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking: Fix __clear_task_blocked_on() warning from __ww_mutex_wound() path
futex: Use user_write_access_begin/_end() in futex_put_value()
|
|
strcpy() is deprecated; use strscpy() and memcpy() instead.
In param_set_copystring(), we can safely use memcpy() because we already
know the length of the source string 'val' and that it is guaranteed to
be NUL-terminated within the first 'kps->maxlen' bytes.
Link: https://github.com/KSPP/linux/issues/88
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Link: https://lore.kernel.org/r/20250813132200.184064-2-thorsten.blum@linux.dev
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
|
|
Graph tracer framework ensures we won't migrate, kprobe_multi_link_prog_run
called all the way from graph tracer, which disables preemption in
function_graph_enter_regs, as Jiri and Yonghong suggested, there is no
need to use migrate_disable. As a result, some overhead may will be reduced.
And add cant_sleep check for __this_cpu_inc_return.
Fixes: 0dcac2725406 ("bpf: Add multi kprobe link")
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250814121430.2347454-1-chen.dylane@linux.dev
|
|
The recently fixed reference count leaks could have been detected by using
refcount_t and refcount_t would have mitigated the potential overflow at
least.
Now that the code is properly structured, convert the mmap() related
mmap_count variants over to refcount_t.
No functional change intended.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104020.071507932@infradead.org
|
|
Needed because refcount_inc() doesn't allow the 0->1 transition.
Specifically, this is the case where we've created the RB, this means
there was no RB, and as such there could not have been an mmap.
Additionally we hold mmap_mutex to serialize everything.
This must be the first.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250812104019.956479989@infradead.org
|
|
Mostly just re-indent noise.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104019.838047976@infradead.org
|
|
Move the RB buffer allocation branch into its own function.
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104019.722214699@infradead.org
|
|
Ensure @rb usage doesn't extend out of the branch block.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104019.605285302@infradead.org
|
|
Move the AUX buffer allocation branch into its own function.
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104019.494205648@infradead.org
|
|
Mostly re-indent noise needed to get rid of that label.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104019.362581570@infradead.org
|
|
After duplicating the common code into the rb/aux branches is it
possible to use a simple guard() for the aux_mutex. Making the aux
branch self-contained.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104019.246250452@infradead.org
|