| Age | Commit message (Collapse) | Author | Lines |
|
The system call trace events call trace_user_fault_read() to read the user
space part of some system calls. This is done by grabbing a per-cpu
buffer, disabling migration, enabling preemption, calling
copy_from_user(), disabling preemption, enabling migration and checking if
the task was preempted while preemption was enabled. If it was, the buffer
is considered corrupted and it tries again.
There's a safety mechanism that will fail out of this loop if it fails 100
times (with a warning). That warning message was triggered in some
pi_futex stress tests. Enabling the sched_switch trace event and
traceoff_on_warning, showed the problem:
pi_mutex_hammer-1375 [006] d..21 138.981648: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981651: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981656: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981659: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981664: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981667: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981671: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981675: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981679: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981682: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981687: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981690: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981695: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981698: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981703: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981706: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981711: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981714: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981719: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981722: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981727: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981730: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981735: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981738: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
What happened was the task 1375 was flagged to be migrated. When
preemption was enabled, the migration thread woke up to migrate that task,
but failed because migration for that task was disabled. This caused the
loop to fail to exit because the task scheduled out while trying to read
user space.
Every time the task enabled preemption the migration thread would schedule
in, try to migrate the task, fail and let the task continue. But because
the loop would only enable preemption with migration disabled, it would
always fail because each time it enabled preemption to read user space,
the migration thread would try to migrate it.
To solve this, when the loop fails to read user space without being
scheduled out, enabled and disable preemption with migration enabled. This
will allow the migration task to successfully migrate the task and the
next loop should succeed to read user space without being scheduled out.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260316130734.1858a998@gandalf.local.home
Fixes: 64cf7d058a005 ("tracing: Have trace_marker use per-cpu data to read user space")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Resolve conflict between this change in the upstream kernel:
4c652a47722f ("rseq: Mark rseq_arm_slice_extension_timer() __always_inline")
... and this pending change in timers/core:
0e98eb14814e ("entry: Prepare for deferred hrtimer rearming")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Restore the SPDX tag that was accidentally dropped.
Fixes: 7e4b6c94300e3 ("tracing: add more symbols to whitelist")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://patch.msgid.link/20260317194252.1890568-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Compiler and tooling-generated symbols are difficult to maintain
across all supported architectures. Make the allowlist more robust by
replacing the harcoded list with a mechanism that automatically detects
these symbols.
This mechanism generates a C function designed to trigger common
compiler-inserted symbols.
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Tested-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20260316092845.3367411-1-vdonnefort@google.com
[maz: added __msan prefix to allowlist as pointed out by Arnd]
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Randconfig builds show a number of cryptic build errors from
hitting undefined symbols in simple_ring_buffer.o:
make[7]: *** [/home/arnd/arm-soc/kernel/trace/Makefile:147: kernel/trace/simple_ring_buffer.o.checked] Error 1
These happen with CONFIG_TRACE_BRANCH_PROFILING, CONFIG_KASAN_HW_TAGS,
CONFIG_STACKPROTECTOR, CONFIG_DEBUG_IRQFLAGS and indirectly from WARN_ON().
Add exceptions for each one that I have hit so far on arm64, x86_64 and arm
randconfig builds.
Other architectures likely hit additional ones, so it would be nice
to produce a little more verbose output that include the name of the
missing symbols directly.
Fixes: a717943d8ecc ("tracing: Check for undefined symbols in simple_ring_buffer")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20260312123601.625063-2-arnd@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Undefined symbols are not allowed for simple_ring_buffer.c. But some
compiler emitted symbols are missing in the allowlist. Update it.
Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Fixes: a717943d8ecc ("tracing: Check for undefined symbols in simple_ring_buffer")
Closes: https://lore.kernel.org/all/20260311221816.GA316631@ax162/
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://patch.msgid.link/20260312113535.2213350-1-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The sentinel value added by the wrapper macros __print_symbolic() et al
prevents the callers from adding their own trailing comma. This makes
constructing symbol list dynamically based on kconfig values tedious.
Drop the sentinel elements, so callers can either specify the trailing
comma or not, just like in regular array initializers.
Signed-off-by: Thomas Weißschuh (Schneider Electric) <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260311-hrtimer-cleanups-v1-2-095357392669@linutronix.de
|
|
The simple_ring_buffer implementation must remain simple enough to be
used by the pKVM hypervisor. Prevent the object build if unresolved
symbols are found.
Link: https://patch.msgid.link/20260309162516.2623589-19-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Add load/unload callback used for each admitted page in the ring-buffer.
This will be later useful for the pKVM hypervisor which uses a different
VA space and need to dynamically map/unmap the ring-buffer pages.
Link: https://patch.msgid.link/20260309162516.2623589-18-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Add a module to help testing the tracefs support for trace remotes. This
module:
* Use simple_ring_buffer to write into a ring-buffer.
* Declare a single "selftest" event that can be triggered from
user-space.
* Register a "test" trace remote.
This is intended to be used by trace remote selftests.
Link: https://patch.msgid.link/20260309162516.2623589-15-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Add a simple implementation of the kernel ring-buffer. This intends to
be used later by ring-buffer remotes such as the pKVM hypervisor, hence
the need for a cut down version (write only) without any dependency.
Link: https://patch.msgid.link/20260309162516.2623589-14-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
In preparation for allowing the writing of ring-buffer compliant pages
outside of ring_buffer.c, move buffer_data_page and timestamps encoding
macros into the publicly available ring_buffer_types.h.
Link: https://patch.msgid.link/20260309162516.2623589-13-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Just like for the kernel events directory, add 'enable', 'header_page'
and 'header_event' at the root of the trace remote events/ directory.
Link: https://patch.msgid.link/20260309162516.2623589-11-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
An event is predefined point in the writer code that allows to log
data. Following the same scheme as kernel events, add remote events,
described to user-space within the events/ tracefs directory found in
the corresponding trace remote.
Remote events are expected to be described during the trace remote
registration.
Add also a .enable_event callback for trace_remote to toggle the event
logging, if supported.
Link: https://patch.msgid.link/20260309162516.2623589-10-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Add a .init call back so the trace remote callers can add entries to the
tracefs directory.
Link: https://patch.msgid.link/20260309162516.2623589-9-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Allow reading the trace file for trace remotes. This performs a
non-consuming read of the trace buffer.
Link: https://patch.msgid.link/20260309162516.2623589-8-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Allow to reset the trace remote buffer by writing to the Tracefs "trace"
file. This is similar to the regular Tracefs interface.
Link: https://patch.msgid.link/20260309162516.2623589-7-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
A trace remote relies on ring-buffer remotes to read and control
compatible tracing buffers, written by entity such as firmware or
hypervisor.
Add a Tracefs directory remotes/ that contains all instances of trace
remotes. Each instance follows the same hierarchy as any other to ease
the support by existing user-space tools.
This currently does not provide any event support, which will come
later.
Link: https://patch.msgid.link/20260309162516.2623589-6-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Hopefully, the remote will only swap pages on the kernel instruction (via
the swap_reader_page() callback). This means we know at what point the
ring-buffer geometry has changed. It is therefore possible to rearrange
the kernel view of that ring-buffer to allow non-consuming read.
Link: https://patch.msgid.link/20260309162516.2623589-5-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Add ring-buffer remotes to support entities outside of the kernel (such
as firmware or a hypervisor) that writes events into a ring-buffer using
the tracefs format
Require a description of the ring-buffer pages (struct
trace_buffer_desc) and callbacks (swap_reader_page and reset) to set up
the ring-buffer on the kernel side.
Expect the remote entity to maintain and update the meta-page.
Link: https://patch.msgid.link/20260309162516.2623589-4-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The subbuf_ids field allows to point to a specific page from the
ring-buffer based on its ID. As a preparation or the upcoming
ring-buffer remote support, point this array to the buffer_page instead
of the buffer_data_page.
Link: https://patch.msgid.link/20260309162516.2623589-3-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Add two fields pages_touched and pages_lost to the ring-buffer
meta-page. Those fields are useful to get the number of used pages in
the ring-buffer.
Link: https://patch.msgid.link/20260309162516.2623589-2-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Pull bpf fixes from Alexei Starovoitov:
- Fix u32/s32 bounds when ranges cross min/max boundary (Eduard
Zingerman)
- Fix precision backtracking with linked registers (Eduard Zingerman)
- Fix linker flags detection for resolve_btfids (Ihor Solodrai)
- Fix race in update_ftrace_direct_add/del (Jiri Olsa)
- Fix UAF in bpf_trampoline_link_cgroup_shim (Lang Xu)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
resolve_btfids: Fix linker flags detection
selftests/bpf: add reproducer for spurious precision propagation through calls
bpf: collect only live registers in linked regs
Revert "selftests/bpf: Update reg_bound range refinement logic"
selftests/bpf: test refining u32/s32 bounds when ranges cross min/max boundary
bpf: Fix u32/s32 bounds when ranges cross min/max boundary
bpf: Fix a UAF issue in bpf_trampoline_link_cgroup_shim
ftrace: Add missing ftrace_lock to update_ftrace_direct_add/del
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix possible NULL pointer dereference in trace_data_alloc()
On the trace_data_alloc() error path, it can call trigger_data_free()
with a NULL pointer. This used to be a kfree() but was changed to
trigger_data_free() to clean up any partial initialization. The issue
is that trigger_data_free() does not expect a NULL pointer. Have
trigger_data_free() return safely on NULL pointer.
- Fix multiple events on the command line and bootconfig
If multiple events are enabled on the command line separately and not
grouped, only the last event gets enabled. That is:
trace_event=sched_switch trace_event=sched_waking
will only enable sched_waking whereas:
trace_event=sched_switch,sched_waking
will enable both.
The bootconfig makes it even worse as the second way is the more
common method.
The issue is that a temporary buffer is used to store the events to
enable later in boot. Each time the cmdline callback is called, it
overwrites what was previously there.
Have the callback append the next value (delimited by a comma) if the
temporary buffer already has content.
- Fix command line trace_buffer_size if >= 2G
The logic to allocate the trace buffer uses "int" for the size
parameter in the command line code causing overflow issues if more
that 2G is specified.
* tag 'trace-v7.0-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix trace_buf_size= cmdline parameter with sizes >= 2G
tracing: Fix enabling multiple events on the kernel command line and bootconfig
tracing: Add NULL pointer check to trigger_data_free()
|
|
Some of the sizing logic through tracer_alloc_buffers() uses int
internally, causing unexpected behavior if the user passes a value that
does not fit in an int (on my x86 machine, the result is uselessly tiny
buffers).
Fix by plumbing the parameter's real type (unsigned long) through to the
ring buffer allocation functions, which already use unsigned long.
It has always been possible to create larger ring buffers via the sysfs
interface: this only affects the cmdline parameter.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/bff42a4288aada08bdf74da3f5b67a2c28b761f8.1772852067.git.calvin@wbinvd.org
Fixes: 73c5162aa362 ("tracing: keep ring buffer to minimum size till used")
Signed-off-by: Calvin Owens <calvin@wbinvd.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Multiple events can be enabled on the kernel command line via a comma
separator. But if the are specified one at a time, then only the last
event is enabled. This is because the event names are saved in a temporary
buffer, and each call by the init cmdline code will reset that buffer.
This also affects names in the boot config file, as it may call the
callback multiple times with an example of:
kernel.trace_event = ":mod:rproc_qcom_common", ":mod:qrtr", ":mod:qcom_aoss"
Change the cmdline callback function to append a comma and the next value
if the temporary buffer already has content.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260302-trace-events-allow-multiple-modules-v1-1-ce4436e37fb8@oss.qualcomm.com
Signed-off-by: Andrei-Alexandru Tachici <andrei-alexandru.tachici@oss.qualcomm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
If trigger_data_alloc() fails and returns NULL, event_hist_trigger_parse()
jumps to the out_free error path. While kfree() safely handles a NULL
pointer, trigger_data_free() does not. This causes a NULL pointer
dereference in trigger_data_free() when evaluating
data->cmd_ops->set_filter.
Fix the problem by adding a NULL pointer check to trigger_data_free().
The problem was found by an experimental code review agent based on
gemini-3.1-pro while reviewing backports into v6.18.y.
Cc: Miaoqian Lin <linmq006@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://patch.msgid.link/20260305193339.2810953-1-linux@roeck-us.net
Fixes: 0550069cc25f ("tracing: Properly process error handling in event_hist_trigger_parse()")
Assisted-by: Gemini:gemini-3.1-pro
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block fixes from Jens Axboe:
- NVMe pull request via Keith:
- Improve quirk visibility and configurability (Maurizio)
- Fix runtime user modification to queue setup (Keith)
- Fix multipath leak on try_module_get failure (Keith)
- Ignore ambiguous spec definitions for better atomics support
(John)
- Fix admin queue leak on controller reset (Ming)
- Fix large allocation in persistent reservation read keys
(Sungwoo Kim)
- Fix fcloop callback handling (Justin)
- Securely free DHCHAP secrets (Daniel)
- Various cleanups and typo fixes (John, Wilfred)
- Avoid a circular lock dependency issue in the sysfs nr_requests or
scheduler store handling
- Fix a circular lock dependency with the pcpu mutex and the queue
freeze lock
- Cleanup for bio_copy_kern(), using __bio_add_page() rather than the
bio_add_page(), as adding a page here cannot fail. The exiting code
had broken cleanup for the error condition, so make it clear that the
error condition cannot happen
- Fix for a __this_cpu_read() in preemptible context splat
* tag 'block-7.0-20260305' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
block: use trylock to avoid lockdep circular dependency in sysfs
nvme: fix memory allocation in nvme_pr_read_keys()
block: use __bio_add_page in bio_copy_kern
block: break pcpu_alloc_mutex dependency on freeze_lock
blktrace: fix __this_cpu_read/write in preemptible context
nvme-multipath: fix leak on try_module_get failure
nvmet-fcloop: Check remoteport port_state before calling done callback
nvme-pci: do not try to add queue maps at runtime
nvme-pci: cap queue creation to used queues
nvme-pci: ensure we're polling a polled queue
nvme: fix memory leak in quirks_param_set()
nvme: correct comment about nvme_ns_remove()
nvme: stop setting namespace gendisk device driver data
nvme: add support for dynamic quirk configuration via module parameter
nvme: fix admin queue leak on controller reset
nvme-fabrics: use kfree_sensitive() for DHCHAP secrets
nvme: stop using AWUPF
nvme: expose active quirks in sysfs
nvme/host: fixup some typos
|
|
When a process forks, the child process copies the parent's VMAs but the
user_mapped reference count is not incremented. As a result, when both the
parent and child processes exit, tracing_buffers_mmap_close() is called
twice. On the second call, user_mapped is already 0, causing the function to
return -ENODEV and triggering a WARN_ON.
Normally, this isn't an issue as the memory is mapped with VM_DONTCOPY set.
But this is only a hint, and the application can call
madvise(MADVISE_DOFORK) which resets the VM_DONTCOPY flag. When the
application does that, it can trigger this issue on fork.
Fix it by incrementing the user_mapped reference count without re-mapping
the pages in the VMA's open callback.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://patch.msgid.link/20260227025842.1085206-1-wangqing7171@gmail.com
Fixes: cf9f0f7c4c5bb ("tracing: Allow user-space mapping of the ring-buffer")
Reported-by: syzbot+3b5dd2030fe08afdf65d@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=3b5dd2030fe08afdf65d
Tested-by: syzbot+3b5dd2030fe08afdf65d@syzkaller.appspotmail.com
Signed-off-by: Qing Wang <wangqing7171@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Filtering PIDs for events triggered the following during selftests:
[37] event tracing - restricts events based on pid notrace filtering
[ 155.874095]
[ 155.874869] =============================
[ 155.876037] WARNING: suspicious RCU usage
[ 155.877287] 7.0.0-rc1-00004-g8cd473a19bc7 #7 Not tainted
[ 155.879263] -----------------------------
[ 155.882839] kernel/trace/trace_events.c:1057 suspicious rcu_dereference_check() usage!
[ 155.889281]
[ 155.889281] other info that might help us debug this:
[ 155.889281]
[ 155.894519]
[ 155.894519] rcu_scheduler_active = 2, debug_locks = 1
[ 155.898068] no locks held by ftracetest/4364.
[ 155.900524]
[ 155.900524] stack backtrace:
[ 155.902645] CPU: 1 UID: 0 PID: 4364 Comm: ftracetest Not tainted 7.0.0-rc1-00004-g8cd473a19bc7 #7 PREEMPT(lazy)
[ 155.902648] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
[ 155.902651] Call Trace:
[ 155.902655] <TASK>
[ 155.902659] dump_stack_lvl+0x67/0x90
[ 155.902665] lockdep_rcu_suspicious+0x154/0x1a0
[ 155.902672] event_filter_pid_sched_process_fork+0x9a/0xd0
[ 155.902678] kernel_clone+0x367/0x3a0
[ 155.902689] __x64_sys_clone+0x116/0x140
[ 155.902696] do_syscall_64+0x158/0x460
[ 155.902700] ? entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 155.902702] ? trace_irq_disable+0x1d/0xc0
[ 155.902709] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 155.902711] RIP: 0033:0x4697c3
[ 155.902716] Code: 1f 84 00 00 00 00 00 64 48 8b 04 25 10 00 00 00 45 31 c0 31 d2 31 f6 bf 11 00 20 01 4c 8d 90 d0 02 00 00 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 89 c2 85 c0 75 2c 64 48 8b 04 25 10 00 00
[ 155.902718] RSP: 002b:00007ffc41150428 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
[ 155.902721] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00000000004697c3
[ 155.902722] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[ 155.902724] RBP: 0000000000000000 R08: 0000000000000000 R09: 000000003fccf990
[ 155.902725] R10: 000000003fccd690 R11: 0000000000000246 R12: 0000000000000001
[ 155.902726] R13: 000000003fce8103 R14: 0000000000000001 R15: 0000000000000000
[ 155.902733] </TASK>
[ 155.902747]
The tracepoint callbacks recently were changed to allow preemption. The
event PID filtering callbacks that were attached to the fork and exit
tracepoints expected preemption disabled in order to access the RCU
protected PID lists.
Add a guard(preempt)() to protect the references to the PID list.
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260303215738.6ab275af@fedora
Fixes: a46023d5616e ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast")
Link: https://patch.msgid.link/20260303131706.96057f61a48a34c43ce1e396@kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
When function trace PID filtering is enabled, the function tracer will
attach a callback to the fork tracepoint as well as the exit tracepoint
that will add the forked child PID to the PID filtering list as well as
remove the PID that is exiting.
Commit a46023d5616e ("tracing: Guard __DECLARE_TRACE() use of
__DO_TRACE_CALL() with SRCU-fast") removed the disabling of preemption
when calling tracepoint callbacks.
The callbacks used for the PID filtering accounting depended on preemption
being disabled, and now the trigger a "suspicious RCU usage" warning message.
Make them explicitly disable preemption.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260302213546.156e3e4f@gandalf.local.home
Fixes: a46023d5616e ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
When multiple syscall events are specified in the kernel command line
(e.g., trace_event=syscalls:sys_enter_openat,syscalls:sys_enter_close),
they are often not captured after boot, even though they appear enabled
in the tracing/set_event file.
The issue stems from how syscall events are initialized. Syscall
tracepoints require the global reference count (sys_tracepoint_refcount)
to transition from 0 to 1 to trigger the registration of the syscall
work (TIF_SYSCALL_TRACEPOINT) for tasks, including the init process (pid 1).
The current implementation of early_enable_events() with disable_first=true
used an interleaved sequence of "Disable A -> Enable A -> Disable B -> Enable B".
If multiple syscalls are enabled, the refcount never drops to zero,
preventing the 0->1 transition that triggers actual registration.
Fix this by splitting early_enable_events() into two distinct phases:
1. Disable all events specified in the buffer.
2. Enable all events specified in the buffer.
This ensures the refcount hits zero before re-enabling, allowing syscall
events to be properly activated during early boot.
The code is also refactored to use a helper function to avoid logic
duplication between the disable and enable phases.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260224023544.1250787-1-hehuiwen@kylinos.cn
Fixes: ce1039bd3a89 ("tracing: Fix enabling of syscall events on the command line")
Signed-off-by: Huiwen He <hehuiwen@kylinos.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
trace_graph_thresh_return() called handle_nosleeptime() and then delegated
to trace_graph_return(), which calls handle_nosleeptime() again. When
sleep-time accounting is disabled this double-adjusts calltime and can
produce bogus durations (including underflow).
Fix this by computing rettime once, applying handle_nosleeptime() only
once, using the adjusted calltime for threshold comparison, and writing
the return event directly via __trace_graph_return() when the threshold is
met.
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260221113314048jE4VRwIyZEALiYByGK0My@zte.com.cn
Fixes: 3c9880f3ab52b ("ftrace: Use a running sleeptime instead of saving on shadow stack")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
When tracing_thresh is enabled, function graph tracing uses
trace_graph_thresh_return() as the return handler. Unlike
trace_graph_return(), it did not clear the per-task TRACE_GRAPH_NOTRACE
flag set by the entry handler for set_graph_notrace addresses. This could
leave the task permanently in "notrace" state and effectively disable
function graph tracing for that task.
Mirror trace_graph_return()'s per-task notrace handling by clearing
TRACE_GRAPH_NOTRACE and returning early when set.
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260221113007819YgrZsMGABff4Rc-O_fZxL@zte.com.cn
Fixes: b84214890a9bc ("function_graph: Move graph notrace bit to shadow stack global var")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Ihor and Kumar reported splat from ftrace_get_addr_curr [1], which happened
because of the missing ftrace_lock in update_ftrace_direct_add/del functions
allowing concurrent access to ftrace internals.
The ftrace_update_ops function must be guarded by ftrace_lock, adding that.
Fixes: 05dc5e9c1fe1 ("ftrace: Add update_ftrace_direct_add function")
Fixes: 8d2c1233f371 ("ftrace: Add update_ftrace_direct_del function")
Reported-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Closes: https://lore.kernel.org/bpf/1b58ffb2-92ae-433a-ba46-95294d6edea2@linux.dev/
Tested-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20260302081622.165713-1-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
tracing_record_cmdline() internally uses __this_cpu_read() and
__this_cpu_write() on the per-CPU variable trace_cmdline_save, and
trace_save_cmdline() explicitly asserts preemption is disabled via
lockdep_assert_preemption_disabled(). These operations are only safe
when preemption is off, as they were designed to be called from the
scheduler context (probe_wakeup_sched_switch() / probe_wakeup()).
__blk_add_trace() was calling tracing_record_cmdline(current) early in
the blk_tracer path, before ring buffer reservation, from process
context where preemption is fully enabled. This triggers the following
using blktests/blktrace/002:
blktrace/002 (blktrace ftrace corruption with sysfs trace) [failed]
runtime 0.367s ... 0.437s
something found in dmesg:
[ 81.211018] run blktests blktrace/002 at 2026-02-25 22:24:33
[ 81.239580] null_blk: disk nullb1 created
[ 81.357294] BUG: using __this_cpu_read() in preemptible [00000000] code: dd/2516
[ 81.362842] caller is tracing_record_cmdline+0x10/0x40
[ 81.362872] CPU: 16 UID: 0 PID: 2516 Comm: dd Tainted: G N 7.0.0-rc1lblk+ #84 PREEMPT(full)
[ 81.362877] Tainted: [N]=TEST
[ 81.362878] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
[ 81.362881] Call Trace:
[ 81.362884] <TASK>
[ 81.362886] dump_stack_lvl+0x8d/0xb0
...
(See '/mnt/sda/blktests/results/nodev/blktrace/002.dmesg' for the entire message)
[ 81.211018] run blktests blktrace/002 at 2026-02-25 22:24:33
[ 81.239580] null_blk: disk nullb1 created
[ 81.357294] BUG: using __this_cpu_read() in preemptible [00000000] code: dd/2516
[ 81.362842] caller is tracing_record_cmdline+0x10/0x40
[ 81.362872] CPU: 16 UID: 0 PID: 2516 Comm: dd Tainted: G N 7.0.0-rc1lblk+ #84 PREEMPT(full)
[ 81.362877] Tainted: [N]=TEST
[ 81.362878] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
[ 81.362881] Call Trace:
[ 81.362884] <TASK>
[ 81.362886] dump_stack_lvl+0x8d/0xb0
[ 81.362895] check_preemption_disabled+0xce/0xe0
[ 81.362902] tracing_record_cmdline+0x10/0x40
[ 81.362923] __blk_add_trace+0x307/0x5d0
[ 81.362934] ? lock_acquire+0xe0/0x300
[ 81.362940] ? iov_iter_extract_pages+0x101/0xa30
[ 81.362959] blk_add_trace_bio+0x106/0x1e0
[ 81.362968] submit_bio_noacct_nocheck+0x24b/0x3a0
[ 81.362979] ? lockdep_init_map_type+0x58/0x260
[ 81.362988] submit_bio_wait+0x56/0x90
[ 81.363009] __blkdev_direct_IO_simple+0x16c/0x250
[ 81.363026] ? __pfx_submit_bio_wait_endio+0x10/0x10
[ 81.363038] ? rcu_read_lock_any_held+0x73/0xa0
[ 81.363051] blkdev_read_iter+0xc1/0x140
[ 81.363059] vfs_read+0x20b/0x330
[ 81.363083] ksys_read+0x67/0xe0
[ 81.363090] do_syscall_64+0xbf/0xf00
[ 81.363102] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 81.363106] RIP: 0033:0x7f281906029d
[ 81.363111] Code: 31 c0 e9 c6 fe ff ff 50 48 8d 3d 66 63 0a 00 e8 59 ff 01 00 66 0f 1f 84 00 00 00 00 00 80 3d 41 33 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec
[ 81.363113] RSP: 002b:00007ffca127dd48 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 81.363120] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f281906029d
[ 81.363122] RDX: 0000000000001000 RSI: 0000559f8bfae000 RDI: 0000000000000000
[ 81.363123] RBP: 0000000000001000 R08: 0000002863a10a81 R09: 00007f281915f000
[ 81.363124] R10: 00007f2818f77b60 R11: 0000000000000246 R12: 0000559f8bfae000
[ 81.363126] R13: 0000000000000000 R14: 0000000000000000 R15: 000000000000000a
[ 81.363142] </TASK>
The same BUG fires from blk_add_trace_plug(), blk_add_trace_unplug(),
and blk_add_trace_rq() paths as well.
The purpose of tracing_record_cmdline() is to cache the task->comm for
a given PID so that the trace can later resolve it. It is only
meaningful when a trace event is actually being recorded. Ring buffer
reservation via ring_buffer_lock_reserve() disables preemption, and
preemption remains disabled until the event is committed :-
__blk_add_trace()
__trace_buffer_lock_reserve()
__trace_buffer_lock_reserve()
ring_buffer_lock_reserve()
preempt_disable_notrace(); <---
With this fix blktests for blktrace pass:
blktests (master) # ./check blktrace
blktrace/001 (blktrace zone management command tracing) [passed]
runtime 3.650s ... 3.647s
blktrace/002 (blktrace ftrace corruption with sysfs trace) [passed]
runtime 0.411s ... 0.384s
Fixes: 7ffbd48d5cab ("tracing: Cache comms only after an event occurred")
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We don't check if cookies are available on the kprobe_multi link
before accessing them in show_fdinfo callback, we should.
Cc: stable@vger.kernel.org
Fixes: da7e9c0a7fbc ("bpf: Add show_fdinfo for kprobe_multi")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260225111249.186230-1-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Conversion performed via this Coccinelle script:
// SPDX-License-Identifier: GPL-2.0-only
// Options: --include-headers-for-types --all-includes --include-headers --keep-comments
virtual patch
@gfp depends on patch && !(file in "tools") && !(file in "samples")@
identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex,
kzalloc_obj,kzalloc_objs,kzalloc_flex,
kvmalloc_obj,kvmalloc_objs,kvmalloc_flex,
kvzalloc_obj,kvzalloc_objs,kvzalloc_flex};
@@
ALLOC(...
- , GFP_KERNEL
)
$ make coccicheck MODE=patch COCCI=gfp.cocci
Build and boot tested x86_64 with Fedora 42's GCC and Clang:
Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01
Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This converts some of the visually simpler cases that have been split
over multiple lines. I only did the ones that are easy to verify the
resulting diff by having just that final GFP_KERNEL argument on the next
line.
Somebody should probably do a proper coccinelle script for this, but for
me the trivial script actually resulted in an assertion failure in the
middle of the script. I probably had made it a bit _too_ trivial.
So after fighting that far a while I decided to just do some of the
syntactically simpler cases with variations of the previous 'sed'
scripts.
The more syntactically complex multi-line cases would mostly really want
whitespace cleanup anyway.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is the exact same thing as the 'alloc_obj()' version, only much
smaller because there are a lot fewer users of the *alloc_flex()
interface.
As with alloc_obj() version, this was done entirely with mindless brute
force, using the same script, except using 'flex' in the pattern rather
than 'objs*'.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix possible dereference of uninitialized pointer
When validating the persistent ring buffer on boot up, if the first
validation fails, a reference to "head_page" is performed in the
error path, but it skips over the initialization of that variable.
Move the initialization before the first validation check.
- Fix use of event length in validation of persistent ring buffer
On boot up, the persistent ring buffer is checked to see if it is
valid by several methods. One being to walk all the events in the
memory location to make sure they are all valid. The length of the
event is used to move to the next event. This length is determined by
the data in the buffer. If that length is corrupted, it could
possibly make the next event to check located at a bad memory
location.
Validate the length field of the event when doing the event walk.
- Fix function graph on archs that do not support use of ftrace_ops
When an architecture defines HAVE_DYNAMIC_FTRACE_WITH_ARGS, it means
that its function graph tracer uses the ftrace_ops of the function
tracer to call its callbacks. This allows a single registered
callback to be called directly instead of checking the callback's
meta data's hash entries against the function being traced.
For architectures that do not support this feature, it must always
call the loop function that tests each registered callback (even if
there's only one). The loop function tests each callback's meta data
against its hash of functions and will call its callback if the
function being traced is in its hash map.
The issue was that there was no check against this and the direct
function was being called even if the architecture didn't support it.
This meant that if function tracing was enabled at the same time as a
callback was registered with the function graph tracer, its callback
would be called for every function that the function tracer also
traced, even if the callback's meta data only wanted to be called
back for a small subset of functions.
Prevent the direct calling for those architectures that do not
support it.
- Fix references to trace_event_file for hist files
The hist files used event_file_data() to get a reference to the
associated trace_event_file the histogram was attached to. This would
return a pointer even if the trace_event_file is about to be freed
(via RCU). Instead it should use the event_file_file() helper that
returns NULL if the trace_event_file is marked to be freed so that no
new references are added to it.
- Wake up hist poll readers when an event is being freed
When polling on a hist file, the task is only awoken when a hist
trigger is triggered. This means that if an event is being freed
while there's a task waiting on its hist file, it will need to wait
until the hist trigger occurs to wake it up and allow the freeing to
happen. Note, the event will not be completely freed until all
references are removed, and a hist poller keeps a reference. But it
should still be woken when the event is being freed.
* tag 'trace-v7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Wake up poll waiters for hist files when removing an event
tracing: Fix checking of freed trace_event_file for hist files
fgraph: Do not call handlers direct when not using ftrace_ops
tracing: ring-buffer: Fix to check event length before using
ring-buffer: Fix possible dereference of uninitialized pointer
|
|
The event_hist_poll() function attempts to verify whether an event file is
being removed, but this check may not occur or could be unnecessarily
delayed. This happens because hist_poll_wakeup() is currently invoked only
from event_hist_trigger() when a hist command is triggered. If the event
file is being removed, no associated hist command will be triggered and a
waiter will be woken up only after an unrelated hist command is triggered.
Fix the issue by adding a call to hist_poll_wakeup() in
remove_event_file_dir() after setting the EVENT_FILE_FL_FREED flag. This
ensures that a task polling on a hist file is woken up and receives
EPOLLERR.
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://patch.msgid.link/20260219162737.314231-3-petr.pavlu@suse.com
Fixes: 1bd13edbbed6 ("tracing/hist: Add poll(POLLIN) support on hist file")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The event_hist_open() and event_hist_poll() functions currently retrieve
a trace_event_file pointer from a file struct by invoking
event_file_data(), which simply returns file->f_inode->i_private. The
functions then check if the pointer is NULL to determine whether the event
is still valid. This approach is flawed because i_private is assigned when
an eventfs inode is allocated and remains set throughout its lifetime.
Instead, the code should call event_file_file(), which checks for
EVENT_FILE_FL_FREED. Using the incorrect access function may result in the
code potentially opening a hist file for an event that is being removed or
becoming stuck while polling on this file.
Correct the access method to event_file_file() in both functions.
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://patch.msgid.link/20260219162737.314231-2-petr.pavlu@suse.com
Fixes: 1bd13edbbed6 ("tracing/hist: Add poll(POLLIN) support on hist file")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The function graph tracer was modified to us the ftrace_ops of the
function tracer. This simplified the code as well as allowed more features
of the function graph tracer.
Not all architectures were converted over as it required the
implementation of HAVE_DYNAMIC_FTRACE_WITH_ARGS to implement. For those
architectures, it still did it the old way where the function graph tracer
handle was called by the function tracer trampoline. The handler then had
to check the hash to see if the registered handlers wanted to be called by
that function or not.
In order to speed up the function graph tracer that used ftrace_ops, if
only one callback was registered with function graph, it would call its
function directly via a static call.
Now, if the architecture does not support the use of using ftrace_ops and
still has the ftrace function trampoline calling the function graph
handler, then by doing a direct call it removes the check against the
handler's hash (list of functions it wants callbacks to), and it may call
that handler for functions that the handler did not request calls for.
On 32bit x86, which does not support the ftrace_ops use with function
graph tracer, it shows the issue:
~# trace-cmd start -p function -l schedule
~# trace-cmd show
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
2) * 11898.94 us | schedule();
3) # 1783.041 us | schedule();
1) | schedule() {
------------------------------------------
1) bash-8369 => kworker-7669
------------------------------------------
1) | schedule() {
------------------------------------------
1) kworker-7669 => bash-8369
------------------------------------------
1) + 97.004 us | }
1) | schedule() {
[..]
Now by starting the function tracer is another instance:
~# trace-cmd start -B foo -p function
This causes the function graph tracer to trace all functions (because the
function trace calls the function graph tracer for each on, and the
function graph trace is doing a direct call):
~# trace-cmd show
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
1) 1.669 us | } /* preempt_count_sub */
1) + 10.443 us | } /* _raw_spin_unlock_irqrestore */
1) | tick_program_event() {
1) | clockevents_program_event() {
1) 1.044 us | ktime_get();
1) 6.481 us | lapic_next_event();
1) + 10.114 us | }
1) + 11.790 us | }
1) ! 181.223 us | } /* hrtimer_interrupt */
1) ! 184.624 us | } /* __sysvec_apic_timer_interrupt */
1) | irq_exit_rcu() {
1) 0.678 us | preempt_count_sub();
When it should still only be tracing the schedule() function.
To fix this, add a macro FGRAPH_NO_DIRECT to be set to 0 when the
architecture does not support function graph use of ftrace_ops, and set to
1 otherwise. Then use this macro to know to allow function graph tracer to
call the handlers directly or not.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://patch.msgid.link/20260218104244.5f14dade@gandalf.local.home
Fixes: cc60ee813b503 ("function_graph: Use static_call and branch to optimize entry function")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Check the event length before adding it for accessing next index in
rb_read_data_buffer(). Since this function is used for validating
possibly broken ring buffers, the length of the event could be broken.
In that case, the new event (e + len) can point a wrong address.
To avoid invalid memory access at boot, check whether the length of
each event is in the possible range before using it.
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 5f3b6e839f3c ("ring-buffer: Validate boot range memory events")
Link: https://patch.msgid.link/177123421541.142205.9414352170164678966.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
There is a pointer head_page in rb_meta_validate_events() which is not
initialized at the beginning of a function. This pointer can be dereferenced
if there is a failure during reader page validation. In this case the control
is passed to "invalid" label where the pointer is dereferenced in a loop.
To fix the issue initialize orig_head and head_page before calling
rb_validate_buffer.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Cc: stable@vger.kernel.org
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://patch.msgid.link/20260213100130.2013839-1-d.dulov@aladdin.ru
Closes: https://lore.kernel.org/r/202406130130.JtTGRf7W-lkp@intel.com/
Fixes: 5f3b6e839f3c ("ring-buffer: Validate boot range memory events")
Signed-off-by: Daniil Dulov <d.dulov@aladdin.ru>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull more block updates from Jens Axboe:
- Fix partial IOVA mapping cleanup in error handling
- Minor prep series ignoring discard return value, as
the inline value is always known
- Ensure BLK_FEAT_STABLE_WRITES is set for drbd
- Fix leak of folio in bio_iov_iter_bounce_read()
- Allow IOC_PR_READ_* for read-only open
- Another debugfs deadlock fix
- A few doc updates
* tag 'block-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
blk-mq: use NOIO context to prevent deadlock during debugfs creation
blk-stat: convert struct blk_stat_callback to kernel-doc
block: fix enum descriptions kernel-doc
block: update docs for bio and bvec_iter
block: change return type to void
nvmet: ignore discard return value
md: ignore discard return value
block: fix partial IOVA mapping cleanup in blk_rq_dma_map_iova
block: fix folio leak in bio_iov_iter_bounce_read()
block: allow IOC_PR_READ_* ioctls with BLK_OPEN_READ
drbd: always set BLK_FEAT_STABLE_WRITES
|
|
Creating debugfs entries can trigger fs reclaim, which can enter back
into the block layer request_queue. This can cause deadlock if the
queue is frozen.
Previously, a WARN_ON_ONCE check was used in debugfs_create_files()
to detect this condition, but it was racy since the queue can be frozen
from another context at any time.
Introduce blk_debugfs_lock()/blk_debugfs_unlock() helpers that combine
the debugfs_mutex with memalloc_noio_save()/restore() to prevent fs
reclaim from triggering block I/O. Also add blk_debugfs_lock_nomemsave()
and blk_debugfs_unlock_nomemrestore() variants for callers that don't
need NOIO protection (e.g., debugfs removal or read-only operations).
Replace all raw debugfs_mutex lock/unlock pairs with these helpers,
using the _nomemsave/_nomemrestore variants where appropriate.
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/all/CAHj4cs9gNKEYAPagD9JADfO5UH+OiCr4P7OO2wjpfOYeM-RV=A@mail.gmail.com/
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Closes: https://lore.kernel.org/all/aYWQR7CtYdk3K39g@shinmob/
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|