summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)AuthorLines
2024-06-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski-21/+33
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: drivers/net/ethernet/pensando/ionic/ionic_txrx.c d9c04209990b ("ionic: Mark error paths in the data path as unlikely") 491aee894a08 ("ionic: fix kernel panic in XDP_TX action") net/ipv6/ip6_fib.c b4cb4a1391dc ("net: use unrcu_pointer() helper") b01e1c030770 ("ipv6: fix possible race in __fib6_drop_pcpu_from()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-06rcu/tasks: Fix stale task snaphot for Tasks TraceFrederic Weisbecker-7/+17
When RCU-TASKS-TRACE pre-gp takes a snapshot of the current task running on all online CPUs, no explicit ordering synchronizes properly with a context switch. This lack of ordering can permit the new task to miss pre-grace-period update-side accesses. The following diagram, courtesy of Paul, shows the possible bad scenario: CPU 0 CPU 1 ----- ----- // Pre-GP update side access WRITE_ONCE(*X, 1); smp_mb(); r0 = rq->curr; RCU_INIT_POINTER(rq->curr, TASK_B) spin_unlock(rq) rcu_read_lock_trace() r1 = X; /* ignore TASK_B */ Either r0==TASK_B or r1==1 is needed but neither is guaranteed. One possible solution to solve this is to wait for an RCU grace period at the beginning of the RCU-tasks-trace grace period before taking the current tasks snaphot. However this would introduce large additional latencies to RCU-tasks-trace grace periods. Another solution is to lock the target runqueue while taking the current task snapshot. This ensures that the update side sees the latest context switch and subsequent context switches will see the pre-grace-period update side accesses. This commit therefore adds runqueue locking to cpu_curr_snapshot(). Fixes: e386b6725798 ("rcu-tasks: Eliminate RCU Tasks Trace IPIs to online CPUs") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2024-06-06rcutorture: Add missing MODULE_DESCRIPTION() macrosJeff Johnson-0/+3
Fix the following 'make W=1' warnings: WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/rcu/rcutorture.o WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/rcu/rcuscale.o WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/rcu/refscale.o Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2024-06-06rcutorture: Fix rcu_torture_fwd_cb_cr() data racePaul E. McKenney-1/+1
On powerpc systems, spinlock acquisition does not order prior stores against later loads. This means that this statement: rfcp->rfc_next = NULL; Can be reordered to follow this statement: WRITE_ONCE(*rfcpp, rfcp); Which is then a data race with rcu_torture_fwd_prog_cr(), specifically, this statement: rfcpn = READ_ONCE(rfcp->rfc_next) KCSAN located this data race, which represents a real failure on powerpc. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Marco Elver <elver@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: <kasan-dev@googlegroups.com>
2024-06-06kcsan: Add missing MODULE_DESCRIPTION() macroJeff Johnson-0/+1
Fix the warning reported by 'make C=1 W=1': WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/kcsan/kcsan_test.o Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Reviewed-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2024-06-06Merge tag 'net-6.10-rc3' of ↵Linus Torvalds-10/+10
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from BPF and big collection of fixes for WiFi core and drivers. Current release - regressions: - vxlan: fix regression when dropping packets due to invalid src addresses - bpf: fix a potential use-after-free in bpf_link_free() - xdp: revert support for redirect to any xsk socket bound to the same UMEM as it can result in a corruption - virtio_net: - add missing lock protection when reading return code from control_buf - fix false-positive lockdep splat in DIM - Revert "wifi: wilc1000: convert list management to RCU" - wifi: ath11k: fix error path in ath11k_pcic_ext_irq_config Previous releases - regressions: - rtnetlink: make the "split" NLM_DONE handling generic, restore the old behavior for two cases where we started coalescing those messages with normal messages, breaking sloppily-coded userspace - wifi: - cfg80211: validate HE operation element parsing - cfg80211: fix 6 GHz scan request building - mt76: mt7615: add missing chanctx ops - ath11k: move power type check to ASSOC stage, fix connecting to 6 GHz AP - ath11k: fix WCN6750 firmware crash caused by 17 num_vdevs - rtlwifi: ignore IEEE80211_CONF_CHANGE_RETRY_LIMITS - iwlwifi: mvm: fix a crash on 7265 Previous releases - always broken: - ncsi: prevent multi-threaded channel probing, a spec violation - vmxnet3: disable rx data ring on dma allocation failure - ethtool: init tsinfo stats if requested, prevent unintentionally reporting all-zero stats on devices which don't implement any - dst_cache: fix possible races in less common IPv6 features - tcp: auth: don't consider TCP_CLOSE to be in TCP_AO_ESTABLISHED - ax25: fix two refcounting bugs - eth: ionic: fix kernel panic in XDP_TX action Misc: - tcp: count CLOSE-WAIT sockets for TCP_MIB_CURRESTAB" * tag 'net-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (107 commits) selftests: net: lib: set 'i' as local selftests: net: lib: avoid error removing empty netns name selftests: net: lib: support errexit with busywait net: ethtool: fix the error condition in ethtool_get_phy_stats_ethtool() ipv6: fix possible race in __fib6_drop_pcpu_from() af_unix: Annotate data-race of sk->sk_shutdown in sk_diag_fill(). af_unix: Use skb_queue_len_lockless() in sk_diag_show_rqlen(). af_unix: Use skb_queue_empty_lockless() in unix_release_sock(). af_unix: Use unix_recvq_full_lockless() in unix_stream_connect(). af_unix: Annotate data-race of net->unx.sysctl_max_dgram_qlen. af_unix: Annotate data-races around sk->sk_sndbuf. af_unix: Annotate data-races around sk->sk_state in UNIX_DIAG. af_unix: Annotate data-race of sk->sk_state in unix_stream_read_skb(). af_unix: Annotate data-races around sk->sk_state in sendmsg() and recvmsg(). af_unix: Annotate data-race of sk->sk_state in unix_accept(). af_unix: Annotate data-race of sk->sk_state in unix_stream_connect(). af_unix: Annotate data-races around sk->sk_state in unix_write_space() and poll(). af_unix: Annotate data-race of sk->sk_state in unix_inq_len(). af_unix: Annodate data-races around sk->sk_state for writers. af_unix: Set sk->sk_state under unix_state_lock() for truly disconencted peer. ...
2024-06-06ftrace: Rename dup_hash() and comment itSteven Rostedt (Google)-2/+6
The name "dup_hash()" is a misnomer as it does not duplicate the hash that is passed in, but instead moves its entities from that hash to a newly allocated one. Rename it to "__move_hash()" (using starting underscores as it is an internal function), and add some comments about what it does. Link: https://lore.kernel.org/linux-trace-kernel/20240605180408.537723591@goodmis.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-06tracing: Fix trace_pid_list_free() kernel-docJeff Johnson-0/+1
make C=1 reports: kernel/trace/pid_list.c:458: warning: Function parameter or struct member 'pid_list' not described in 'trace_pid_list_free' Add the missing parameter to the trace_pid_list_free() kernel-doc. Link: https://lore.kernel.org/linux-trace-kernel/20240506-trace_pid_list_free-kdoc-v1-1-c70f0ae29144@quicinc.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-05ftrace: Add back ftrace_update_trampoline() to ftrace_update_pid_func()Steven Rostedt (Google)-0/+1
The update to the ops trampoline done by the function ftrace_update_trampoline() was accidentally removed from ftrace_update_pid_func(). Add it back. Link: https://lore.kernel.org/linux-trace-kernel/20240605205337.6115e9a5@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: df3ec5da6a1e ("function_graph: Add pid tracing back to function graph tracer") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-05irqdomain: Add missing parameter descriptions in kernel-doc commentsHerve Codina-0/+27
During compilation, several warning of the following form were raised: Function parameter or struct member 'x' not described in 'yyy' Add the missing function parameter descriptions. Signed-off-by: Herve Codina <herve.codina@bootlin.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240527161450.326615-10-herve.codina@bootlin.com
2024-06-05sched/balance: Skip unnecessary updates to idle load balancer's flagsTim Chen-0/+7
We observed that the overhead on trigger_load_balance(), now renamed sched_balance_trigger(), has risen with a system's core counts. For an OLTP workload running 6.8 kernel on a 2 socket x86 systems having 96 cores/socket, we saw that 0.7% cpu cycles are spent in trigger_load_balance(). On older systems with fewer cores/socket, this function's overhead was less than 0.1%. The cause of this overhead was that there are multiple cpus calling kick_ilb(flags), updating the balancing work needed to a common idle load balancer cpu. The ilb_cpu's flags field got updated unconditionally with atomic_fetch_or(). The atomic read and writes to ilb_cpu's flags causes much cache bouncing and cpu cycles overhead. This is seen in the annotated profile below. kick_ilb(): if (ilb_cpu < 0) test %r14d,%r14d ↑ js 6c flags = atomic_fetch_or(flags, nohz_flags(ilb_cpu)); mov $0x2d600,%rdi movslq %r14d,%r8 mov %rdi,%rdx add -0x7dd0c3e0(,%r8,8),%rdx arch_atomic_read(): 0.01 mov 0x64(%rdx),%esi 35.58 add $0x64,%rdx arch_atomic_fetch_or(): static __always_inline int arch_atomic_fetch_or(int i, atomic_t *v) { int val = arch_atomic_read(v); do { } while (!arch_atomic_try_cmpxchg(v, &val, val | i)); 0.03 157: mov %r12d,%ecx arch_atomic_try_cmpxchg(): return arch_try_cmpxchg(&v->counter, old, new); 0.00 mov %esi,%eax arch_atomic_fetch_or(): do { } while (!arch_atomic_try_cmpxchg(v, &val, val | i)); or %esi,%ecx arch_atomic_try_cmpxchg(): return arch_try_cmpxchg(&v->counter, old, new); 0.01 lock cmpxchg %ecx,(%rdx) 42.96 ↓ jne 2d2 kick_ilb(): With instrumentation, we found that 81% of the updates do not result in any change in the ilb_cpu's flags. That is, multiple cpus are asking the ilb_cpu to do the same things over and over again, before the ilb_cpu has a chance to run NOHZ load balance. Skip updates to ilb_cpu's flags if no new work needs to be done. Such updates do not change ilb_cpu's NOHZ flags. This requires an extra atomic read but it is less expensive than frequent unnecessary atomic updates that generate cache bounces. We saw that on the OLTP workload, cpu cycles from trigger_load_balance() (or sched_balance_trigger()) got reduced from 0.7% to 0.2%. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chen Yu <yu.c.chen@intel.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240531205452.65781-1-tim.c.chen@linux.intel.com
2024-06-05idle: Remove stale RCU commentChristian Loehle-6/+0
The call of rcu_idle_enter() from within cpuidle_idle_call() was removed in commit 1098582a0f6c ("sched,idle,rcu: Push rcu_idle deeper into the idle path") which makes the comment out of place. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/5b936388-47df-4050-9229-6617a6c2bba5@arm.com
2024-06-05perf/core: Fix missing wakeup when waiting for context referenceHaifeng Xu-0/+13
In our production environment, we found many hung tasks which are blocked for more than 18 hours. Their call traces are like this: [346278.191038] __schedule+0x2d8/0x890 [346278.191046] schedule+0x4e/0xb0 [346278.191049] perf_event_free_task+0x220/0x270 [346278.191056] ? init_wait_var_entry+0x50/0x50 [346278.191060] copy_process+0x663/0x18d0 [346278.191068] kernel_clone+0x9d/0x3d0 [346278.191072] __do_sys_clone+0x5d/0x80 [346278.191076] __x64_sys_clone+0x25/0x30 [346278.191079] do_syscall_64+0x5c/0xc0 [346278.191083] ? syscall_exit_to_user_mode+0x27/0x50 [346278.191086] ? do_syscall_64+0x69/0xc0 [346278.191088] ? irqentry_exit_to_user_mode+0x9/0x20 [346278.191092] ? irqentry_exit+0x19/0x30 [346278.191095] ? exc_page_fault+0x89/0x160 [346278.191097] ? asm_exc_page_fault+0x8/0x30 [346278.191102] entry_SYSCALL_64_after_hwframe+0x44/0xae The task was waiting for the refcount become to 1, but from the vmcore, we found the refcount has already been 1. It seems that the task didn't get woken up by perf_event_release_kernel() and got stuck forever. The below scenario may cause the problem. Thread A Thread B ... ... perf_event_free_task perf_event_release_kernel ... acquire event->child_mutex ... get_ctx ... release event->child_mutex acquire ctx->mutex ... perf_free_event (acquire/release event->child_mutex) ... release ctx->mutex wait_var_event acquire ctx->mutex acquire event->child_mutex # move existing events to free_list release event->child_mutex release ctx->mutex put_ctx ... ... In this case, all events of the ctx have been freed, so we couldn't find the ctx in free_list and Thread A will miss the wakeup. It's thus necessary to add a wakeup after dropping the reference. Fixes: 1cf8dfe8a661 ("perf/core: Fix race between close() and fork()") Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20240513103948.33570-1-haifeng.xu@shopee.com
2024-06-05sched/headers: Move struct pre-declarations to the beginning of the headerIngo Molnar-10/+6
There's a random number of structure pre-declaration lines in kernel/sched/sched.h, some of which are unnecessary duplicates. Move them to the head & order them a bit for readability. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org
2024-06-05sched/core: Clean up kernel/sched/sched.h a bitIngo Molnar-132/+180
- Fix whitespace noise - Fix col80 linebreak damage where possible - Apply CodingStyle consistently - Use consistent #else and #endif comments - Use consistent vertical alignment - Use 'extern' consistently Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org
2024-06-05sched/core: Simplify prefetch_curr_exec_start()Ingo Molnar-2/+2
Remove unnecessary use of the address operator. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org
2024-06-04function_graph: Use static_call and branch to optimize return functionSteven Rostedt (Google)-5/+18
In most cases function graph is used by a single user. Instead of calling a loop to call function graph callbacks in this case, call the function return callback directly. Use the static_key that is set when the function graph tracer has less than 2 callbacks registered. It will do the direct call in that case, and will do the loop over all callers when there are 2 or more callbacks registered. Link: https://lore.kernel.org/linux-trace-kernel/20240603190824.921460797@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Guo Ren <guoren@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-04function_graph: Use static_call and branch to optimize entry functionSteven Rostedt (Google)-11/+66
In most cases function graph is used by a single user. Instead of calling a loop to call function graph callbacks in this case, call the function entry callback directly. Add a static_key that will be used to set the function graph logic to either do the loop (when more than one callback is registered) or to call the callback directly if there is only one registered callback. Link: https://lore.kernel.org/linux-trace-kernel/20240603190824.766858241@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Guo Ren <guoren@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-04function_graph: Use bitmask to loop on fgraph entrySteven Rostedt (Google)-1/+7
Instead of looping through all the elements of fgraph_array[] to see if there's an gops attached to one and then calling its gops->func(). Create a fgraph_array_bitmask that sets bits when an index in the array is reserved (via the simple lru algorithm). Then only the bits set in this bitmask needs to be looked at where only elements in the array that have ops registered need to be looked at. Note, we do not care about races. If a bit is set before the gops is assigned, it only wastes time looking at the element and ignoring it (as it did before this bitmask is added). Link: https://lore.kernel.org/linux-trace-kernel/20240603190824.604448781@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Guo Ren <guoren@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-04function_graph: Use for_each_set_bit() in __ftrace_return_to_handler()Steven Rostedt (Google)-3/+2
Instead of iterating through the entire fgraph_array[] and seeing if one of the bitmap bits are set to know to call the array's retfunc() function, use for_each_set_bit() on the bitmap itself. This will only iterate for the number of set bits. Link: https://lore.kernel.org/linux-trace-kernel/20240603190824.447448026@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Guo Ren <guoren@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-04ftrace: Add multiple fgraph storage selftestMasami Hiramatsu (Google)-45/+126
Add a selftest for multiple function graph tracer with storage on a same function. In this case, the shadow stack entry will be shared among those fgraph with different data storage. So this will ensure the fgraph will not mixed those storage data. Link: https://lore.kernel.org/linux-trace-kernel/171509111465.162236.3795819216426570800.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/20240603190824.284049716@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Guo Ren <guoren@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-04function_graph: Add selftest for passing local variablesSteven Rostedt (VMware)-0/+169
Add boot up selftest that passes variables from a function entry to a function exit, and make sure that they do get passed around. Co-developed with Masami Hiramatsu: Link: https://lore.kernel.org/linux-trace-kernel/171509110271.162236.11047551496319744627.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/20240603190824.122952310@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Guo Ren <guoren@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-04function_graph: Implement fgraph_reserve_data() and fgraph_retrieve_data()Steven Rostedt (VMware)-8/+183
Added functions that can be called by a fgraph_ops entryfunc and retfunc to store state between the entry of the function being traced to the exit of the same function. The fgraph_ops entryfunc() may call fgraph_reserve_data() to store up to 32 words onto the task's shadow ret_stack and this then can be retrieved by fgraph_retrieve_data() called by the corresponding retfunc(). Co-developed with Masami Hiramatsu: Link: https://lore.kernel.org/linux-trace-kernel/171509109089.162236.11372474169781184034.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/20240603190823.959703050@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Guo Ren <guoren@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-04function_graph: Move graph notrace bit to shadow stack global varSteven Rostedt (VMware)-4/+15
The use of the task->trace_recursion for the logic used for the function graph no-trace was a bit of an abuse of that variable. Now that there exists global vars that are per stack for registered graph traces, use that instead. Link: https://lore.kernel.org/linux-trace-kernel/171509107907.162236.6564679266777519065.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/20240603190823.796709456@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Guo Ren <guoren@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-04function_graph: Move graph depth stored data to shadow stack global varSteven Rostedt (VMware)-2/+32
The use of the task->trace_recursion for the logic used for the function graph depth was a bit of an abuse of that variable. Now that there exists global vars that are per stack for registered graph traces, use that instead. Link: https://lore.kernel.org/linux-trace-kernel/171509106728.162236.2398372644430125344.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/20240603190823.634870264@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Guo Ren <guoren@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-04function_graph: Move set_graph_function tests to shadow stack global varSteven Rostedt (VMware)-18/+28
The use of the task->trace_recursion for the logic used for the set_graph_function was a bit of an abuse of that variable. Now that there exists global vars that are per stack for registered graph traces, use that instead. Link: https://lore.kernel.org/linux-trace-kernel/171509105520.162236.10339831553995971290.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/20240603190823.472955399@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Guo Ren <guoren@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-04function_graph: Add "task variables" per task for fgraph_opsSteven Rostedt (VMware)-1/+73
Add a "task variables" array on the tasks shadow ret_stack that is the size of longs for each possible registered fgraph_ops. That's a total of 16, taking up 8 * 16 = 128 bytes (out of a page size 4k). This will allow for fgraph_ops to do specific features on a per task basis having a way to maintain state for each task. Co-developed with Masami Hiramatsu: Link: https://lore.kernel.org/linux-trace-kernel/171509104383.162236.12239656156685718550.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/20240603190823.308806126@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Guo Ren <guoren@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-04function_graph: Use a simple LRU for fgraph_array index numberMasami Hiramatsu (Google)-21/+50
Since the fgraph_array index is used for the bitmap on the shadow stack, it may leave some entries after a function_graph instance is removed. Thus if another instance reuses the fgraph_array index soon after releasing it, the fgraph may confuse to call the newer callback for the entries which are pushed by the older instance. To avoid reusing the fgraph_array index soon after releasing, introduce a simple LRU table for managing the index number. This will reduce the possibility of this confusion. Link: https://lore.kernel.org/linux-trace-kernel/171509103267.162236.6885097397289135378.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/20240603190823.147421545@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Guo Ren <guoren@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-04function_graph: Add pid tracing back to function graph tracerSteven Rostedt (Google)-2/+45
Now that the function_graph has a main callback that handles the function graph subops tracing, it no longer honors the pid filtering of ftrace. Add back this logic in the function_graph code to update the gops callback for the entry function to test if it should trace the current task or not. Link: https://lore.kernel.org/linux-trace-kernel/20240603190822.991720703@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Guo Ren <guoren@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-04function_graph: Have the instances use their own ftrace_ops for filteringSteven Rostedt (VMware)-41/+67
Allow for instances to have their own ftrace_ops part of the fgraph_ops that makes the funtion_graph tracer filter on the set_ftrace_filter file of the instance and not the top instance. This uses the new ftrace_startup_subops(), by using graph_ops as the "manager ops" that defines the callback function and adds the functions defined by the filters of the ops for each trace instance. The callback defined by the manager ops will call the registered fgraph ops that were added to the fgraph_array. Co-developed with Masami Hiramatsu: Link: https://lore.kernel.org/linux-trace-kernel/171509102088.162236.15758883237657317789.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/20240603190822.832946261@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Guo Ren <guoren@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-04ftrace: Allow subops filtering to be modifiedSteven Rostedt (Google)-28/+113
The subops filters use a "manager" ops to enable and disable its filters. The manager ops can handle more than one subops, and its filter is what controls what functions get set. Add a ftrace_hash_move_and_update_subops() function that will update the manager ops when the subops filters change. Link: https://lore.kernel.org/linux-trace-kernel/20240603190822.673932251@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Guo Ren <guoren@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-04ftrace: Add subops logic to allow one ops to manage manySteven Rostedt (Google)-2/+404
There are cases where a single system will use a single function callback to handle multiple users. For example, to allow function_graph tracer to have multiple users where each can trace their own set of functions, it is useful to only have one ftrace_ops registered to ftrace that will call a function by the function_graph tracer to handle the multiplexing with the different registered function_graph tracers. Add a "subop_list" to the ftrace_ops that will hold a list of other ftrace_ops that the top ftrace_ops will manage. The function ftrace_startup_subops() that takes the manager ftrace_ops and a subop ftrace_ops it will manage. If there are no subops with the ftrace_ops yet, it will copy the ftrace_ops subop filters to the manager ftrace_ops and register that with ftrace_startup(), and adds the subop to its subop_list. If the manager ops already has something registered, it will then merge the new subop filters with what it has and enable the new functions that covers all the subops it has. To remove a subop, ftrace_shutdown_subops() is called which will use the subop_list of the manager ops to rebuild all the functions it needs to trace, and update the ftrace records to only call the functions it now has registered. If there are no more functions registered, it will then call ftrace_shutdown() to disable itself completely. Note, it is up to the manager ops callback to always make sure that the subops callbacks are called if its filter matches, as there are times in the update where the callback could be calling more functions than those that are currently registered. This could be updated to handle other systems other than function_graph, for example, fprobes could use this (but will need an interface to call ftrace_startup_subops()). Link: https://lore.kernel.org/linux-trace-kernel/20240603190822.508431129@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Guo Ren <guoren@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-04ftrace: Allow function_graph tracer to be enabled in instancesSteven Rostedt (VMware)-28/+63
Now that function graph tracing can handle more than one user, allow it to be enabled in the ftrace instances. Note, the filtering of the functions is still joined by the top level set_ftrace_filter and friends, as well as the graph and nograph files. Co-developed with Masami Hiramatsu: Link: https://lore.kernel.org/linux-trace-kernel/171509099743.162236.1699959255446248163.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/20240603190822.190630762@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Guo Ren <guoren@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-04ftrace/function_graph: Pass fgraph_ops to function graph callbacksSteven Rostedt (VMware)-21/+33
Pass the fgraph_ops structure to the function graph callbacks. This will allow callbacks to add a descriptor to a fgraph_ops private field that wil be added in the future and use it for the callbacks. This will be useful when more than one callback can be registered to the function graph tracer. Co-developed with Masami Hiramatsu: Link: https://lore.kernel.org/linux-trace-kernel/171509098588.162236.4787930115997357578.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/20240603190822.035147698@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Guo Ren <guoren@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-04function_graph: Remove logic around ftrace_graph_entry and returnSteven Rostedt (VMware)-56/+15
The function pointers ftrace_graph_entry and ftrace_graph_return are no longer called via the function_graph tracer. Instead, an array structure is now used that will allow for multiple users of the function_graph infrastructure. The variables are still used by the architecture code for non dynamic ftrace configs, where a test is made against them to see if they point to the default stub function or not. This is how the static function tracing knows to call into the function graph tracer infrastructure or not. Two new stub functions are made. entry_run() and return_run(). The ftrace_graph_entry and ftrace_graph_return are set to them respectively when the function graph tracer is enabled, and this will trigger the architecture specific function graph code to be executed. This also requires checking the global_ops hash for all calls into the function_graph tracer. Co-developed with Masami Hiramatsu: Link: https://lore.kernel.org/linux-trace-kernel/171509097408.162236.17387844142114638932.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/20240603190821.872127216@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Guo Ren <guoren@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-04function_graph: Handle tail calls for stack unwindingMasami Hiramatsu (Google)-3/+16
For the tail-call, there would be 2 or more ftrace_ret_stacks on the ret_stack, which records "return_to_handler" as the return address except for the last one. But on the real stack, there should be 1 entry because tail-call reuses the return address on the stack and jump to the next function. In ftrace_graph_ret_addr() that is used for stack unwinding, skip tail calls as a real stack unwinder would do. Link: https://lore.kernel.org/linux-trace-kernel/171509096221.162236.8806372072523195752.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/20240603190821.717065217@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Guo Ren <guoren@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-04function_graph: Allow multiple users to attach to function graphSteven Rostedt (VMware)-77/+302
Allow for multiple users to attach to function graph tracer at the same time. Only 16 simultaneous users can attach to the tracer. This is because there's an array that stores the pointers to the attached fgraph_ops. When a function being traced is entered, each of the ftrace_ops entryfunc is called and if it returns non zero, its index into the array will be added to the shadow stack. On exit of the function being traced, the shadow stack will contain the indexes of the ftrace_ops on the array that want their retfunc to be called. Because a function may sleep for a long time (if a task sleeps itself), the return of the function may be literally days later. If the ftrace_ops is removed, its place on the array is replaced with a ftrace_ops that contains the stub functions and that will be called when the function finally returns. If another ftrace_ops is added that happens to get the same index into the array, its return function may be called. But that's actually the way things current work with the old function graph tracer. If one tracer is removed and another is added, the new one will get the return calls of the function traced by the previous one, thus this is not a regression. This can be fixed by adding a counter to each time the array item is updated and save that on the shadow stack as well, such that it won't be called if the index saved does not match the index on the array. Note, being able to filter functions when both are called is not completely handled yet, but that shouldn't be too hard to manage. Co-developed with Masami Hiramatsu: Link: https://lore.kernel.org/linux-trace-kernel/171509096221.162236.8806372072523195752.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/20240603190821.555493396@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Guo Ren <guoren@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-04function_graph: Add an array structure that will allow multiple callbacksSteven Rostedt (VMware)-33/+81
Add an array structure that will eventually allow the function graph tracer to have up to 16 simultaneous callbacks attached. It's an array of 16 fgraph_ops pointers, that is assigned when one is registered. On entry of a function the entry of the first item in the array is called, and if it returns zero, then the callback returns non zero if it wants the return callback to be called on exit of the function. The array will simplify the process of having more than one callback attached to the same function, as its index into the array can be stored on the shadow stack. We need to only save the index, because this will allow the fgraph_ops to be freed before the function returns (which may happen if the function call schedule for a long time). Co-developed with Masami Hiramatsu: Link: https://lore.kernel.org/linux-trace-kernel/171509095075.162236.8272148192748284581.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/20240603190821.392113213@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Guo Ren <guoren@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-04fgraph: Use BUILD_BUG_ON() to make sure we have structures divisible by longSteven Rostedt (VMware)-1/+5
Instead of using "ALIGN()", use BUILD_BUG_ON() as the structures should always be divisible by sizeof(long). Co-developed with Masami Hiramatsu: Link: https://lore.kernel.org/linux-trace-kernel/171509093949.162236.14518699447151894536.stgit@devnote2 Link: http://lkml.kernel.org/r/20190524111144.GI2589@hirez.programming.kicks-ass.net Link: https://lore.kernel.org/linux-trace-kernel/20240603190821.232168933@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Guo Ren <guoren@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-04function_graph: Convert ret_stack to a series of longsSteven Rostedt (VMware)-54/+82
In order to make it possible to have multiple callbacks registered with the function_graph tracer, the retstack needs to be converted from an array of ftrace_ret_stack structures to an array of longs. This will allow to store the list of callbacks on the stack for the return side of the functions. Link: https://lore.kernel.org/linux-trace-kernel/171509092742.162236.4427737821399314856.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/20240603190821.073111754@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Guo Ren <guoren@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-03bpf: limit the number of levels of a nested struct type.Kui-Feng Lee-11/+19
Limit the number of levels looking into struct types to avoid running out of stack space. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240523174202.461236-7-thinker.li@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-03bpf: look into the types of the fields of a struct type recursively.Kui-Feng Lee-23/+77
The verifier has field information for specific special types, such as kptr, rbtree root, and list head. These types are handled differently. However, we did not previously examine the types of fields of a struct type variable. Field information records were not generated for the kptrs, rbtree roots, and linked_list heads that are not located at the outermost struct type of a variable. For example, struct A { struct task_struct __kptr * task; }; struct B { struct A mem_a; } struct B var_b; It did not examine "struct A" so as not to generate field information for the kptr in "struct A" for "var_b". This patch enables BPF programs to define fields of these special types in a struct type other than the direct type of a variable or in a struct type that is the type of a field in the value type of a map. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240523174202.461236-6-thinker.li@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-03bpf: create repeated fields for arrays.Kui-Feng Lee-4/+58
The verifier uses field information for certain special types, such as kptr, rbtree root, and list head. These types are treated differently. However, we did not previously support these types in arrays. This update examines arrays and duplicates field information the same number of times as the length of the array if the element type is one of the special types. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240523174202.461236-5-thinker.li@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-03bpf: refactor btf_find_struct_field() and btf_find_datasec_var().Kui-Feng Lee-101/+79
Move common code of the two functions to btf_find_field_one(). Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240523174202.461236-4-thinker.li@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-03bpf: Remove unnecessary call to btf_field_type_size().Kui-Feng Lee-2/+2
field->size has been initialized by bpf_parse_fields() with the value returned by btf_field_type_size(). Use it instead of calling btf_field_type_size() again. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240523174202.461236-3-thinker.li@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-03bpf: Remove unnecessary checks on the offset of btf_field.Kui-Feng Lee-1/+1
reg_find_field_offset() always return a btf_field with a matching offset value. Checking the offset of the returned btf_field is unnecessary. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240523174202.461236-2-thinker.li@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-03rcutorture: Make rcutorture support srcu double call testZqiang-20/+26
This commit allows rcutorture to test double-call_srcu() when the CONFIG_DEBUG_OBJECTS_RCU_HEAD Kconfig option is enabled. The non-raw sdp structure's ->spinlock will be acquired in call_srcu(), hence this commit also removes the current IRQ and preemption disabling so as to avoid lockdep complaints. Link: https://lore.kernel.org/all/20240407112714.24460-1-qiang.zhang1211@gmail.com/ Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2024-06-03Revert "rcu-tasks: Fix synchronize_rcu_tasks() VS zap_pid_ns_processes()"Frederic Weisbecker-30/+3
This reverts commit 28319d6dc5e2ffefa452c2377dd0f71621b5bff0. The race it fixed was subject to conditions that don't exist anymore since: 1612160b9127 ("rcu-tasks: Eliminate deadlocks involving do_exit() and RCU tasks") This latter commit removes the use of SRCU that used to cover the RCU-tasks blind spot on exit between the tasklist's removal and the final preemption disabling. The task is now placed instead into a temporary list inside which voluntary sleeps are accounted as RCU-tasks quiescent states. This would disarm the deadlock initially reported against PID namespace exit. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2024-06-03rcu/nocb: Remove buggy bypass lock contention mitigationFrederic Weisbecker-27/+6
The bypass lock contention mitigation assumes there can be at most 2 contenders on the bypass lock, following this scheme: 1) One kthread takes the bypass lock 2) Another one spins on it and increment the contended counter 3) A third one (a bypass enqueuer) sees the contended counter on and busy loops waiting on it to decrement. However this assumption is wrong. There can be only one CPU to find the lock contended because call_rcu() (the bypass enqueuer) is the only bypass lock acquire site that may not already hold the NOCB lock beforehand, all the other sites must first contend on the NOCB lock. Therefore step 2) is impossible. The other problem is that the mitigation assumes that contenders all belong to the same rdp CPU, which is also impossible for a raw spinlock. In theory the warning could trigger if the enqueuer holds the bypass lock and another CPU flushes the bypass queue concurrently but this is prevented from all flush users: 1) NOCB kthreads only flush if they successfully _tried_ to lock the bypass lock. So no contention management here. 2) Flush on callbacks migration happen remotely when the CPU is offline. No concurrency against bypass enqueue. 3) Flush on deoffloading happen either locally with IRQs disabled or remotely when the CPU is not yet online. No concurrency against bypass enqueue. 4) Flush on barrier entrain happen either locally with IRQs disabled or remotely when the CPU is offline. No concurrency against bypass enqueue. For those reasons, the bypass lock contention mitigation isn't needed and is even wrong. Remove it but keep the warning reporting a contended bypass lock on a remote CPU, to keep unexpected contention awareness. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2024-06-03rcu/nocb: Use kthread parking instead of ad-hoc implementationFrederic Weisbecker-83/+36
Upon NOCB deoffloading, the rcuo kthread must be forced to sleep until the corresponding rdp is ever offloaded again. The deoffloader clears the SEGCBLIST_OFFLOADED flag, wakes up the rcuo kthread which then notices that change and clears in turn its SEGCBLIST_KTHREAD_CB flag before going to sleep, until it ever sees the SEGCBLIST_OFFLOADED flag again, should a re-offloading happen. Upon NOCB offloading, the rcuo kthread must be forced to wake up and handle callbacks until the corresponding rdp is ever deoffloaded again. The offloader sets the SEGCBLIST_OFFLOADED flag, wakes up the rcuo kthread which then notices that change and sets in turn its SEGCBLIST_KTHREAD_CB flag before going to check callbacks, until it ever sees the SEGCBLIST_OFFLOADED flag cleared again, should a de-offloading happen again. This is all a crude ad-hoc and error-prone kthread (un-)parking re-implementation. Consolidate the behaviour with the appropriate API instead. [ paulmck: Apply Qiang Zhang feedback provided in Link: below. ] Link: https://lore.kernel.org/all/20240509074046.15629-1-qiang.zhang1211@gmail.com/ Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>