linux/tools/testing/selftests/bpf/benchs, branch master

selftests/bpf: Remove kmalloc tracing from local storage create bench

2026-04-11T04:22:31Z

Remove the raw_tp/kmalloc BPF program and its associated reporting from the local storage create benchmark. The kmalloc count per create is not a useful metric as different code paths use different allocators (e.g. kmalloc_nolock vs kzalloc), introducing noise that makes the number hard to interpret. Keep total_creates in the summary output as it is useful for normalizing perf statistics collected alongside the benchmark. Signed-off-by: Amery Hung Link: https://lore.kernel.org/r/20260411015419.114016-2-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov

selftests/bpf: Add usdt trigger bench

2026-03-03T16:39:22Z

Adding usdt trigger bench for usdt: trig-usdt-nop - usdt on top of nop1 instruction trig-usdt-nop5 - usdt on top of nop1/nop5 combo Adding it to benchs/run_bench_uprobes.sh script. Example run on x86_64 kernel with uprobe syscall: # ./benchs/run_bench_uprobes.sh usermode-count : 152.507 ± 0.098M/s syscall-count : 14.309 ± 0.093M/s uprobe-nop : 3.190 ± 0.012M/s uprobe-push : 3.057 ± 0.004M/s uprobe-ret : 1.095 ± 0.009M/s uprobe-nop5 : 7.305 ± 0.034M/s uretprobe-nop : 2.175 ± 0.005M/s uretprobe-push : 2.109 ± 0.003M/s uretprobe-ret : 0.945 ± 0.002M/s uretprobe-nop5 : 3.530 ± 0.006M/s usdt-nop : 3.235 ± 0.008M/s <-- added usdt-nop5 : 7.511 ± 0.045M/s <-- added Signed-off-by: Jiri Olsa Link: https://lore.kernel.org/r/20260224103915.1369690-6-jolsa@kernel.org Signed-off-by: Alexei Starovoitov

selftests/bpf: Refactor bpf_get_ksyms() trace helper

2026-02-24T16:19:49Z

ASAN reported a memory leak in bpf_get_ksyms(): it allocates a struct ksyms internally and never frees it. Move struct ksyms to trace_helpers.h and return it from the bpf_get_ksyms(), giving ownership to the caller. Add filtered_syms and filtered_cnt fields to the ksyms to hold the filtered array of symbols, previously returned by bpf_get_ksyms(). Fixup the call sites: kprobe_multi_test and bench_trigger. Signed-off-by: Ihor Solodrai Acked-by: Eduard Zingerman Link: https://lore.kernel.org/r/20260223190736.649171-10-ihor.solodrai@linux.dev Signed-off-by: Alexei Starovoitov

selftests/bpf: Allow to benchmark trigger with stacktrace

2026-01-30T21:40:09Z

Adding support to call bpf_get_stackid helper from trigger programs, so far added for kprobe multi. Adding the --stacktrace/-g option to enable it. Signed-off-by: Jiri Olsa Signed-off-by: Andrii Nakryiko Link: https://lore.kernel.org/bpf/20260126211837.472802-7-jolsa@kernel.org

selftests/bpf: Add perfbuf multi-producer benchmark

2026-01-20T19:37:25Z

Add a multi-producer benchmark for perfbuf to complement the existing ringbuf multi-producer test. Unlike ringbuf which uses a shared buffer and experiences contention, perfbuf uses per-CPU buffers so the test measures scaling behavior rather than contention. This allows developers to compare perfbuf vs ringbuf performance under multi-producer workloads when choosing between the two for their systems. Signed-off-by: Gyutae Bae Signed-off-by: Andrii Nakryiko Link: https://lore.kernel.org/bpf/20260120090716.82927-1-gyutae.opensource@navercorp.com

selftests/bpf: Call bpf_get_numa_node_id() in trigger_count()

2025-11-25T22:32:50Z

The bench test "trig-kernel-count" can be used as a baseline comparison for fentry and other benchmarks, and the calling to bpf_get_numa_node_id() should be considered as composition of the baseline. So, let's call it in trigger_count(). Meanwhile, rename trigger_count() to trigger_kernel_count() to make it easier understand. Signed-off-by: Menglong Dong Signed-off-by: Andrii Nakryiko Link: https://lore.kernel.org/bpf/20251116014242.151110-1-dongml2@chinatelecom.cn

selftests/bpf/benchs: Add overwrite mode benchmark for BPF ring buffer

2025-10-28T02:47:32Z

Add --rb-overwrite option to benchmark BPF ring buffer in overwrite mode. Since overwrite mode is not yet supported by libbpf for consumer, also add --rb-bench-producer option to benchmark producer directly without a consumer. Benchmarks on an x86_64 and an arm64 CPU are shown below for reference. - AMD EPYC 9654 (x86_64) Ringbuf, multi-producer contention in overwrite mode, no consumer ================================================================= rb-prod nr_prod 1 32.180 ± 0.033M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 2 9.617 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 3 8.810 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 4 9.272 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 8 9.173 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 12 3.086 ± 0.032M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 16 2.945 ± 0.021M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 20 2.519 ± 0.021M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 24 2.545 ± 0.021M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 28 2.363 ± 0.024M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 32 2.357 ± 0.021M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 36 2.267 ± 0.011M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 40 2.284 ± 0.020M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 44 2.215 ± 0.025M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 48 2.193 ± 0.023M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 52 2.208 ± 0.024M/s (drops 0.000 ± 0.000M/s) - HiSilicon Kunpeng 920 (arm64) Ringbuf, multi-producer contention in overwrite mode, no consumer ================================================================= rb-prod nr_prod 1 14.478 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 2 21.787 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 3 6.045 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 4 5.352 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 8 4.850 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 12 3.542 ± 0.016M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 16 3.509 ± 0.021M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 20 3.171 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 24 3.154 ± 0.014M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 28 2.974 ± 0.015M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 32 3.167 ± 0.014M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 36 2.903 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 40 2.866 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 44 2.914 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-prod nr_prod 48 2.806 ± 0.012M/s (drops 0.000 ± 0.000M/s) Rb-prod nr_prod 52 2.840 ± 0.012M/s (drops 0.000 ± 0.000M/s) Signed-off-by: Xu Kuohai Signed-off-by: Andrii Nakryiko Link: https://lore.kernel.org/bpf/20251018035738.4039621-4-xukuohai@huaweicloud.com

selftests/bpf: Fix incorrect array size calculation

2025-09-09T16:23:47Z

The loop in bench_sockmap_prog_destroy() has two issues: 1. Using 'sizeof(ctx.fds)' as the loop bound results in the number of bytes, not the number of file descriptors, causing the loop to iterate far more times than intended. 2. The condition 'ctx.fds[0] > 0' incorrectly checks only the first fd for all iterations, potentially leaving file descriptors unclosed. Change it to 'ctx.fds[i] > 0' to check each fd properly. These fixes ensure correct cleanup of all file descriptors when the benchmark exits. Reported-by: Dan Carpenter Signed-off-by: Jiayuan Chen Signed-off-by: Andrii Nakryiko Link: https://lore.kernel.org/bpf/20250909124721.191555-1-jiayuan.chen@linux.dev Closes: https://lore.kernel.org/bpf/aLqfWuRR9R_KTe5e@stanley.mountain/

selftests/bpf: add benchmark testing for kprobe-multi-all

2025-09-04T16:00:25Z

For now, the benchmark for kprobe-multi is single, which means there is only 1 function is hooked during testing. Add the testing "kprobe-multi-all", which will hook all the kernel functions during the benchmark. And the "kretprobe-multi-all" is added too. Signed-off-by: Menglong Dong Link: https://lore.kernel.org/r/20250904021011.14069-4-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov

selftests/bpf: Add LPM trie microbenchmarks

2025-08-28T00:28:14Z

Add benchmarks for the standard set of operations: LOOKUP, INSERT, UPDATE, DELETE. Also include benchmarks to measure the overhead of the bench framework itself (NOOP) as well as the overhead of generating keys (BASELINE). Lastly, this includes a benchmark for FREE (trie_free()) which is known to have terrible performance for maps with many entries. Benchmarks operate on tries without gaps in the key range, i.e. each test begins or ends with a trie with valid keys in the range [0, nr_entries). This is intended to cause maximum branching when traversing the trie. LOOKUP, UPDATE, DELETE, and FREE fill a BPF LPM trie from userspace using bpf_map_update_batch() and run the corresponding benchmark operation via bpf_loop(). INSERT starts with an empty map and fills it kernel-side from bpf_loop(). FREE records the time to free a filled LPM trie by attaching and destroying a BPF prog. NOOP measures the overhead of the test harness by running an empty function with bpf_loop(). BASELINE is similar to NOOP except that the function generates a key. Each operation runs 10,000 times using bpf_loop(). Note that this value is intentionally independent of the number of entries in the LPM trie so that the stability of the results isn't affected by the number of entries. For those benchmarks that need to reset the LPM trie once it's full (INSERT) or empty (DELETE), throughput and latency results are scaled by the fraction of a second the operation actually ran to ignore any time spent reinitialising the trie. By default, benchmarks run using sequential keys in the range [0, nr_entries). BASELINE, LOOKUP, and UPDATE can use random keys via the --random parameter but beware there is a runtime cost involved in generating random keys. Other benchmarks are prohibited from using random keys because it can skew the results, e.g. when inserting an existing key or deleting a missing one. All measurements are recorded from within the kernel to eliminate syscall overhead. Most benchmarks run an XDP program to generate stats but FREE needs to collect latencies using fentry/fexit on map_free_deferred() because it's not possible to use fentry directly on lpm_trie.c since commit c83508da5620 ("bpf: Avoid deadlock caused by nested kprobe and fentry bpf programs") and there's no way to create/destroy a map from within an XDP program. Here is example output from an AMD EPYC 9684X 96-Core machine for each of the benchmarks using a trie with 10K entries and a 32-bit prefix length, e.g. $ ./bench lpm-trie-$op \ --prefix_len=32 \ --producers=1 \ --nr_entries=10000 noop: throughput 74.417 ± 0.032 M ops/s ( 74.417M ops/prod), latency 13.438 ns/op baseline: throughput 70.107 ± 0.171 M ops/s ( 70.107M ops/prod), latency 14.264 ns/op lookup: throughput 8.467 ± 0.047 M ops/s ( 8.467M ops/prod), latency 118.109 ns/op insert: throughput 2.440 ± 0.015 M ops/s ( 2.440M ops/prod), latency 409.290 ns/op update: throughput 2.806 ± 0.042 M ops/s ( 2.806M ops/prod), latency 356.322 ns/op delete: throughput 4.625 ± 0.011 M ops/s ( 4.625M ops/prod), latency 215.613 ns/op free: throughput 0.578 ± 0.006 K ops/s ( 0.578K ops/prod), latency 1.730 ms/op And the same benchmarks using random keys: $ ./bench lpm-trie-$op \ --prefix_len=32 \ --producers=1 \ --nr_entries=10000 \ --random noop: throughput 74.259 ± 0.335 M ops/s ( 74.259M ops/prod), latency 13.466 ns/op baseline: throughput 35.150 ± 0.144 M ops/s ( 35.150M ops/prod), latency 28.450 ns/op lookup: throughput 7.119 ± 0.048 M ops/s ( 7.119M ops/prod), latency 140.469 ns/op insert: N/A update: throughput 2.736 ± 0.012 M ops/s ( 2.736M ops/prod), latency 365.523 ns/op delete: N/A free: N/A Signed-off-by: Matt Fleming Signed-off-by: Jesper Dangaard Brouer Link: https://lore.kernel.org/r/20250827140149.1001557-1-matt@readmodwrite.com Signed-off-by: Alexei Starovoitov