path: root/kernel
Age | Commit message | Author | Files/Lines
2025-09-25 | PM: hibernate: Add pm_hibernation_mode_is_suspend() | Mario Limonciello (AMD) | 1 file, +11/-0

Some drivers have different flows for hibernation and suspend. If a driver will opportunistically skip thaw(), it needs a hint about what is happening after the hibernation. Introduce a new symbol, pm_hibernation_mode_is_suspend(), that drivers can call to determine whether the system is suspending for this purpose.

Tested-by: Ionut Nechita <ionut_n2001@yahoo.com>
Tested-by: Kenneth Crudup <kenny@panix.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
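For illustration, a minimal sketch of how a driver might consume the new hint; the dev_pm_ops-style callback and the mydrv_* helpers are assumptions for this example, not taken from the patch:

  #include <linux/pm.h>
  #include <linux/suspend.h>

  static int mydrv_freeze(struct device *dev)
  {
          /*
           * In hybrid sleep (hibernation followed by suspend), a driver
           * that opportunistically skips thaw() can prepare for the
           * upcoming suspend instead of a full power-off.
           */
          if (pm_hibernation_mode_is_suspend())
                  return mydrv_prepare_for_suspend(dev);   /* hypothetical */

          return mydrv_prepare_for_poweroff(dev);          /* hypothetical */
  }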
2025-09-25 | PM: hibernate: Fix hybrid-sleep | Mario Limonciello (AMD) | 1 file, +4/-0

Hybrid sleep hibernates the system and then runs through the suspend routine. Since both the hibernate and the suspend routines call pm_restrict_gfp_mask(), pm_restore_gfp_mask() must be called before starting the suspend sequence.

Add an explicit call to pm_restore_gfp_mask() in power_down() before the suspend sequence starts. Add an extra call to pm_restrict_gfp_mask() when exiting suspend so that the pm_restore_gfp_mask() call in hibernate() is balanced.

Reported-by: Ionut Nechita <ionut_n2001@yahoo.com>
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4573
Tested-by: Ionut Nechita <ionut_n2001@yahoo.com>
Fixes: 12ffc3b1513eb ("PM: Restrict swap use to later in the suspend sequence")
Tested-by: Kenneth Crudup <kenny@panix.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Link: https://patch.msgid.link/20250925185108.2968494-2-superm1@kernel.org
[ rjw: Add comment explaining the new pm_restrict_gfp_mask() call purpose ]
Cc: 6.16+ <stable@vger.kernel.org> # 6.16+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
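A rough sketch of the balancing described above, loosely following the shape of power_down() in kernel/power/hibernate.c (simplified; error handling elided):

  /* hybrid sleep: hibernate first, then suspend */
  case HIBERNATION_SUSPEND:
          /*
           * hibernate() already restricted the GFP mask and the suspend
           * path will restrict it again, so restore it first to keep
           * the calls balanced.
           */
          pm_restore_gfp_mask();
          error = suspend_devices_and_enter(PM_SUSPEND_MEM);
          /*
           * Re-restrict on exit from suspend so the pm_restore_gfp_mask()
           * call in hibernate() stays balanced.
           */
          pm_restrict_gfp_mask();
          break;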
2025-09-25 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Jakub Kicinski | 6 files, +43/-7

Cross-merge networking fixes after downstream PR (net-6.17-rc8).

Conflicts:

  drivers/net/can/spi/hi311x.c
    6b6968084721 ("can: hi311x: fix null pointer dereference when resuming from sleep before interface was enabled")
    27ce71e1ce81 ("net: WQ_PERCPU added to alloc_workqueue users")
  https://lore.kernel.org/72ce7599-1b5b-464a-a5de-228ff9724701@kernel.org

  net/smc/smc_loopback.c
  drivers/dibs/dibs_loopback.c
    a35c04de2565 ("net/smc: fix warning in smc_rx_splice() when calling get_page()")
    cc21191b584c ("dibs: Move data path to dibs layer")
  https://lore.kernel.org/74368a5c-48ac-4f8e-a198-40ec1ed3cf5f@kernel.org

Adjacent changes:

  drivers/net/dsa/lantiq/lantiq_gswip.c
    c0054b25e2f1 ("net: dsa: lantiq_gswip: move gswip_add_single_port_br() call to port_setup()")
    7a1eaef0a791 ("net: dsa: lantiq_gswip: support model-specific mac_select_pcs()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-25 | Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost | Linus Torvalds | 1 file, +2/-1

Pull virtio fixes from Michael Tsirkin:
 "virtio,vhost: last minute fixes

  More small fixes. Most notably this fixes crashes and hangs in vhost-net"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  MAINTAINERS, mailmap: Update address for Peter Hilber
  virtio_config: clarify output parameters
  uapi: vduse: fix typo in comment
  vhost: Take a reference on the task in struct vhost_task.
  vhost-net: flush batched before enabling notifications
  Revert "vhost/net: Defer TX queue re-enable until after sendmsg"
  vhost-net: unbreak busy polling
  vhost-scsi: fix argument order in tport allocation error message
2025-09-25 | sched: Make migrate_{en,dis}able() inline | Menglong Dong | 3 files, +27/-49

For now, migrate_enable() and migrate_disable() are global, which makes them become hotspots in some cases. Take BPF for example: the calls to migrate_enable() and migrate_disable() in the BPF trampoline can introduce significant overhead. Following is the 'perf top' of the FENTRY benchmark (./tools/testing/selftests/bpf/bench trig-fentry):

  54.63% bpf_prog_2dcccf652aac1793_bench_trigger_fentry [k] bpf_prog_2dcccf652aac1793_bench_trigger_fentry
  10.43% [kernel] [k] migrate_enable
  10.07% bpf_trampoline_6442517037 [k] bpf_trampoline_6442517037
   8.06% [kernel] [k] __bpf_prog_exit_recur
   4.11% libc.so.6 [.] syscall
   2.15% [kernel] [k] entry_SYSCALL_64
   1.48% [kernel] [k] memchr_inv
   1.32% [kernel] [k] fput
   1.16% [kernel] [k] _copy_to_user
   0.73% [kernel] [k] bpf_prog_test_run_raw_tp

So in this commit, we make migrate_enable()/migrate_disable() inline to obtain better performance.

The struct rq is defined internally in kernel/sched/sched.h, and the field "nr_pinned" is accessed in migrate_enable()/migrate_disable(), which makes it hard to inline them. Alexei Starovoitov suggested generating the offset of "nr_pinned" in [1], so we can define migrate_enable()/migrate_disable() in include/linux/sched.h and access "this_rq()->nr_pinned" as "(void *)this_rq() + RQ_nr_pinned". The offset of "nr_pinned" is generated in include/generated/rq-offsets.h by kernel/sched/rq-offsets.c.

Generally speaking, we move the definitions of migrate_enable() and migrate_disable() from kernel/sched/core.c to include/linux/sched.h. The call to __set_cpus_allowed_ptr() is left in ___migrate_enable().

The "struct rq" is not available in include/linux/sched.h, so we can't access the "runqueues" with this_cpu_ptr(), as the compilation would fail in this_cpu_ptr() -> raw_cpu_ptr() -> __verify_pcpu_ptr() on:

  typeof((ptr) + 0)

So we introduce this_rq_raw() and access the runqueues with arch_raw_cpu_ptr()/PERCPU_PTR directly.

The variable "runqueues" is not visible to kernel modules, and exporting it is not a good idea. As Peter Zijlstra advised in [2], we also define and export migrate_enable()/migrate_disable() in kernel/sched/core.c, and use those for the modules.

Before this patch, the performance of BPF FENTRY is:

  fentry : 113.030 ± 0.149M/s
  fentry : 112.501 ± 0.187M/s
  fentry : 112.828 ± 0.267M/s
  fentry : 115.287 ± 0.241M/s

After this patch, the performance of BPF FENTRY increases to:

  fentry : 143.644 ± 0.670M/s
  fentry : 149.764 ± 0.362M/s
  fentry : 149.642 ± 0.156M/s
  fentry : 145.263 ± 0.221M/s

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/bpf/CAADnVQ+5sEDKHdsJY5ZsfGDO_1SEhhQWHrt2SMBG5SYyQ+jt7w@mail.gmail.com/ [1]
Link: https://lore.kernel.org/all/20250819123214.GH4067720@noisy.programming.kicks-ass.net/ [2]
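A condensed sketch of the offset trick; RQ_nr_pinned and this_rq_raw() follow the commit's description, while the inline body below is a simplification (the real migrate_disable() also manages current->migration_disabled and preemption):

  #include <generated/rq-offsets.h>    /* defines RQ_nr_pinned */

  /*
   * struct rq is opaque outside kernel/sched/, so the inline code cannot
   * dereference it; only the byte offset of ->nr_pinned is known.
   */
  #define this_rq_pinned() (*(int *)((void *)this_rq_raw() + RQ_nr_pinned))

  static __always_inline void migrate_disable_sketch(void)
  {
          this_rq_pinned()++;
  }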
2025-09-25 | sched/deadline: Fix dl_server behaviour | Peter Zijlstra | 2 files, +33/-23

John reported undesirable behaviour with the dl_server since commit cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling"). When starving fair tasks on purpose (by starting spinning FIFO tasks), his fair workload, which often goes (briefly) idle, would see fair invocations delayed for a second; running one invocation per second was both unexpected and terribly slow.

The reason this happens is that when dl_se->server_pick_task() returns NULL, indicating no runnable tasks, the server would yield, pushing any later jobs out a whole period (1 second).

Instead, simply stop the server. This should restore the behaviour in which a later wakeup (which restarts the server) will be able to continue running (subject to the CBS wakeup rules).

Notably, this does not re-introduce the behaviour cccb45d7c4295 set out to solve: any start/stop cycle is naturally throttled by the timer period (no active cancel).

Fixes: cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling")
Reported-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: John Stultz <jstultz@google.com>
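An illustrative sketch of the change in the no-task case; dl_server_stop() exists in the deadline code, but the surrounding flow here is a simplified stand-in, not the actual diff:

  struct task_struct *p = dl_se->server_pick_task(dl_se);

  if (!p) {
          /*
           * Old behaviour (simplified): yield, pushing the next fair
           * invocation out a whole period (1 second).
           *
           * New behaviour: stop the server; a later fair wakeup
           * restarts it, subject to the CBS wakeup rules.
           */
          dl_server_stop(dl_se);
          return NULL;
  }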
2025-09-25 | sched/deadline: Fix dl_server getting stuck | Peter Zijlstra | 3 files, +2/-21

John found it was easy to hit lockup warnings when running locktorture on a 2 CPU VM, which he bisected down to commit cccb45d7c429 ("sched/deadline: Less agressive dl_server handling").

While debugging, it appears there is a window where we end up with the dl_server dequeued while dl_se->dl_server_active is still set. This causes dl_server_start() to return without enqueueing the dl_server, so it fails to run when RT tasks starve the cpu.

When this happens, dl_server_timer() catches the '!dl_se->server_has_tasks(dl_se)' case, which then calls replenish_dl_entity() and dl_server_stopped() and finally returns HRTIMER_NO_RESTART. This ends with no new timer and also no enqueue, leaving the dl_server 'dead' and allowing starvation.

What should have happened is for the bandwidth timer to start the zero-laxity timer, which in turn would enqueue the dl_server and cause dl_se->server_pick_task() to be called -- which will stop the dl_server if no fair tasks are observed for a whole period.

IOW, it is totally irrelevant whether there are fair tasks at the moment of bandwidth refresh. This removes all dl_se->server_has_tasks() users, so remove the whole thing.

Fixes: cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling")
Reported-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: John Stultz <jstultz@google.com>
2025-09-25 | ns: drop assert | Christian Brauner | 1 file, +0/-2

Otherwise we warn when, e.g., no namespaces are configured but the initial namespace is still around.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-25 | ns: move ns type into struct ns_common | Christian Brauner | 11 files, +13/-15

It's misplaced in struct proc_ns_operations, and ns->ops might be NULL if the namespace is compiled out, but we still want to know the type of the namespace for the initial namespace struct.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
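A minimal before/after sketch of the move as described; the exact field names in mainline are assumptions here:

  /* before: the type lives behind ns->ops, which may be NULL when the
   * namespace is compiled out */
  struct proc_ns_operations {
          const char *name;
          int type;                              /* moved out */
          /* ... */
  };

  /* after: the type is always available on the common header */
  struct ns_common {
          const struct proc_ns_operations *ops;
          int ns_type;                           /* assumed name */
          unsigned int inum;
          /* ... */
  };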
2025-09-25 | nstree: make struct ns_tree private | Christian Brauner | 1 file, +14/-0

Don't expose it directly. There's no need to do that.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-24 | Merge tag 'probes-fixes-v6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace | Linus Torvalds | 2 files, +8/-3

Pull probes fixes from Masami Hiramatsu:

 - fprobe: Even if there is a memory allocation failure, try to remove the addresses recorded until then from the filter. Previously we just skipped it.

 - tracing: dynevent: Add a missing lockdown check on dynevent. This dynevent is the interface for all probe events; thus, without this check, any probe event can be added even after tracefs has been locked down.

* tag 'probes-fixes-v6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: dynevent: Add a missing lockdown check on dynevent
  tracing: fprobe: Fix to remove recorded module addresses from filter
2025-09-24 | kcfi: Rename CONFIG_CFI_CLANG to CONFIG_CFI | Kees Cook | 4 files, +5/-5

The kernel's CFI implementation uses the KCFI ABI specifically and is not strictly tied to a particular compiler. In preparation for GCC supporting KCFI, rename CONFIG_CFI_CLANG to CONFIG_CFI (along with the associated options). Use a new "transitional" Kconfig option for the old CONFIG_CFI_CLANG that will enable CONFIG_CFI during olddefconfig.

Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20250923213422.1105654-3-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
2025-09-24 | Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next | Jakub Kicinski | 1 file, +13/-0

Martin KaFai Lau says:

====================
pull-request: bpf-next 2025-09-23

We've added 9 non-merge commits during the last 33 day(s) which contain a total of 10 files changed, 480 insertions(+), 53 deletions(-).

The main changes are:

1) A new bpf_xdp_pull_data kfunc that supports pulling data from a frag into the linear area of an xdp_buff, from Amery Hung. This includes changes in the xdp_native.bpf.c selftest, which Nimrod's future work depends on. It is a merge from a stable branch 'xdp_pull_data' which has also been merged to bpf-next. There is a conflict with recent changes in 'include/net/xdp.h' in the net-next tree that will need to be resolved.

2) A compiler warning fix when CONFIG_NET=n in the recent dynptr skb_meta support, from Jakub Sitnicki.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  selftests: drv-net: Pull data before parsing headers
  selftests/bpf: Test bpf_xdp_pull_data
  bpf: Support specifying linear xdp packet data size for BPF_PROG_TEST_RUN
  bpf: Make variables in bpf_prog_test_run_xdp less confusing
  bpf: Clear packet pointers after changing packet data in kfuncs
  bpf: Support pulling non-linear xdp data
  bpf: Allow bpf_xdp_shrink_data to shrink a frag from head and tail
  bpf: Clear pfmemalloc flag when freeing all fragments
  bpf: Return an error pointer for skb metadata when CONFIG_NET=n
====================

Link: https://patch.msgid.link/20250924050303.2466356-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-24 | Merge branch 'for-next/uprobes' into for-next/core | Will Deacon | 1 file, +1/-1

* for-next/uprobes:
  arm64: probes: Fix incorrect bl/blr address and register usage
  uprobes: uprobe_warn should use passed task
  arm64: Kconfig: Remove GCS restrictions on UPROBES
  arm64: uprobes: Add GCS support to uretprobes
  arm64: probes: Add GCS support to bl/blr/ret
  arm64: uaccess: Add additional userspace GCS accessors
  arm64: uaccess: Move existing GCS accessors definitions to gcs.h
  arm64: probes: Break ret out from bl/blr
2025-09-25 | tracing: dynevent: Add a missing lockdown check on dynevent | Masami Hiramatsu (Google) | 1 file, +4/-0

Since the dynamic_events interface on tracefs is compatible with kprobe_events and uprobe_events, it should also check the lockdown status and reject writes when lockdown is set.

Link: https://lore.kernel.org/all/175824455687.45175.3734166065458520748.stgit@devnote2/
Fixes: 17911ff38aa5 ("tracing: Add locked_down checks to the open calls of files created for tracefs")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
2025-09-24 | tracing: fprobe: Fix to remove recorded module addresses from filter | Masami Hiramatsu (Google) | 1 file, +4/-3

Even if there is a memory allocation failure in fprobe_addr_list_add(), there is a partial list of module addresses, so remove any addresses recorded so far from the filter. This also removes the redundant 'ret' local variable.

Fixes: a3dc2983ca7b ("tracing: fprobe: Cleanup fprobe hash when module unloading")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Reviewed-by: Menglong Dong <menglong8.dong@gmail.com>
2025-09-24 | uprobe: Do not emulate/sstep original instruction when ip is changed | Jiri Olsa | 1 file, +7/-0

If a uprobe handler changes the instruction pointer, we still single-step or emulate the original instruction and increment the (new) ip by its length. This makes the new instruction pointer bogus, and the application will likely crash on illegal instruction execution.

If the user decided to take execution elsewhere, it makes little sense to execute the original instruction, so let's skip it.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20250916215301.664963-3-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
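A sketch of the resulting check; instruction_pointer() and handler_chain() are the real uprobe-core names, but the flow is condensed from the actual code:

  unsigned long orig_ip = instruction_pointer(regs);

  handler_chain(uprobe, regs);

  /*
   * If a handler moved the instruction pointer, the user redirected
   * execution: skip single-stepping/emulating the original instruction,
   * which would otherwise increment the new ip by the old insn length.
   */
  if (instruction_pointer(regs) != orig_ip)
          return;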
2025-09-24 | bpf: Allow uprobe program to change context registers | Jiri Olsa | 2 files, +11/-2

Currently a uprobe (BPF_PROG_TYPE_KPROBE) program can't write to the context register data. While this makes sense for kprobe attachments, for uprobe attachments it might make sense to be able to change user-space registers to alter application execution.

Since uprobe and kprobe programs share the same type (BPF_PROG_TYPE_KPROBE), we can't deny write access to the context during program load. We need to check at attachment time whether it's going to be a kprobe or a uprobe: store the program's attempt to write to the context and check it during attachment.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20250916215301.664963-2-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
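A hypothetical BPF-side example of what this enables; the attach point and target address are made up for illustration, and PT_REGS_IP() is used as an lvalue on x86:

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  SEC("uprobe//usr/bin/myapp:my_func")
  int BPF_UPROBE(redirect_my_func)
  {
          /* Writing the user-space register context alters where the
           * application resumes; only valid for uprobe attachments. */
          PT_REGS_IP(ctx) = 0x401000;  /* hypothetical address */
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";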
2025-09-24 | futex: Use correct exit on failure from futex_hash_allocate_default() | Sebastian Andrzej Siewior | 1 file, +1/-1

copy_process() uses the wrong error exit path from futex_hash_allocate_default(). After exiting from futex_hash_allocate_default(), neither tasklist_lock nor siglock has been acquired. The exit label bad_fork_core_free unlocks both of these locks, which is wrong.

The next exit label, bad_fork_cancel_cgroup, is the correct exit. sched_cgroup_fork() did not allocate any resources that need to be freed.

Use bad_fork_cancel_cgroup on error exit from futex_hash_allocate_default().

Fixes: 7c4f75a21f636 ("futex: Allow automatic allocation of process wide futex hash")
Reported-by: syzbot+80cb3cc5c14fad191a10@syzkaller.appspotmail.com
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Closes: https://lore.kernel.org/all/68cb1cbd.050a0220.2ff435.0599.GAE@google.com
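A condensed view of the corrected error path in copy_process(); the surrounding code is elided:

  retval = futex_hash_allocate_default();
  if (retval)
          /*
           * Neither tasklist_lock nor siglock is held here yet, so
           * bad_fork_core_free (which unlocks both) would be wrong.
           */
          goto bad_fork_cancel_cgroup;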
2025-09-23Revert "sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()"Tejun Heo1-1/+1
This reverts commit c8191ee8e64a8c5c021a34e32868f2380965e82b which triggers the following suspicious RCU usage warning: [ 6.647598] ============================= [ 6.647603] WARNING: suspicious RCU usage [ 6.647605] 6.17.0-rc7-virtme #1 Not tainted [ 6.647608] ----------------------------- [ 6.647608] ./include/linux/rhashtable.h:602 suspicious rcu_dereference_check() usage! [ 6.647610] [ 6.647610] other info that might help us debug this: [ 6.647610] [ 6.647612] [ 6.647612] rcu_scheduler_active = 2, debug_locks = 1 [ 6.647613] 1 lock held by swapper/10/0: [ 6.647614] #0: ffff8b14bbb3cc98 (&rq->__lock){-.-.}-{2:2}, at: +raw_spin_rq_lock_nested+0x20/0x90 [ 6.647630] [ 6.647630] stack backtrace: [ 6.647633] CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Not tainted 6.17.0-rc7-virtme #1 +PREEMPT(full) [ 6.647643] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 6.647646] Sched_ext: beerland_1.0.2_g27d63fc3_x86_64_unknown_linux_gnu (enabled+all) [ 6.647648] Call Trace: [ 6.647652] <IRQ> [ 6.647655] dump_stack_lvl+0x78/0xe0 [ 6.647665] lockdep_rcu_suspicious+0x14a/0x1b0 [ 6.647672] __rhashtable_lookup.constprop.0+0x1d5/0x250 [ 6.647680] find_dsq_for_dispatch+0xbc/0x190 [ 6.647684] do_enqueue_task+0x25b/0x550 [ 6.647689] enqueue_task_scx+0x21d/0x360 [ 6.647692] ? trace_lock_acquire+0x22/0xb0 [ 6.647695] enqueue_task+0x2e/0xd0 [ 6.647698] ttwu_do_activate+0xa2/0x290 [ 6.647703] sched_ttwu_pending+0xfd/0x250 [ 6.647706] __flush_smp_call_function_queue+0x1cd/0x610 [ 6.647714] __sysvec_call_function_single+0x34/0x150 [ 6.647720] sysvec_call_function_single+0x6e/0x80 [ 6.647726] </IRQ> [ 6.647726] <TASK> [ 6.647727] asm_sysvec_call_function_single+0x1a/0x20 Reported-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23 | Merge branch 'bpf-next/xdp_pull_data' into 'bpf-next/master' | Martin KaFai Lau | 1 file, +13/-0

Merge the xdp_pull_data stable branch into the master branch. No conflict.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2025-09-23 | Merge branch 'bpf-next/xdp_pull_data' into 'bpf-next/net' | Martin KaFai Lau | 1 file, +13/-0

Merge the xdp_pull_data stable branch into the net branch. No conflict.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2025-09-23 | kho: make sure page being restored is actually from KHO | Pratyush Yadav | 1 file, +34/-7

When restoring a page, no sanity checks are done to make sure the page actually came from a kexec handover. The caller is trusted to pass in the right address. If the caller has a bug and passes in a wrong address, an in-use page might be "restored" and returned, causing all sorts of memory corruption.

Harden the page restore logic by stashing a magic number in page->private along with the order. If the magic number does not match, the page won't be touched. page->private is an unsigned long. The union kho_page_info splits it into two parts, with one holding the order and the other holding the magic number.

Link: https://lkml.kernel.org/r/20250917125725.665-2-pratyush@kernel.org
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
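A sketch of the page->private layout the commit describes; the field widths and the magic value are assumptions for illustration:

  /*
   * page->private is an unsigned long; split it into an order and a
   * magic number so kho_restore_page() can reject pages that were not
   * preserved through KHO.
   */
  union kho_page_info {
          unsigned long page_private;
          struct {
                  unsigned int order;   /* saved at preserve time */
                  unsigned int magic;   /* must match the KHO magic */
          };
  };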
2025-09-23 | kho: move sanity checks to kho_restore_page() | Pratyush Yadav | 1 file, +14/-14

While KHO exposes folio as the primitive externally, internally its restoration machinery operates on pages. This can be seen with kho_restore_folio(), for example: it performs some sanity checks and hands the page over to kho_restore_page() to do the heavy lifting of page restoration. After the work done by kho_restore_page(), kho_restore_folio() only converts the head page to a folio and returns it. Similarly, deserialize_bitmap() operates on the head page directly to store the order.

Move the sanity checks for valid phys and order from the public-facing kho_restore_folio() to the private-facing kho_restore_page(). This makes the boundary between page and folio clearer from KHO's perspective.

While at it, drop the comment above kho_restore_page(). The comment is misleading now: the function stopped looking like free_reserved_page() since 12b9a2c05d1b4 ("kho: initialize tail pages for higher order folios properly"), and now looks even more different.

Link: https://lkml.kernel.org/r/20250917125725.665-1-pratyush@kernel.org
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-23 | bpf: Clear packet pointers after changing packet data in kfuncs | Amery Hung | 1 file, +13/-0

bpf_xdp_pull_data() may change packet data and therefore packet pointers need to be invalidated. Add bpf_xdp_pull_data() to the special kfunc list instead of introducing a new KF_ flag until there are more kfuncs changing packet data.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250922233356.3356453-5-ameryhung@gmail.com
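A minimal BPF program sketch of the pattern this implies; the kfunc signature is assumed from the description, and the 64-byte pull size is arbitrary:

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  extern int bpf_xdp_pull_data(struct xdp_md *xdp, __u32 len) __ksym;

  SEC("xdp")
  int pull_then_parse(struct xdp_md *xdp)
  {
          /* may change packet data: all prior packet pointers become invalid */
          if (bpf_xdp_pull_data(xdp, 64))
                  return XDP_DROP;

          /* so packet pointers must be re-loaded after the call */
          void *data = (void *)(long)xdp->data;
          void *data_end = (void *)(long)xdp->data_end;

          if (data + 64 > data_end)
                  return XDP_DROP;
          return XDP_PASS;
  }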
2025-09-23 | sched_ext: Merge branch 'for-6.17-fixes' into for-6.18 | Tejun Heo | 3 files, +33/-7

Pull sched_ext/for-6.17-fixes to receive:

  55ed11b181c4 ("sched_ext: idle: Handle migration-disabled tasks in BPF code")

which conflicts with the following commit in for-6.18:

  2407bae23d1e ("sched_ext: Add the @sch parameter to ext_idle helpers")

The conflict is a simple context conflict which can be resolved by taking the updated parts from both commits.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23 | bpf: Allow union argument in trampoline based programs | Leon Hwang | 1 file, +4/-4

Currently, functions with 'union' arguments cannot be traced with fentry/fexit:

  bpftrace -e 'fentry:release_pages { exit(); }' -v
  The function release_pages arg0 type UNION is unsupported.

The type of the 'release_pages' arg0 is defined as:

  typedef union {
          struct page **pages;
          struct folio **folios;
          struct encoded_page **encoded_pages;
  } release_pages_arg __attribute__ ((__transparent_union__));

This patch relaxes the restriction in the verifier, allowing function arguments of type 'union' to be traced.

Reviewed-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20250919044110.23729-2-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-23 | sched_ext: Misc updates around scx_sched instance pointer | Tejun Heo | 1 file, +40/-22

In preparation for multiple scheduler support:

 - Add the @sch parameter to find_global_dsq() and refill_task_slice_dfl().
 - Restructure scx_allow_ttwu_queue() and make it read scx_root into $sch.
 - Make RCU protection in scx_dsq_move() and scx_bpf_dsq_move_to_local() explicit.

v2: Add scx_root -> sch conversion in scx_allow_ttwu_queue().

Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23 | sched_ext: Drop scx_kf_exit() and scx_kf_error() | Tejun Heo | 2 files, +88/-63

The intention behind scx_kf_exit/error() was that, when called from kfuncs, scx_kf_exit/error() would be able to implicitly determine the scx_sched instance being operated on and thus wouldn't need the @sch parameter passed in explicitly. This turned out to be unnecessarily complicated to implement and not have enough practical benefits. Replace scx_kf_exit/error() usages with scx_exit/error(), which take an explicit @sch parameter.

 - Add the @sch parameter to scx_kf_allowed(), scx_kf_allowed_on_arg_tasks(), mark_direct_dispatch() and other intermediate functions transitively.
 - In callers that don't already have @sch available, grab RCU, read $scx_root, verify it's not NULL, and use it.

Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23 | sched_ext: Add the @sch parameter to scx_dsq_insert_preamble/commit() | Tejun Heo | 1 file, +22/-7

In preparation for multiple scheduler support, add the @sch parameter to scx_dsq_insert_preamble/commit() and update the callers to read $scx_root and pass it in. The passed in @sch parameter is not used yet.

Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23 | sched_ext: Drop kf_cpu_valid() | Tejun Heo | 2 files, +48/-31

The intention behind kf_cpu_valid() was that, when called from kfuncs, kf_cpu_valid() would be able to implicitly determine the scx_sched instance being operated on and thus wouldn't need @sch passed in explicitly. This turned out to be unnecessarily complicated to implement and not have justifiable practical benefits.

Replace kf_cpu_valid() usages with ops_cpu_valid(), which takes an explicit @sch. Callers which don't have $sch available in the context are updated to read $scx_root under RCU read lock, verify that it's not NULL and pass it in. scx_bpf_cpu_rq() is restructured to use guard(rcu)() instead of explicit rcu_read_[un]lock().

Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23 | sched_ext: Add the @sch parameter to ext_idle helpers | Tejun Heo | 1 file, +94/-15

In preparation for multiple scheduler support, add the @sch parameter to validate_node(), check_builtin_idle_enabled() and select_cpu_from_kfunc(), and update their callers to read $scx_root, verify that it's not NULL and pass it in. The passed in @sch parameter is not used yet.

Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23 | sched_ext: Add the @sch parameter to __bstr_format() | Tejun Heo | 1 file, +21/-7

In preparation for multiple scheduler support, add the @sch parameter to __bstr_format() and update the callers to read $scx_root, verify that it's not NULL and pass it in. The passed in @sch parameter is not used yet.

Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23 | sched_ext: Separate out scx_kick_cpu() and add @sch to it | Tejun Heo | 1 file, +27/-16

In preparation for multiple scheduler support, separate out scx_kick_cpu() from scx_bpf_kick_cpu() and add the @sch parameter to it. scx_bpf_kick_cpu() now acquires an RCU read lock, reads $scx_root, and calls scx_kick_cpu() with it if non-NULL. The passed in @sch parameter is not used yet.

Internal uses of scx_bpf_kick_cpu() are converted to scx_kick_cpu(). Where $sch is available, it's used. In the pick_task_scx() path, where no associated scheduler can be identified, $scx_root is used directly. Note that $scx_root cannot be NULL in this case.

Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23 | sched_ext: Add SCX_EFLAG_INITIALIZED to indicate successful ops.init() | Tejun Heo | 2 files, +14/-0

ops.exit() may be called even if the loading failed before ops.init() finishes successfully. This is because ops.exit() allows rich exit info communication. Add the SCX_EFLAG_INITIALIZED flag to scx_exit_info.flags to indicate whether ops.init() finished successfully. This enables BPF schedulers to distinguish between exit scenarios and handle cleanup appropriately based on initialization state.

Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23 | sched_ext: Use bitfields for boolean warning flags | Tejun Heo | 1 file, +2/-2

Convert warned_zero_slice and warned_deprecated_rq in the scx_sched struct to single-bit bitfields. While this doesn't reduce the struct size immediately, it prepares for future bitfield additions.

v2: Update patch description.

Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23 | sched_ext: Fix stray scx_root usage in task_can_run_on_remote_rq() | Tejun Heo | 1 file, +1/-2

task_can_run_on_remote_rq() takes @sch but uses scx_root when incrementing SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE, which is inconsistent and gets in the way of implementing multiple scheduler support. Use @sch instead. As scx_root is currently the only possible scheduler instance, this doesn't cause any behavior changes.

Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23 | sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast() | Tejun Heo | 1 file, +1/-1

The find_user_dsq() function is called from contexts that are already under RCU read lock protection. Switch from rhashtable_lookup_fast() to rhashtable_lookup() to avoid redundant RCU locking.

Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23 | bpf, x86: Add support for signed arena loads | Kumar Kartikeya Dwivedi | 1 file, +8/-3

Currently, signed load instructions into arena memory are unsupported. The compiler is free to generate these, and on GCC-14 we see a corresponding error when it happens. The hurdle in supporting them is deciding which unused opcode to use to mark them for the JIT's own consumption. After much thinking, it appears 0xc0 / BPF_NOSPEC can be combined with load instructions to identify signed arena loads. Use this to recognize and JIT them appropriately, and remove the verifier-side limitation on the program if the JIT supports them.

Co-developed-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20250923110157.18326-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-23 | sched_ext: Verify RCU protection in scx_bpf_cpu_curr() | Andrea Righi | 1 file, +1/-1

scx_bpf_cpu_curr() has been introduced to retrieve the current task of a given runqueue, allowing schedulers to interact with that task. The kfunc assumes that it is always called in an RCU context, but this is not always guaranteed, and some BPF schedulers can trigger the following warning:

  WARNING: suspicious RCU usage
  sched_ext: BPF scheduler "cosmos_1.0.2_gd0e71ca_x86_64_unknown_linux_gnu_debug" enabled
  6.17.0-rc1 #1-NixOS Not tainted
  -----------------------------
  kernel/sched/ext.c:6415 suspicious rcu_dereference_check() usage!
  ...
  Call Trace:
   <IRQ>
   dump_stack_lvl+0x6f/0xb0
   lockdep_rcu_suspicious.cold+0x4e/0x96
   scx_bpf_cpu_curr+0x7e/0x80
   bpf_prog_c68b2b6b6b1b0ff8_sched_timerfn+0xce/0x1dc
   bpf_timer_cb+0x7b/0x130
   __hrtimer_run_queues+0x1ea/0x380
   hrtimer_run_softirq+0x8c/0xd0
   handle_softirqs+0xc9/0x3b0
   __irq_exit_rcu+0x96/0xc0
   irq_exit_rcu+0xe/0x20
   sysvec_apic_timer_interrupt+0x73/0x80
   </IRQ>

To address this, mark the kfunc with KF_RCU_PROTECTED, so the verifier can enforce its usage only inside RCU-protected sections.

Note: this also requires commit 1512231b6cc86 ("bpf: Enforce RCU protection for KF_RCU_PROTECTED"), currently in bpf-next, to properly enforce KF_RCU_PROTECTED.

Fixes: 20b158094a1ad ("sched_ext: Introduce scx_bpf_cpu_curr()")
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23 | tracing: dynevent: Add a missing lockdown check on dynevent | Masami Hiramatsu (Google) | 1 file, +4/-0

Since the dynamic_events interface on tracefs is compatible with kprobe_events and uprobe_events, it should also check the lockdown status and reject writes when lockdown is set.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/175824455687.45175.3734166065458520748.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-23 | tracing/osnoise: Fix slab-out-of-bounds in _parse_integer_limit() | Wang Liang | 1 file, +2/-1

When configuring osnoise cpus via the write() syscall, the following KASAN splat may be observed:

  BUG: KASAN: slab-out-of-bounds in _parse_integer_limit+0x103/0x130
  Read of size 1 at addr ffff88810121e3a1 by task test/447
  CPU: 1 UID: 0 PID: 447 Comm: test Not tainted 6.17.0-rc6-dirty #288 PREEMPT(voluntary)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  Call Trace:
   <TASK>
   dump_stack_lvl+0x55/0x70
   print_report+0xcb/0x610
   kasan_report+0xb8/0xf0
   _parse_integer_limit+0x103/0x130
   bitmap_parselist+0x16d/0x6f0
   osnoise_cpus_write+0x116/0x2d0
   vfs_write+0x21e/0xcc0
   ksys_write+0xee/0x1c0
   do_syscall_64+0xa8/0x2a0
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
   </TASK>

This issue can be reproduced with the code below:

  const char *cpulist = "1";
  int fd = open("/sys/kernel/debug/tracing/osnoise/cpus", O_WRONLY);
  write(fd, cpulist, strlen(cpulist));

bitmap_parselist() is called to parse the cpulist; it requires that the parameter 'buf' be terminated with a '\0' or '\n'. Fix this issue by adding a '\0' to 'buf' in osnoise_cpus_write().

Cc: <mhiramat@kernel.org>
Cc: <mathieu.desnoyers@efficios.com>
Cc: <tglozar@redhat.com>
Link: https://lore.kernel.org/20250916063948.3154627-1-wangliang74@huawei.com
Fixes: 17f89102fe23 ("tracing/osnoise: Allow arbitrarily long CPU string")
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
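A sketch of the shape of the fix; the variable names are illustrative, not the exact osnoise_cpus_write() code:

  char *buf;

  buf = kmalloc(count + 1, GFP_KERNEL);
  if (!buf)
          return -ENOMEM;
  if (copy_from_user(buf, ubuf, count)) {
          kfree(buf);
          return -EFAULT;
  }
  buf[count] = '\0';  /* bitmap_parselist() needs '\0' or '\n' termination */

  err = bitmap_parselist(buf, cpumask_bits(new_mask), nr_cpu_ids);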
2025-09-23 | tracing: replace use of system_wq with system_percpu_wq | Marco Crivellari | 1 file, +1/-1

Currently, if a user enqueues a work item using schedule_delayed_work(), the workqueue used is "system_wq" (a per-cpu wq), while queue_delayed_work() uses WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work(), which uses system_wq, and queue_work(), which again makes use of WORK_CPU_UNBOUND.

This lack of consistency cannot be addressed without refactoring the API. system_wq is a per-CPU workqueue, yet nothing in its name tells about that CPU affinity constraint, which is very often not required by users. Make it clear by adding a system_percpu_wq.

queue_work() / queue_delayed_work() / mod_delayed_work() will now use the new per-cpu wq: if a user still sticks to the old name, a warning will be printed along with a redirect to the new wq.

This patch adds the new system_percpu_wq, except for the mm, fs and net subsystems, which are handled in separate patches. The old wq will be kept for a few release cycles.

Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/20250905091040.109772-2-marco.crivellari@suse.com
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
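A before/after usage sketch; my_work stands in for any statically declared work item:

  static void my_work_fn(struct work_struct *work) { /* ... */ }
  static DECLARE_WORK(my_work, my_work_fn);

  /* before: the per-CPU constraint is implicit in the name */
  queue_work(system_wq, &my_work);

  /* after: same per-CPU semantics, spelled out */
  queue_work(system_percpu_wq, &my_work);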
2025-09-23 | bpf: task work scheduling kfuncs | Mykyta Yatsenko | 1 file, +290/-2

Implementation of the new bpf_task_work_schedule kfuncs, which let a BPF program schedule task_work callbacks for a target task:

 * bpf_task_work_schedule_signal() - schedules with TWA_SIGNAL
 * bpf_task_work_schedule_resume() - schedules with TWA_RESUME

Each map value should embed a struct bpf_task_work, which the kernel side pairs with struct bpf_task_work_kern, containing a pointer to struct bpf_task_work_ctx, which maintains metadata relevant to the concrete callback scheduling. A small state machine and refcounting scheme ensures safe reuse and teardown.

State transitions (the states cycle):

  [standby] -> [pending] -> [scheduling] -> [scheduled] -> [running] -> [standby]

All states may transition into the terminal FREED state, which coordinates with map-value deletion (bpf_task_work_cancel_and_free()):

  [pending] [scheduling] [scheduled] [running] [standby] -> [freed]

Scheduling itself is deferred via irq_work to keep the kfunc callable from NMI context. Lifetime is guarded with refcount_t + RCU Tasks Trace.

Main components:

 * struct bpf_task_work_context - metadata and state management per task work.
 * enum bpf_task_work_state - a state machine to serialize work scheduling and execution.
 * bpf_task_work_schedule() - the central helper that initiates scheduling.
 * bpf_task_work_acquire_ctx() - attempts to take ownership of the context pointed to by the passed struct bpf_task_work; allocates a new context if none exists yet.
 * bpf_task_work_callback() - invoked when the actual task_work runs.
 * bpf_task_work_irq() - an intermediate step (runs in softirq context) to enqueue the task work.
 * bpf_task_work_cancel_and_free() - cleanup for deleted BPF map entries.

Flow of a successful task work scheduling:

 1) bpf_task_work_schedule_*() is called from BPF code.
 2) Transition state from STANDBY to PENDING; mark the context as owned by this task work scheduler.
 3) irq_work_queue() schedules bpf_task_work_irq().
 4) Transition state from PENDING to SCHEDULING (noop if transition successful).
 5) bpf_task_work_irq() attempts task_work_add(). If successful, state transitions to SCHEDULED.
 6) The task work calls bpf_task_work_callback(), which transitions state to RUNNING.
 7) The BPF callback is executed.
 8) The context is cleaned up, refcounts released, and the context state set back to STANDBY.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Reviewed-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250923112404.668720-8-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
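A rough BPF-side usage sketch under the description above; the exact kfunc and callback signatures are inferred, not quoted from the patch:

  struct elem {
          struct bpf_task_work tw;
  };

  struct {
          __uint(type, BPF_MAP_TYPE_HASH);
          __uint(max_entries, 128);
          __type(key, __u32);
          __type(value, struct elem);
  } tw_map SEC(".maps");

  static int tw_callback(struct bpf_map *map, void *key, void *value)
  {
          /* runs from task_work in the context of the target task */
          return 0;
  }

  /* in a program, with a valid task pointer and map value e: */
  /* bpf_task_work_schedule_signal(task, &e->tw, &tw_map, tw_callback, NULL); */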
2025-09-23 | bpf: extract map key pointer calculation | Mykyta Yatsenko | 1 file, +13/-17

Calculation of the BPF map key, given a pointer to a value, is already duplicated in a couple of places in helpers; the next patch introduces another use case as well. This patch extracts that functionality into a separate function.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250923112404.668720-7-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-23 | bpf: bpf task work plumbing | Mykyta Yatsenko | 6 files, +189/-18

This patch adds the necessary plumbing in the verifier, syscall and maps to support handling the new kfunc bpf_task_work_schedule and the kernel structure bpf_task_work. The idea is similar to how we already handle bpf_wq and bpf_timer.

The verifier changes validate calls to bpf_task_work_schedule to make sure it is safe and the expected invariants hold. The btf part is required to detect the bpf_task_work structure inside a map value and store its offset, which will be used in the next patch to calculate key and value addresses. The arraymap and hashtab changes are needed to handle freeing of the bpf_task_work: run the code needed to deinitialize it, for example cancel the task_work callback if possible.

The use of bpf_task_work and the proper implementation of the kfuncs are introduced in the next patch.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250923112404.668720-6-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-23 | bpf: verifier: permit non-zero returns from async callbacks | Mykyta Yatsenko | 1 file, +2/-3

The verifier currently enforces a zero return value for all async callbacks, a constraint originally introduced for bpf_timer. That restriction is too narrow for other async use cases. Relax the rule by allowing non-zero return codes from async callbacks in general, while preserving the zero-return requirement for bpf_timer to maintain its existing semantics.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250923112404.668720-5-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-23 | bpf: htab: extract helper for freeing special structs | Mykyta Yatsenko | 1 file, +12/-12

Extract the cleanup of known embedded structs into a dedicated helper. Remove duplication and introduce a single source of truth for freeing special embedded structs in hashtab.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250923112404.668720-4-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-23 | bpf: extract generic helper from process_timer_func() | Mykyta Yatsenko | 1 file, +36/-11

Refactor the verifier by pulling the common logic from process_timer_func() into a dedicated helper. This allows reusing the process_async_func() helper for verifying the bpf_task_work struct in the next patch.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Tested-by: syzbot@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/20250923112404.668720-3-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-23 | bpf: refactor special field-type detection | Mykyta Yatsenko | 1 file, +33/-51

Reduce code duplication in the detection of known special field types in map values. This refactoring helps avoid copying a chunk of code in the next patch of the series.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20250923112404.668720-2-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>