summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)AuthorLines
2025-12-30bpf: bpf_scc_visit instance and backedges accumulation for bpf_loop()Eduard Zingerman-4/+9
Calls like bpf_loop() or bpf_for_each_map_elem() introduce loops that are not explicitly present in the control-flow graph. The verifier processes such calls by repeatedly interpreting the callback function body within the same verification path (until the current state converges with a previous state). Such loops require a bpf_scc_visit instance in order to allow the accumulation of the state graph backedges. Otherwise, certain checkpoint states created within the bodies of such loops will have incomplete precision marks. See the next patch for an example of a program that leads to the verifier accepting an unsafe program. Fixes: 96c6aa4c63af ("bpf: compute SCCs in program control flow graph") Fixes: c9e31900b54c ("bpf: propagate read/precision marks over state graph backedges") Reported-by: Breno Leitao <leitao@debian.org> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Tested-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/20251229-scc-for-callbacks-v1-1-ceadfe679900@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-29Merge tag 'mm-hotfixes-stable-2025-12-28-21-50' of ↵Linus Torvalds-4/+12
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "27 hotfixes. 12 are cc:stable, 18 are MM. There's a patch series from Jiayuan Chen which fixes some issues with KASAN and vmalloc. Apart from that it's the usual shower of singletons - please see the respective changelogs for details" * tag 'mm-hotfixes-stable-2025-12-28-21-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (27 commits) mm/ksm: fix pte_unmap_unlock of wrong address in break_ksm_pmd_entry mm/page_owner: fix memory leak in page_owner_stack_fops->release() mm/memremap: fix spurious large folio warning for FS-DAX MAINTAINERS: notify the "Device Memory" community of memory hotplug changes sparse: update MAINTAINERS info mm/page_alloc: report 1 as zone_batchsize for !CONFIG_MMU mm: consider non-anon swap cache folios in folio_expected_ref_count() rust: maple_tree: rcu_read_lock() in destructor to silence lockdep mm: memcg: fix unit conversion for K() macro in OOM log mm: fixup pfnmap memory failure handling to use pgoff tools/mm/page_owner_sort: fix timestamp comparison for stable sorting selftests/mm: fix thread state check in uffd-unit-tests kernel/kexec: fix IMA when allocation happens in CMA area kernel/kexec: change the prototype of kimage_map_segment() MAINTAINERS: add ABI headers to KHO and LIVE UPDATE .mailmap: remove one of the entries for WangYuli mm/damon/vaddr: fix missing pte_unmap_unlock in damos_va_migrate_pmd_entry() MAINTAINERS: update one straggling entry for Bartosz Golaszewski mm/page_alloc: change all pageblocks migrate type on coalescing mm: leafops.h: correct kernel-doc function param. names ...
2025-12-28Merge tag 'sched_ext-for-6.19-rc3-fixes' of ↵Linus Torvalds-13/+10
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix uninitialized @ret on alloc_percpu() failure leading to ERR_PTR(0) - Fix PREEMPT_RT warning when bypass load balancer sends IPI to offline CPU by using resched_cpu() instead of resched_curr() - Fix comment referring to renamed function - Update scx_show_state.py for scx_root and scx_aborting changes * tag 'sched_ext-for-6.19-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: tools/sched_ext: update scx_show_state.py for scx_aborting change tools/sched_ext: fix scx_show_state.py for scx_root change sched_ext: Use the resched_cpu() to replace resched_curr() in the bypass_lb_node() sched_ext: Fix some comments in ext.c sched_ext: fix uninitialized ret on alloc_percpu() failure
2025-12-28Merge tag 'cgroup-for-6.19-rc3-fixes' of ↵Linus Torvalds-5/+16
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fix from Tejun Heo: - Fix a spurious cpuset warning when disabling remote partition after CPU hotplug leaves subpartitions_cpus empty. Guard the warning and invalidate affected partitions. * tag 'cgroup-for-6.19-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cpuset: fix warning when disabling remote partition
2025-12-28PM: sleep: Fix suspend_test() at the TEST_CORE levelRafael J. Wysocki-3/+6
Commit a10ad1b10402 ("PM: suspend: Make pm_test delay interruptible by wakeup events") replaced mdelay() in suspend_test() with msleep() which does not work at the TEST_CORE test level that calls suspend_test() while running on one CPU with interrupts off. Address this by making suspend_test() check if the test level is suitable for using msleep() and use mdelay() otherwise. Fixes: a10ad1b10402 ("PM: suspend: Make pm_test delay interruptible by wakeup events") Reported-by: Sebastian Reichel <sebastian.reichel@collabora.com> Closes: https://lore.kernel.org/linux-pm/aUsAk0k1N9hw8IkY@venus/ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Sebastian Reichel <sebastian.reichel@collabora.com> Link: https://patch.msgid.link/6251576.lOV4Wx5bFT@rafael.j.wysocki
2025-12-26Merge tag 'efi-fixes-for-v6.19-1' of ↵Linus Torvalds-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI fixes from Ard Biesheuvel: "A couple of fixes for EFI regressions introduced this cycle: - Make EDID handling in the EFI stub mixed mode safe - Ensure that efi_mm.user_ns has a sane value - this is needed now that EFI runtime calls are preemptible on arm64" * tag 'efi-fixes-for-v6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: kthread: Warn if mm_struct lacks user_ns in kthread_use_mm() arm64: efi: Fix NULL pointer dereference by initializing user_ns efi/libstub: gop: Fix EDID support in mixed-mode
2025-12-26irqdomain: Export irq_domain_free_irqs()Aaron Kling-0/+1
Export irq_domain_free_irqs() to allow PCI/MSI drivers like pci-tegra to be built as a module. Signed-off-by: Aaron Kling <webgeek1234@gmail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20250731-pci-tegra-module-v7-1-cad4b088b8fb@gmail.com
2025-12-24kthread: Warn if mm_struct lacks user_ns in kthread_use_mm()Breno Leitao-0/+1
Add a WARN_ON_ONCE() check to detect mm_struct instances that are missing user_ns initialization when passed to kthread_use_mm(). When a kthread adopts an mm via kthread_use_mm(), LSM hooks and capability checks may access current->mm->user_ns for credential validation. If user_ns is NULL, this leads to a NULL pointer dereference crash. This was observed with efi_mm on arm64, where commit a5baf582f4c0 ("arm64/efi: Call EFI runtime services without disabling preemption") introduced kthread_use_mm(&efi_mm), but efi_mm lacked user_ns initialization, causing crashes during /proc access. Adding this warning helps catch similar bugs early during development rather than waiting for hard-to-debug NULL pointer crashes in production. Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Rik van Riel <riel@surriel.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2025-12-23bpf: arena: make arena kfuncs any context safePuranjay Mohan-41/+217
Make arena related kfuncs any context safe by the following changes: bpf_arena_alloc_pages() and bpf_arena_reserve_pages(): Replace the usage of the mutex with a rqspinlock for range tree and use kmalloc_nolock() wherever needed. Use free_pages_nolock() to free pages from any context. apply_range_set/clear_cb() with apply_to_page_range() has already made populating the vm_area in bpf_arena_alloc_pages() any context safe. bpf_arena_free_pages(): defer the main logic to a workqueue if it is called from a non-sleepable context. specialize_kfunc() is used to replace the sleepable arena_free_pages() with bpf_arena_free_pages_non_sleepable() when the verifier detects the call is from a non-sleepable context. In the non-sleepable case, arena_free_pages() queues the address and the page count to be freed to a lock-less list of struct arena_free_spans and raises an irq_work. The irq_work handler calls schedules_work() as it is safe to be called from irq context. arena_free_worker() (the work queue handler) iterates these spans and clears ptes, flushes tlb, zaps pages, and calls __free_page(). Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20251222195022.431211-4-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-23bpf: arena: use kmalloc_nolock() in place of kvcalloc()Puranjay Mohan-29/+55
To make arena_alloc_pages() safe to be called from any context, replace kvcalloc() with kmalloc_nolock() so as it doesn't sleep or take any locks. kmalloc_nolock() returns NULL for allocations larger than KMALLOC_MAX_CACHE_SIZE, which is (PAGE_SIZE * 2) = 8KB on systems with 4KB pages. So, round down the allocation done by kmalloc_nolock to 1024 * 8 and reuse the array in a loop. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20251222195022.431211-3-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-23bpf: arena: populate vm_area without allocating memoryPuranjay Mohan-10/+90
vm_area_map_pages() may allocate memory while inserting pages into bpf arena's vm_area. In order to make bpf_arena_alloc_pages() kfunc non-sleepable change bpf arena to populate pages without allocating memory: - at arena creation time populate all page table levels except the last level - when new pages need to be inserted call apply_to_page_range() again with apply_range_set_cb() which will only set_pte_at() those pages and will not allocate memory. - when freeing pages call apply_to_existing_page_range with apply_range_clear_cb() to clear the pte for the page to be removed. This doesn't free intermediate page table levels. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20251222195022.431211-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-23kernel/kexec: fix IMA when allocation happens in CMA areaPingfan Liu-2/+7
*** Bug description *** When I tested kexec with the latest kernel, I ran into the following warning: [ 40.712410] ------------[ cut here ]------------ [ 40.712576] WARNING: CPU: 2 PID: 1562 at kernel/kexec_core.c:1001 kimage_map_segment+0x144/0x198 [...] [ 40.816047] Call trace: [ 40.818498] kimage_map_segment+0x144/0x198 (P) [ 40.823221] ima_kexec_post_load+0x58/0xc0 [ 40.827246] __do_sys_kexec_file_load+0x29c/0x368 [...] [ 40.855423] ---[ end trace 0000000000000000 ]--- *** How to reproduce *** This bug is only triggered when the kexec target address is allocated in the CMA area. If no CMA area is reserved in the kernel, use the "cma=" option in the kernel command line to reserve one. *** Root cause *** The commit 07d24902977e ("kexec: enable CMA based contiguous allocation") allocates the kexec target address directly on the CMA area to avoid copying during the jump. In this case, there is no IND_SOURCE for the kexec segment. But the current implementation of kimage_map_segment() assumes that IND_SOURCE pages exist and map them into a contiguous virtual address by vmap(). *** Solution *** If IMA segment is allocated in the CMA area, use its page_address() directly. Link: https://lkml.kernel.org/r/20251216014852.8737-2-piliu@redhat.com Fixes: 07d24902977e ("kexec: enable CMA based contiguous allocation") Signed-off-by: Pingfan Liu <piliu@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Alexander Graf <graf@amazon.com> Cc: Steven Chen <chenste@linux.microsoft.com> Cc: Mimi Zohar <zohar@linux.ibm.com> Cc: Roberto Sassu <roberto.sassu@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-12-23kernel/kexec: change the prototype of kimage_map_segment()Pingfan Liu-3/+6
The kexec segment index will be required to extract the corresponding information for that segment in kimage_map_segment(). Additionally, kexec_segment already holds the kexec relocation destination address and size. Therefore, the prototype of kimage_map_segment() can be changed. Link: https://lkml.kernel.org/r/20251216014852.8737-1-piliu@redhat.com Fixes: 07d24902977e ("kexec: enable CMA based contiguous allocation") Signed-off-by: Pingfan Liu <piliu@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Mimi Zohar <zohar@linux.ibm.com> Cc: Roberto Sassu <roberto.sassu@huawei.com> Cc: Alexander Graf <graf@amazon.com> Cc: Steven Chen <chenste@linux.microsoft.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-12-22bpf: crypto: replace -EEXIST with -EBUSYDaniel Gomez-1/+1
The -EEXIST error code is reserved by the module loading infrastructure to indicate that a module is already loaded. When a module's init function returns -EEXIST, userspace tools like kmod interpret this as "module already loaded" and treat the operation as successful, returning 0 to the user even though the module initialization actually failed. This follows the precedent set by commit 54416fd76770 ("netfilter: conntrack: helper: Replace -EEXIST by -EBUSY") which fixed the same issue in nf_conntrack_helper_register(). This affects bpf_crypto_skcipher module. While the configuration required to build it as a module is unlikely in practice, it is technically possible, so fix it for correctness. Signed-off-by: Daniel Gomez <da.gomez@samsung.com> Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://lore.kernel.org/r/20251220-dev-module-init-eexists-bpf-v1-1-7f186663dbe7@samsung.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-22bpf: allow calling kfuncs from raw_tp programsPuranjay Mohan-0/+1
Associate raw tracepoint program type with the kfunc tracing hook. This allows calling kfuncs from raw_tp programs. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20251222133250.1890587-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-22sched_ext: Avoid multiple irq_work_queue() calls in destroy_dsq()Zqiang-2/+2
llist_add() returns true only when adding to an empty list, which indicates that no IRQ work is currently queued or running. Therefore, we only need to call irq_work_queue() when llist_add() returns true, to avoid unnecessarily re-queueing IRQ work that is already pending or executing. Signed-off-by: Zqiang <qiang.zhang@linux.dev> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-22sched_ext: Use the resched_cpu() to replace resched_curr() in the ↵Zqiang-7/+2
bypass_lb_node() For the PREEMPT_RT kernels, the scx_bypass_lb_timerfn() running in the preemptible per-CPU ktimer kthread context, this means that the following scenarios will occur(for x86 platform): cpu1 cpu2 ktimer kthread: ->scx_bypass_lb_timerfn ->bypass_lb_node ->for_each_cpu(cpu, resched_mask) migration/1: by preempt by migration/2: multi_cpu_stop() multi_cpu_stop() ->take_cpu_down() ->__cpu_disable() ->set cpu1 offline ->rq1 = cpu_rq(cpu1) ->resched_curr(rq1) ->smp_send_reschedule(cpu1) ->native_smp_send_reschedule(cpu1) ->if(unlikely(cpu_is_offline(cpu))) { WARN(1, "sched: Unexpected reschedule of offline CPU#%d!\n", cpu); return; } This commit therefore use the resched_cpu() to replace resched_curr() in the bypass_lb_node() to avoid send-ipi to offline CPUs. Signed-off-by: Zqiang <qiang.zhang@linux.dev> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-22cpuset: remove dead code in cpuset-v1.cChen Ridong-4/+1
The commit 6e1d31ce495c ("cpuset: separate generate_sched_domains for v1 and v2") introduced dead code that was originally added for cpuset-v2 partition domain generation. Remove the redundant root_load_balance check. Fixes: 6e1d31ce495c ("cpuset: separate generate_sched_domains for v1 and v2") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/cgroups/9a442808-ed53-4657-988b-882cc0014c0d@huaweicloud.com/T/ Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-22module/decompress: Avoid open-coded kvrealloc()Kees Cook-5/+5
Replace open-coded allocate/copy with kvrealloc(). Signed-off-by: Kees Cook <kees@kernel.org> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
2025-12-22module: Remove SHA-1 support for module signingPetr Pavlu-5/+0
SHA-1 is considered deprecated and insecure due to vulnerabilities that can lead to hash collisions. Most distributions have already been using SHA-2 for module signing because of this. The default was also changed last year from SHA-1 to SHA-512 in commit f3b93547b91a ("module: sign with sha512 instead of sha1 by default"). This was not reported to cause any issues. Therefore, it now seems to be a good time to remove SHA-1 support for module signing. Commit 16ab7cb5825f ("crypto: pkcs7 - remove sha1 support") previously removed support for reading PKCS#7/CMS signed with SHA-1, along with the ability to use SHA-1 for module signing. This change broke iwd and was subsequently completely reverted in commit 203a6763ab69 ("Revert "crypto: pkcs7 - remove sha1 support""). However, dropping only the support for using SHA-1 for module signing is unrelated and can still be done separately. Note that this change only removes support for new modules to be SHA-1 signed, but already signed modules can still be loaded. Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Aaron Tomlin <atomlin@atomlin.com> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
2025-12-22module: replace use of system_wq with system_dfl_wqMarco Crivellari-2/+2
Currently if a user enqueues a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API. This continues the effort to refactor workqueue APIs, which began with the introduction of new workqueues and a new alloc_workqueue flag in: commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq") commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag") Switch to using system_dfl_wq, the new unbound workqueue, because the users do not benefit from a per-cpu workqueue. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
2025-12-22params: Replace __modinit with __init_or_modulePetr Pavlu-10/+5
Remove the custom __modinit macro from kernel/params.c and instead use the common __init_or_module macro from include/linux/module.h. Both provide the same functionality. Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Aaron Tomlin <atomlin@atomlin.com> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
2025-12-21Merge tag 'irq-urgent-2025-12-21' of ↵Linus Torvalds-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fix from Ingo Molnar: "Fix IRQ thread affinity flags setup regression" * tag 'irq-urgent-2025-12-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Don't overwrite interrupt thread flags on setup
2025-12-21bpf: annotate file argument as __nullable in bpf_lsm_mmap_fileMatt Bobrowski-3/+33
As reported in [0], anonymous memory mappings are not backed by a struct file instance. Consequently, the struct file pointer passed to the security_mmap_file() LSM hook is NULL in such cases. The BPF verifier is currently unaware of this, allowing BPF LSM programs to dereference this struct file pointer without needing to perform an explicit NULL check. This leads to potential NULL pointer dereference and a kernel crash. Add a strong override for bpf_lsm_mmap_file() which annotates the struct file pointer parameter with the __nullable suffix. This explicitly informs the BPF verifier that this pointer (PTR_MAYBE_NULL) can be NULL, forcing BPF LSM programs to perform a check on it before dereferencing it. [0] https://lore.kernel.org/bpf/5e460d3c.4c3e9.19adde547d8.Coremail.kaiyanm@hust.edu.cn/ Reported-by: Kaiyan Mei <M202472210@hust.edu.cn> Reported-by: Yinhao Hu <dddddd@hust.edu.cn> Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn> Closes: https://lore.kernel.org/bpf/5e460d3c.4c3e9.19adde547d8.Coremail.kaiyanm@hust.edu.cn/ Signed-off-by: Matt Bobrowski <mattbobrowski@google.com> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20251216133000.3690723-1-mattbobrowski@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-21bpf: arm64: Optimize recursion detection by not using atomicsPuranjay Mohan-1/+2
BPF programs detect recursion using a per-CPU 'active' flag in struct bpf_prog. The trampoline currently sets/clears this flag with atomic operations. On some arm64 platforms (e.g., Neoverse V2 with LSE), per-CPU atomic operations are relatively slow. Unlike x86_64 - where per-CPU updates can avoid cross-core atomicity, arm64 LSE atomics are always atomic across all cores, which is unnecessary overhead for strictly per-CPU state. This patch removes atomics from the recursion detection path on arm64 by changing 'active' to a per-CPU array of four u8 counters, one per context: {NMI, hard-irq, soft-irq, normal}. The running context uses a non-atomic increment/decrement on its element. After increment, recursion is detected by reading the array as a u32 and verifying that only the expected element changed; any change in another element indicates inter-context recursion, and a value > 1 in the same element indicates same-context recursion. For example, starting from {0,0,0,0}, a normal-context trigger changes the array to {0,0,0,1}. If an NMI arrives on the same CPU and triggers the program, the array becomes {1,0,0,1}. When the NMI context checks the u32 against the expected mask for normal (0x00000001), it observes 0x01000001 and correctly reports recursion. Same-context recursion is detected analogously. Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20251219184422.2899902-3-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-21bpf: move recursion detection logic to helpersPuranjay Mohan-6/+6
BPF programs detect recursion by doing atomic inc/dec on a per-cpu active counter from the trampoline. Create two helpers for operations on this active counter, this makes it easy to changes the recursion detection logic in future. This commit makes no functional changes. Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20251219184422.2899902-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-19sched_ext: Fix some comments in ext.cZqiang-5/+5
This commit update balance_scx() in the comments to balance_one(). Signed-off-by: Zqiang <qiang.zhang@linux.dev> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-19sched/fair: Fix sched_avg foldPeter Zijlstra-2/+4
After the robot reported a regression wrt commit: 089d84203ad4 ("sched/fair: Fold the sched_avg update"), Shrikanth noted that two spots missed a factor se_weight(). Fixes: 089d84203ad4 ("sched/fair: Fold the sched_avg update") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202512181208.753b9f6e-lkp@intel.com Debugged-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251218102020.GO3707891@noisy.programming.kicks-ass.net
2025-12-19perf: Use EXPORT_SYMBOL_FOR_KVM() for the mediated APIsPeter Zijlstra-2/+3
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251208115156.GE3707891@noisy.programming.kicks-ass.net
2025-12-19irqdomain: Fix up const problem in irq_domain_set_name()Greg Kroah-Hartman-1/+1
In irq_domain_set_name() a const pointer is passed in, and then the const is "lost" when container_of() is called. Fix this up by properly preserving the const pointer attribute when container_of() is used to enforce the fact that this pointer should not have anything at it changed. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/2025121731-facing-unhitched-63ae@gregkh
2025-12-19Merge tag 'trace-v6.19-rc1' of ↵Linus Torvalds-3/+8
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Add Documentation/core-api/tracepoint.rst to TRACING in MAINTAINERS file Updates to the tracepoint.rst document should be reviewed by the tracing maintainers. - Fix warning triggered by perf attaching to synthetic events The synthetic events do not add a function to be registered when perf attaches to them. This causes a warning when perf registers a synthetic event and passes a NULL pointer to the tracepoint register function. Ideally synthetic events should be updated to work with perf, but as that's a feature and not a bug fix, simply now return -ENODEV when perf tries to register an event that has a NULL pointer for its function. This no longer causes a kernel warning and simply causes the perf code to fail with an error message. - Fix 32bit overflow in option flag test The option's flags changed from 32 bits in size to 64 bits in size. Fix one of the places that shift 1 by the option bit number to to be 1ULL. - Fix the output of printing the direct jmp functions The enabled_functions that shows how functions are being attached by ftrace wasn't updated to accommodate the new direct jmp trampolines that set the LSB of the pointer, and outputs garbage. Update the output to handle the direct jmp trampolines. * tag 'trace-v6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ftrace: Fix address for jmp mode in t_show() tracing: Fix UBSAN warning in __remove_instance() tracing: Do not register unsupported perf events MAINTAINERS: add tracepoint core-api doc files to TRACING
2025-12-19Merge tag 'net-6.19-rc2' of ↵Linus Torvalds-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from netfilter and CAN. Current release - regressions: - netfilter: nf_conncount: fix leaked ct in error paths - sched: act_mirred: fix loop detection - sctp: fix potential deadlock in sctp_clone_sock() - can: fix build dependency - eth: mlx5e: do not update BQL of old txqs during channel reconfiguration Previous releases - regressions: - sched: ets: always remove class from active list before deleting it - inet: frags: flush pending skbs in fqdir_pre_exit() - netfilter: nf_nat: remove bogus direction check - mptcp: - schedule rtx timer only after pushing data - avoid deadlock on fallback while reinjecting - can: gs_usb: fix error handling - eth: - mlx5e: - avoid unregistering PSP twice - fix double unregister of HCA_PORTS component - bnxt_en: fix XDP_TX path - mlxsw: fix use-after-free when updating multicast route stats Previous releases - always broken: - ethtool: avoid overflowing userspace buffer on stats query - openvswitch: fix middle attribute validation in push_nsh() action - eth: - mlx5: fw_tracer, validate format string parameters - mlxsw: spectrum_router: fix neighbour use-after-free - ipvlan: ignore PACKET_LOOPBACK in handle_mode_l2() Misc: - Jozsef Kadlecsik retires from maintaining netfilter - tools: ynl: fix build on systems with old kernel headers" * tag 'net-6.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits) net: hns3: add VLAN id validation before using net: hns3: using the num_tqps to check whether tqp_index is out of range when vf get ring info from mbx net: hns3: using the num_tqps in the vf driver to apply for resources net: enetc: do not transmit redirected XDP frames when the link is down selftests/tc-testing: Test case exercising potential mirred redirect deadlock net/sched: act_mirred: fix loop detection sctp: Clear inet_opt in sctp_v6_copy_ip_options(). sctp: Fetch inet6_sk() after setting ->pinet6 in sctp_clone_sock(). net/handshake: duplicate handshake cancellations leak socket net/mlx5e: Don't include PSP in the hard MTU calculations net/mlx5e: Do not update BQL of old txqs during channel reconfiguration net/mlx5e: Trigger neighbor resolution for unresolved destinations net/mlx5e: Use ip6_dst_lookup instead of ipv6_dst_lookup_flow for MAC init net/mlx5: Serialize firmware reset with devlink net/mlx5: fw_tracer, Handle escaped percent properly net/mlx5: fw_tracer, Validate format string parameters net/mlx5: Drain firmware reset in shutdown callback net/mlx5: fw reset, clear reset requested on drain_fw_reset net: dsa: mxl-gsw1xx: manually clear RANEG bit net: dsa: mxl-gsw1xx: fix .shutdown driver operation ...
2025-12-18cpuset: remove v1-specific code from generate_sched_domainsChen Ridong-129/+28
Following the introduction of cpuset1_generate_sched_domains() for v1 in the previous patch, v1-specific logic can now be removed from the generic generate_sched_domains(). This patch cleans up the v1-only code and ensures uf_node is only visible when CONFIG_CPUSETS_V1=y. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-18cpuset: separate generate_sched_domains for v1 and v2Chen Ridong-27/+185
The generate_sched_domains() function currently handles both v1 and v2 logic. However, the underlying mechanisms for building scheduler domains differ significantly between the two versions. For cpuset v2, scheduler domains are straightforwardly derived from valid partitions, whereas cpuset v1 employs a more complex union-find algorithm to merge overlapping cpusets. Co-locating these implementations complicates maintenance. This patch, along with subsequent ones, aims to separate the v1 and v2 logic. For ease of review, this patch first copies the generate_sched_domains() function into cpuset-v1.c as cpuset1_generate_sched_domains() and removes v2-specific code. Common helpers and top_cpuset are declared in cpuset-internal.h. When operating in v1 mode, the code now calls cpuset1_generate_sched_domains(). Currently there is some code duplication, which will be largely eliminated once v1-specific code is removed from v2 in the following patch. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-18cpuset: move update_domain_attr_tree to cpuset_v1.cChen Ridong-34/+36
Since relax_domain_level is only applicable to v1, move update_domain_attr_tree() to cpuset-v1.c, which solely updates relax_domain_level, Additionally, relax_domain_level is now initialized in cpuset1_inited. Accordingly, the initialization of relax_domain_level in top_cpuset is removed. The unnecessary remote_partition initialization in top_cpuset is also cleaned up. As a result, relax_domain_level can be defined in cpuset only when CONFIG_CPUSETS_V1=y. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-18cpuset: add cpuset1_init helper for v1 initializationChen Ridong-7/+14
This patch introduces the cpuset1_init helper in cpuset_v1.c to initialize v1-specific fields, including the fmeter and relax_domain_level members. The relax_domain_level related code will be moved to cpuset_v1.c in a subsequent patch. After this move, v1-specific members will only be visible when CONFIG_CPUSETS_V1=y. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-18cpuset: add cpuset1_online_css helper for v1-specific operationsChen Ridong-38/+51
This commit introduces the cpuset1_online_css helper to centralize v1-specific handling during cpuset online. It performs operations such as updating the CS_SPREAD_PAGE, CS_SPREAD_SLAB, and CGRP_CPUSET_CLONE_CHILDREN flags, which are unique to the cpuset v1 control group interface. The helper is now placed in cpuset-v1.c to maintain clear separation between v1 and v2 logic. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-18cpuset: add lockdep_assert_cpuset_lock_held helperChen Ridong-0/+5
Add lockdep_assert_cpuset_lock_held() to allow other subsystems to verify that cpuset_mutex is held. Suggested-by: Waiman Long <longman@redhat.com> Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-18cpuset: fix warning when disabling remote partitionChen Ridong-5/+16
A warning was triggered as follows: WARNING: kernel/cgroup/cpuset.c:1651 at remote_partition_disable+0xf7/0x110 RIP: 0010:remote_partition_disable+0xf7/0x110 RSP: 0018:ffffc90001947d88 EFLAGS: 00000206 RAX: 0000000000007fff RBX: ffff888103b6e000 RCX: 0000000000006f40 RDX: 0000000000006f00 RSI: ffffc90001947da8 RDI: ffff888103b6e000 RBP: ffff888103b6e000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000001 R11: ffff88810b2e2728 R12: ffffc90001947da8 R13: 0000000000000000 R14: ffffc90001947da8 R15: ffff8881081f1c00 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f55c8bbe0b2 CR3: 000000010b14c000 CR4: 00000000000006f0 Call Trace: <TASK> update_prstate+0x2d3/0x580 cpuset_partition_write+0x94/0xf0 kernfs_fop_write_iter+0x147/0x200 vfs_write+0x35d/0x500 ksys_write+0x66/0xe0 do_syscall_64+0x6b/0x390 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7f55c8cd4887 Reproduction steps (on a 16-CPU machine): # cd /sys/fs/cgroup/ # mkdir A1 # echo +cpuset > A1/cgroup.subtree_control # echo "0-14" > A1/cpuset.cpus.exclusive # mkdir A1/A2 # echo "0-14" > A1/A2/cpuset.cpus.exclusive # echo "root" > A1/A2/cpuset.cpus.partition # echo 0 > /sys/devices/system/cpu/cpu15/online # echo member > A1/A2/cpuset.cpus.partition When CPU 15 is offlined, subpartitions_cpus gets cleared because no CPUs remain available for the top_cpuset, forcing partitions to share CPUs with the top_cpuset. In this scenario, disabling the remote partition triggers a warning stating that effective_xcpus is not a subset of subpartitions_cpus. Partitions should be invalidated in this case to inform users that the partition is now invalid(cpus are shared with top_cpuset). To fix this issue: 1. Only emit the warning only if subpartitions_cpus is not empty and the effective_xcpus is not a subset of subpartitions_cpus. 2. During the CPU hotplug process, invalidate partitions if subpartitions_cpus is empty. Fixes: f62a5d39368e ("cgroup/cpuset: Remove remote_partition_check() & make update_cpumasks_hier() handle remote partition") Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-18test-ww_mutex: Allow test to be run (and re-run) from userlandJohn Stultz-2/+49
In cases where the ww_mutex test was occasionally tripping on hard to find issues, leaving qemu in a reboot loop was my best way to reproduce problems. These reboots however wasted time when I just wanted to run the test-ww_mutex logic. So tweak the test-ww_mutex test so that it can be re-triggered via a sysfs file, so the test can be run repeatedly without doing module loads or restarting. This has been particularly valuable to stressing and finding issues with the proxy-exec series. To use, run as root: echo 1 > /sys/kernel/test_ww_mutex/run_tests Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251205013515.759030-4-jstultz@google.com
2025-12-18test-ww_mutex: Move work to its own UNBOUND workqueueJohn Stultz-2/+2
The test-ww_mutex test already allocates its own workqueue so be sure to use it for the mtx.work and abba.work rather then the default system workqueue. This resolves numerous messages of the sort: "workqueue: test_abba_work hogged CPU... consider switching to WQ_UNBOUND" "workqueue: test_mutex_work hogged CPU... consider switching to WQ_UNBOUND" Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251205013515.759030-3-jstultz@google.com
2025-12-18test-ww_mutex: Extend ww_mutex tests to test both classes of ww_mutexesJohn Stultz-41/+73
Currently the test-ww_mutex tool only utilizes the wait-die class of ww_mutexes, and thus isn't very helpful in exercising the wait-wound class of ww_mutexes. So extend the test to exercise both classes of ww_mutexes for all of the subtests. Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251205013515.759030-2-jstultz@google.com
2025-12-17ftrace: Fix address for jmp mode in t_show()Menglong Dong-2/+5
The address from ftrace_find_rec_direct() is printed directly in t_show(). This can mislead symbol offsets if it has the "jmp" bit in the last bit. Fix this by printing the address that returned by ftrace_jmp_get(). Link: https://patch.msgid.link/20251217030053.80343-1-dongml2@chinatelecom.cn Fixes: 25e4e3565d45 ("ftrace: Introduce FTRACE_OPS_FL_JMP") Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-12-17tracing: Fix UBSAN warning in __remove_instance()Darrick J. Wong-1/+1
xfs/558 triggers the following UBSAN warning: ------------[ cut here ]------------ UBSAN: shift-out-of-bounds in kernel/trace/trace.c:10510:10 shift exponent 32 is too large for 32-bit type 'int' CPU: 1 UID: 0 PID: 888674 Comm: rmdir Not tainted 6.19.0-rc1-xfsx #rc1 PREEMPT(lazy) dbf607ef4c142c563f76d706e71af9731d7b9c90 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-4.module+el8.8.0+21164+ed375313 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x4a/0x70 ubsan_epilogue+0x5/0x2b __ubsan_handle_shift_out_of_bounds.cold+0x5e/0x113 __remove_instance.part.0.constprop.0.cold+0x18/0x26f instance_rmdir+0xf3/0x110 tracefs_syscall_rmdir+0x4d/0x90 vfs_rmdir+0x139/0x230 do_rmdir+0x143/0x230 __x64_sys_rmdir+0x1d/0x20 do_syscall_64+0x44/0x230 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7f7ae8e51f17 Code: f0 ff ff 73 01 c3 48 8b 0d de 2e 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 54 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 b1 2e 0e 00 f7 d8 64 89 02 b8 RSP: 002b:00007ffd90743f08 EFLAGS: 00000246 ORIG_RAX: 0000000000000054 RAX: ffffffffffffffda RBX: 00007ffd907440f8 RCX: 00007f7ae8e51f17 RDX: 00007f7ae8f3c5c0 RSI: 00007ffd90744a21 RDI: 00007ffd90744a21 RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f7ae8f35ac0 R11: 0000000000000246 R12: 00007ffd90744a21 R13: 0000000000000001 R14: 00007f7ae8f8b000 R15: 000055e5283e6a98 </TASK> ---[ end trace ]--- whilst tearing down an ftrace instance. TRACE_FLAGS_MAX_SIZE is now 64bit, so the mask comparison expression must be typecast to a u64 value to avoid an overflow. AFAICT, ZEROED_TRACE_FLAGS is already cast to ULL so this is ok. Link: https://patch.msgid.link/20251216174950.GA7705@frogsfrogsfrogs Fixes: bbec8e28cac592 ("tracing: Allow tracer to add more than 32 options") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-12-17tracing: Do not register unsupported perf eventsSteven Rostedt-0/+2
Synthetic events currently do not have a function to register perf events. This leads to calling the tracepoint register functions with a NULL function pointer which triggers: ------------[ cut here ]------------ WARNING: kernel/tracepoint.c:175 at tracepoint_add_func+0x357/0x370, CPU#2: perf/2272 Modules linked in: kvm_intel kvm irqbypass CPU: 2 UID: 0 PID: 2272 Comm: perf Not tainted 6.18.0-ftest-11964-ge022764176fc-dirty #323 PREEMPTLAZY Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-debian-1.17.0-1 04/01/2014 RIP: 0010:tracepoint_add_func+0x357/0x370 Code: 28 9c e8 4c 0b f5 ff eb 0f 4c 89 f7 48 c7 c6 80 4d 28 9c e8 ab 89 f4 ff 31 c0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc cc <0f> 0b 49 c7 c6 ea ff ff ff e9 ee fe ff ff 0f 0b e9 f9 fe ff ff 0f RSP: 0018:ffffabc0c44d3c40 EFLAGS: 00010246 RAX: 0000000000000001 RBX: ffff9380aa9e4060 RCX: 0000000000000000 RDX: 000000000000000a RSI: ffffffff9e1d4a98 RDI: ffff937fcf5fd6c8 RBP: 0000000000000001 R08: 0000000000000007 R09: ffff937fcf5fc780 R10: 0000000000000003 R11: ffffffff9c193910 R12: 000000000000000a R13: ffffffff9e1e5888 R14: 0000000000000000 R15: ffffabc0c44d3c78 FS: 00007f6202f5f340(0000) GS:ffff93819f00f000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055d3162281a8 CR3: 0000000106a56003 CR4: 0000000000172ef0 Call Trace: <TASK> tracepoint_probe_register+0x5d/0x90 synth_event_reg+0x3c/0x60 perf_trace_event_init+0x204/0x340 perf_trace_init+0x85/0xd0 perf_tp_event_init+0x2e/0x50 perf_try_init_event+0x6f/0x230 ? perf_event_alloc+0x4bb/0xdc0 perf_event_alloc+0x65a/0xdc0 __se_sys_perf_event_open+0x290/0x9f0 do_syscall_64+0x93/0x7b0 ? entry_SYSCALL_64_after_hwframe+0x76/0x7e ? trace_hardirqs_off+0x53/0xc0 entry_SYSCALL_64_after_hwframe+0x76/0x7e Instead, have the code return -ENODEV, which doesn't warn and has perf error out with: # perf record -e synthetic:futex_wait Error: The sys_perf_event_open() syscall returned with 19 (No such device) for event (synthetic:futex_wait). "dmesg | grep -i perf" may provide additional information. Ideally perf should support synthetic events, but for now just fix the warning. The support can come later. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://patch.msgid.link/20251216182440.147e4453@gandalf.local.home Fixes: 4b147936fa509 ("tracing: Add support for 'synthetic' events") Reported-by: Ian Rogers <irogers@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-12-17perf: Clean up mediated vPMU accountingPeter Zijlstra-2/+7
The mediated_pmu_account_event() and perf_create_mediated_pmu() functions implement the exclusion between '!exclude_guest' counters and mediated vPMUs. Their implementation is basically identical, except mirrored in what they count/check. Make sure the actual implementations reflect this similarity. Notably: - while perf_release_mediated_pmu() has an underflow check; mediated_pmu_unaccount_event() did not. - while perf_create_mediated_pmu() has an inc_not_zero() path; mediated_pmu_account_event() did not. Also, the inc_not_zero() path can be outsite of perf_mediated_pmu_mutex. The mutex must guard the 0->1 (of either nr_include_guest_events or nr_mediated_pmu_vms) transition, but once a counter is already non-zero, it can safely be incremented further. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251208115156.GE3707891@noisy.programming.kicks-ass.net
2025-12-17unwind_user/fp: Use dummies instead of ifdefJens Remus-4/+0
This simplifies the code. unwind_user_next_fp() does not need to return -EINVAL if config option HAVE_UNWIND_USER_FP is disabled, as unwind_user_start() will then not select this unwind method and unwind_user_next() will therefore not call it. Provide (1) a dummy definition of ARCH_INIT_USER_FP_FRAME, if the unwind user method HAVE_UNWIND_USER_FP is not enabled, (2) a common fallback definition of unwind_user_at_function_start() which returns false, and (3) a common dummy definition of ARCH_INIT_USER_FP_ENTRY_FRAME. Note that enabling the config option HAVE_UNWIND_USER_FP without defining ARCH_INIT_USER_FP_FRAME triggers a compile error, which is helpful when implementing support for this unwind user method in an architecture. Enabling the config option when providing an arch- specific unwind_user_at_function_start() definition makes it necessary to also provide an arch-specific ARCH_INIT_USER_FP_ENTRY_FRAME definition. Signed-off-by: Jens Remus <jremus@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251208160352.1363040-3-jremus@linux.ibm.com
2025-12-17unwind_user: Enhance comments on get CFA, FP, and RAJens Remus-4/+4
Move the comment "Get the Canonical Frame Address (CFA)" to the top of the sequence of statements that actually get the CFA. Reword the comment "Find the Return Address (RA)" to "Get ...", as the statements actually get the RA. Add a respective comment to the statements that get the FP. This will be useful once future commits extend the logic to get the RA and FP. While at it align the comment on the "stack going in wrong direction" check to the following one on the "address is word aligned" check. Signed-off-by: Jens Remus <jremus@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251208160352.1363040-2-jremus@linux.ibm.com
2025-12-17perf/x86/core: Register a new vector for handling mediated guest PMIsSean Christopherson-2/+7
Wire up system vector 0xf5 for handling PMIs (i.e. interrupts delivered through the LVTPC) while running KVM guests with a mediated PMU. Perf currently delivers all PMIs as NMIs, e.g. so that events that trigger while IRQs are disabled aren't delayed and generate useless records, but due to the multiplexing of NMIs throughout the system, correctly identifying NMIs for a mediated PMU is practically infeasible. To (greatly) simplify identifying guest mediated PMU PMIs, perf will switch the CPU's LVTPC between PERF_GUEST_MEDIATED_PMI_VECTOR and NMI when guest PMU context is loaded/put. I.e. PMIs that are generated by the CPU while the guest is active will be identified purely based on the IRQ vector. Route the vector through perf, e.g. as opposed to letting KVM attach a handler directly a la posted interrupt notification vectors, as perf owns the LVTPC and thus is the rightful owner of PERF_GUEST_MEDIATED_PMI_VECTOR. Functionally, having KVM directly own the vector would be fine (both KVM and perf will be completely aware of when a mediated PMU is active), but would lead to an undesirable split in ownership: perf would be responsible for installing the vector, but not handling the resulting IRQs. Add a new perf_guest_info_callbacks hook (and static call) to allow KVM to register its handler with perf when running guests with mediated PMUs. Note, because KVM always runs guests with host IRQs enabled, there is no danger of a PMI being delayed from the guest's perspective due to using a regular IRQ instead of an NMI. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://patch.msgid.link/20251206001720.468579-9-seanjc@google.com
2025-12-17perf: Add APIs to load/put guest mediated PMU contextKan Liang-0/+61
Add exported APIs to load/put a guest mediated PMU context. KVM will load the guest PMU shortly before VM-Enter, and put the guest PMU shortly after VM-Exit. On the perf side of things, schedule out all exclude_guest events when the guest context is loaded, and schedule them back in when the guest context is put. I.e. yield the hardware PMU resources to the guest, by way of KVM. Note, perf is only responsible for managing host context. KVM is responsible for loading/storing guest state to/from hardware. [sean: shuffle patches around, write changelog] Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://patch.msgid.link/20251206001720.468579-8-seanjc@google.com