2025-08-27  KVM: x86/mmu: Handle guest page faults for guest_memfd with shared memory  (Ackerley Tng, 1 file, -4/+4)
Update the KVM MMU fault handler to service guest page faults for memory slots backed by guest_memfd with mmap support. For such slots, the MMU must always fault in pages directly from guest_memfd, bypassing the host's userspace_addr. This ensures that guest_memfd-backed memory is always handled through the guest_memfd specific faulting path, regardless of whether it's for private or non-private (shared) use cases. Additionally, rename kvm_mmu_faultin_pfn_private() to kvm_mmu_faultin_pfn_gmem(), as this function is now used to fault in pages from guest_memfd for both private and non-private memory, accommodating the new use cases. Co-developed-by: David Hildenbrand <david@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Ackerley Tng <ackerleytng@google.com> Co-developed-by: Fuad Tabba <tabba@google.com> Signed-off-by: Fuad Tabba <tabba@google.com> [sean: drop the helper] Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20250729225455.670324-17-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
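To make the routing concrete, here is a minimal C sketch of the dispatch logic described above; the struct layout and names (has_gmem, gmem_supports_mmap) are illustrative stand-ins, not KVM's actual symbols:

    /* Sketch: all private faults, and any fault on an mmap-capable
     * guest_memfd slot, are serviced from guest_memfd rather than
     * from the host's userspace_addr. */
    #include <stdbool.h>

    struct memslot { bool has_gmem; bool gmem_supports_mmap; };
    struct page_fault { struct memslot *slot; bool is_private; };

    static bool fault_from_gmem(const struct page_fault *f)
    {
            return f->is_private ||
                   (f->slot->has_gmem && f->slot->gmem_supports_mmap);
    }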
2025-08-27  KVM: x86/mmu: Extend guest_memfd's max mapping level to shared mappings  (Sean Christopherson, 7 files, -13/+19)
Rework kvm_mmu_max_mapping_level() to consult guest_memfd for all mappings, not just private mappings, so that hugepage support plays nice with the upcoming support for backing non-private memory with guest_memfd. In addition to getting the max order from guest_memfd for gmem-only memslots, update TDX's hook to effectively ignore shared mappings, as TDX's restrictions on page size only apply to Secure EPT mappings. Do nothing for SNP, as RMP restrictions apply to both private and shared memory. Suggested-by: Ackerley Tng <ackerleytng@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Fuad Tabba <tabba@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20250729225455.670324-16-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27  KVM: x86/mmu: Enforce guest_memfd's max order when recovering hugepages  (Sean Christopherson, 3 files, -35/+47)
Rework kvm_mmu_max_mapping_level() to provide the plumbing to consult guest_memfd (and relevant vendor code) when recovering hugepages, e.g. after disabling live migration. The flaw has existed since guest_memfd was originally added, but has gone unnoticed due to lack of guest_memfd support for hugepages or dirty logging. Don't actually call into guest_memfd at this time, as it's unclear as to what the API should be. Ideally, KVM would simply use kvm_gmem_get_pfn(), but invoking kvm_gmem_get_pfn() would lead to sleeping in atomic context if guest_memfd needed to allocate memory (mmu_lock is held). Luckily, the path isn't actually reachable, so just add a TODO and WARN to ensure the functionality is added alongside guest_memfd hugepage support, and punt the guest_memfd API design question to the future. Note, calling kvm_mem_is_private() in the non-fault path is safe, so long as mmu_lock is held, as hugepage recovery operates on shadow-present SPTEs, i.e. calling kvm_mmu_max_mapping_level() with @fault=NULL is mutually exclusive with kvm_vm_set_mem_attributes() changing the PRIVATE attribute of the gfn. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Fuad Tabba <tabba@google.com> Message-ID: <20250729225455.670324-15-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27  KVM: x86/mmu: Hoist guest_memfd max level/order helpers "up" in mmu.c  (Sean Christopherson, 1 file, -36/+36)
Move kvm_max_level_for_order() and kvm_max_private_mapping_level() up in mmu.c so that they can be used by __kvm_mmu_max_mapping_level(). Opportunistically drop the "inline" from kvm_max_level_for_order(). No functional change intended. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Ackerley Tng <ackerleytng@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Fuad Tabba <tabba@google.com> Message-ID: <20250729225455.670324-14-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27  KVM: x86/mmu: Rename .private_max_mapping_level() to .gmem_max_mapping_level()  (Ackerley Tng, 9 files, -12/+12)
Rename kvm_x86_ops.private_max_mapping_level() to .gmem_max_mapping_level() in anticipation of extending guest_memfd support to non-private memory. No functional change intended. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Ackerley Tng <ackerleytng@google.com> Signed-off-by: Fuad Tabba <tabba@google.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Message-ID: <20250729225455.670324-13-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27  KVM: guest_memfd: Track guest_memfd mmap support in memslot  (Fuad Tabba, 2 files, -1/+12)
Add a new internal flag, KVM_MEMSLOT_GMEM_ONLY, to the top half of memslot->flags (which makes it strictly for KVM's internal use). This flag tracks when a guest_memfd-backed memory slot supports host userspace mmap operations, which implies that all memory, not just private memory for CoCo VMs, is consumed through guest_memfd: "gmem only". This optimization avoids repeatedly checking the underlying guest_memfd file for mmap support, which would otherwise require taking and releasing a reference on the file for each check. By caching this information directly in the memslot, we reduce overhead and simplify the logic involved in handling guest_memfd-backed pages for host mappings. Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: David Hildenbrand <david@redhat.com> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250729225455.670324-12-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
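The caching trick is simple enough to show in a few lines; a hedged sketch, assuming a 32-bit flags word whose upper 16 bits are reserved for internal-only flags (the exact bit chosen here is illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    /* Internal-only flag lives in the top half of memslot->flags. */
    #define KVM_MEMSLOT_GMEM_ONLY (UINT32_C(1) << 16)

    struct memslot { uint32_t flags; };

    /* Cached once at memslot creation, so the hot path never has to
     * take a reference on the guest_memfd file to answer this. */
    static bool slot_is_gmem_only(const struct memslot *s)
    {
            return s->flags & KVM_MEMSLOT_GMEM_ONLY;
    }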
2025-08-27  KVM: guest_memfd: Add plumbing to host to map guest_memfd pages  (Fuad Tabba, 3 files, -0/+85)
Introduce the core infrastructure to enable host userspace to mmap() guest_memfd-backed memory. This is needed for several evolving KVM use cases:

* Non-CoCo VM backing: Allows VMMs like Firecracker to run guests entirely backed by guest_memfd, even for non-CoCo VMs [1]. This provides a unified memory management model and simplifies guest memory handling.

* Direct map removal for enhanced security: This is an important step for direct map removal of guest memory [2]. By allowing host userspace to fault in guest_memfd pages directly, we can avoid maintaining host kernel direct maps of guest memory. This provides additional hardening against Spectre-like transient execution attacks by removing a potential attack surface within the kernel.

* Future guest_memfd features: This also lays the groundwork for future enhancements to guest_memfd, such as supporting huge pages and enabling in-place sharing of guest memory with the host for CoCo platforms that permit it [3].

Enable the basic mmap and fault handling logic within guest_memfd, but hold off on allowing userspace to actually do mmap() until the architecture support is also in place.

[1] https://github.com/firecracker-microvm/firecracker/tree/feature/secret-hiding
[2] https://lore.kernel.org/linux-mm/cc1bb8e9bc3e1ab637700a4d3defeec95b55060a.camel@amazon.com
[3] https://lore.kernel.org/all/c1c9591d-218a-495c-957b-ba356c8f8e09@redhat.com/T/#u

Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Acked-by: David Hildenbrand <david@redhat.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
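As a rough userspace model of the fault path being added (not the kernel implementation), a page fault against a gmem mapping resolves to the gmem backing store itself rather than to a separate host mapping:

    #include <stddef.h>

    struct gmem_file { unsigned char *backing; size_t npages; size_t page_size; };

    /* Return the backing page for a faulting page offset, or NULL
     * (which a real fault handler would turn into SIGBUS). */
    static unsigned char *gmem_fault(struct gmem_file *f, size_t pgoff)
    {
            if (pgoff >= f->npages)
                    return NULL;
            return f->backing + pgoff * f->page_size;
    }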
2025-08-27  KVM: x86: Enable KVM_GUEST_MEMFD for all 64-bit builds  (Fuad Tabba, 4 files, -20/+9)
Enable KVM_GUEST_MEMFD for all KVM x86 64-bit builds, i.e. for "default" VM types when running on 64-bit KVM. This will allow using guest_memfd to back non-private memory for all VM shapes, by supporting mmap() on guest_memfd. Opportunistically clean up various conditionals that become tautologies once x86 selects KVM_GUEST_MEMFD more broadly. Specifically, because SW protected VMs, SEV, and TDX are all 64-bit only, private memory no longer needs to take explicit dependencies on KVM_GUEST_MEMFD, because it is effectively a prerequisite. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Fuad Tabba <tabba@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250729225455.670324-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27  KVM: Fix comment that refers to kvm uapi header path  (Fuad Tabba, 1 file, -1/+1)
The comment that points to the path where the user-visible memslot flags are defined refers to an outdated path and has a typo. Update the comment to refer to the correct path. Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250729225455.670324-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27  KVM: Fix comments that refer to slots_lock  (Fuad Tabba, 2 files, -2/+2)
Fix comments so that they refer to slots_lock instead of slots_locks (remove trailing s). Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250729225455.670324-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27  KVM: Rename kvm_slot_can_be_private() to kvm_slot_has_gmem()  (Fuad Tabba, 4 files, -6/+6)
Rename kvm_slot_can_be_private() to kvm_slot_has_gmem() to improve clarity and accurately reflect its purpose. The function kvm_slot_can_be_private() was previously used to check if a given kvm_memory_slot is backed by guest_memfd. However, its name implied that the memory in such a slot was exclusively "private". As guest_memfd support expands to include non-private memory (e.g., shared host mappings), it's important to remove this association. The new name, kvm_slot_has_gmem(), states that the slot is backed by guest_memfd without making assumptions about the memory's privacy attributes. Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Co-developed-by: David Hildenbrand <david@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250729225455.670324-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
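The renamed predicate is, in spirit, just a presence check; a minimal sketch under that assumption (types simplified, not KVM's actual layout):

    #include <stdbool.h>
    #include <stddef.h>

    struct gmem_binding { int fd; };
    struct memslot { struct gmem_binding *gmem; };

    /* "Has guest_memfd" - says nothing about private vs. shared. */
    static bool kvm_slot_has_gmem(const struct memslot *slot)
    {
            return slot->gmem != NULL;
    }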
2025-08-27  KVM: Rename CONFIG_KVM_GENERIC_PRIVATE_MEM to CONFIG_HAVE_KVM_ARCH_GMEM_POPULATE  (Fuad Tabba, 4 files, -11/+16)
The original name was vague regarding its functionality. This Kconfig option specifically enables and gates the kvm_gmem_populate() function, which is responsible for populating a GPA range with guest data. The new name, HAVE_KVM_ARCH_GMEM_POPULATE, describes the purpose of the option: to enable arch-specific guest_memfd population mechanisms. It also follows the same pattern as the other HAVE_KVM_ARCH_* configuration options. This improves clarity for developers and ensures the name accurately reflects the functionality it controls, especially as guest_memfd support expands beyond purely "private" memory scenarios. Temporarily keep KVM_GENERIC_PRIVATE_MEM as an x86-only config so as to minimize churn, and to hopefully make it easier to see what features require HAVE_KVM_ARCH_GMEM_POPULATE. On that note, omit GMEM_POPULATE for KVM_X86_SW_PROTECTED_VM, as regular ol' memset() suffices for software-protected VMs. As for KVM_GENERIC_PRIVATE_MEM, a future change will select KVM_GUEST_MEMFD for all 64-bit KVM builds, at which point the intermediate config will become obsolete and can/will be dropped. Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Co-developed-by: David Hildenbrand <david@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Fuad Tabba <tabba@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250729225455.670324-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27  KVM: x86: Select TDX's KVM_GENERIC_xxx dependencies iff CONFIG_KVM_INTEL_TDX=y  (Sean Christopherson, 1 file, -2/+2)
Select KVM_GENERIC_PRIVATE_MEM and KVM_GENERIC_MEMORY_ATTRIBUTES directly from KVM_INTEL_TDX, i.e. if and only if TDX support is fully enabled in KVM. There is no need to enable KVM's private memory support just because the core kernel's INTEL_TDX_HOST is enabled. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Message-ID: <20250729225455.670324-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27  KVM: x86: Select KVM_GENERIC_PRIVATE_MEM directly from KVM_SW_PROTECTED_VM  (Sean Christopherson, 1 file, -1/+1)
Now that KVM_SW_PROTECTED_VM doesn't have a hidden dependency on KVM_X86, select KVM_GENERIC_PRIVATE_MEM from within KVM_SW_PROTECTED_VM instead of conditionally selecting it from KVM_X86. No functional change intended. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Message-ID: <20250729225455.670324-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27  KVM: x86: Have all vendor neutral sub-configs depend on KVM_X86, not just KVM  (Sean Christopherson, 1 file, -8/+8)
Make all vendor neutral KVM x86 configs depend on KVM_X86, not just KVM, i.e. gate them on at least one vendor module being enabled and thus on kvm.ko actually being built. Depending on just KVM allows the user to select the configs even though they won't actually take effect, and more importantly, makes it all too easy to create unmet dependencies. E.g. KVM_GENERIC_PRIVATE_MEM can't be selected by KVM_SW_PROTECTED_VM, because the KVM_GENERIC_MMU_NOTIFIER dependency is selected by KVM_X86.

Hiding all sub-configs when neither KVM_AMD nor KVM_INTEL is selected also helps communicate to the user that nothing "interesting" is going on, e.g.

  --- Virtualization
  <M>   Kernel-based Virtual Machine (KVM) support
  < >     KVM for Intel (and compatible) processors support
  < >     KVM for AMD processors support

Fixes: ea4290d77bda ("KVM: x86: leave kvm.ko out of the build if no vendor module is requested")
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Message-ID: <20250729225455.670324-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27  KVM: Rename CONFIG_KVM_PRIVATE_MEM to CONFIG_KVM_GUEST_MEMFD  (Fuad Tabba, 6 files, -17/+17)
Rename the Kconfig option CONFIG_KVM_PRIVATE_MEM to CONFIG_KVM_GUEST_MEMFD. The original name implied that the feature only supported "private" memory. However, CONFIG_KVM_PRIVATE_MEM enables guest_memfd in general, which is not exclusively for private memory. Subsequent patches in this series will add guest_memfd support for non-CoCo VMs, whose memory is not private. Renaming the Kconfig option to CONFIG_KVM_GUEST_MEMFD more accurately reflects its broader scope as the main Kconfig option for all guest_memfd-backed memory. This provides clearer semantics for the option and avoids confusion as new features are introduced. Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Co-developed-by: David Hildenbrand <david@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250729225455.670324-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-24  Linux 6.17-rc3 (tag: v6.17-rc3)  (Linus Torvalds, 1 file, -1/+1)
2025-08-22  ftrace: Also allocate and copy hash for reading of filter files  (Steven Rostedt, 1 file, -9/+10)
Currently the reader of set_ftrace_filter and set_ftrace_notrace just adds the pointer to the global tracer hash to its iterator. Unlike the writer that allocates a copy of the hash, the reader keeps the pointer to the filter hashes. This is problematic because this pointer is static across function calls that release the locks that can update the global tracer hashes. This can cause UAF and similar bugs. Allocate and copy the hash for reading the filter files like it is done for the writers. This not only fixes UAF bugs, but also makes the code a bit simpler as it doesn't have to differentiate when to free the iterator's hash between writers and readers. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/20250822183606.12962cc3@batman.local.home Fixes: c20489dad156 ("ftrace: Assign iter->hash to filter or notrace hashes on seq read") Closes: https://lore.kernel.org/all/20250813023044.2121943-1-wutengda@huaweicloud.com/ Closes: https://lore.kernel.org/all/20250822192437.GA458494@ax162/ Reported-by: Tengda Wu <wutengda@huaweicloud.com> Tested-by: Tengda Wu <wutengda@huaweicloud.com> Tested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
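The shape of the fix is the usual snapshot-under-lock pattern; a minimal sketch, with the hash layout simplified to a flat array (not ftrace's actual structures):

    #include <stdlib.h>
    #include <string.h>

    struct hash { size_t n; unsigned long *entries; };

    /* The reader takes its own copy, so writers that later free or swap
     * the global hash cannot pull the rug out from under the iterator. */
    static struct hash *hash_copy(const struct hash *src)
    {
            struct hash *h = malloc(sizeof(*h));

            if (!h)
                    return NULL;
            h->n = src->n;
            h->entries = malloc(src->n * sizeof(*h->entries));
            if (!h->entries) {
                    free(h);
                    return NULL;
            }
            memcpy(h->entries, src->entries, src->n * sizeof(*h->entries));
            return h;       /* owned by the iterator, freed on release */
    }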
2025-08-22  ftrace: Fix potential warning in trace_printk_seq during ftrace_dump  (Tengda Wu, 1 file, -2/+2)
When calling ftrace_dump_one() concurrently with reading trace_pipe, a WARN_ON_ONCE() in trace_printk_seq() can be triggered due to a race condition. The issue occurs because:

  CPU0 (ftrace_dump)                            CPU1 (reader)
  echo z > /proc/sysrq-trigger

  !trace_empty(&iter)
  trace_iterator_reset(&iter) <- len = size = 0
                                                cat /sys/kernel/tracing/trace_pipe
  trace_find_next_entry_inc(&iter)
    __find_next_entry
      ring_buffer_empty_cpu <- all empty
    return NULL
  trace_printk_seq(&iter.seq)
    WARN_ON_ONCE(s->seq.len >= s->seq.size)

In the context between trace_empty() and trace_find_next_entry_inc() during ftrace_dump, the ring buffer data was consumed by other readers. This caused trace_find_next_entry_inc to return NULL, failing to populate `iter.seq`. At this point, due to the prior trace_iterator_reset, both `iter.seq.len` and `iter.seq.size` were set to 0. Since they are equal, the WARN_ON_ONCE condition is triggered.

Move the trace_printk_seq() into the if block that checks to make sure the return value of trace_find_next_entry_inc() is non-NULL in ftrace_dump_one(), ensuring the 'iter.seq' is properly populated before subsequent operations.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ingo Molnar <mingo@elte.hu>
Link: https://lore.kernel.org/20250822033343.3000289-1-wutengda@huaweicloud.com
Fixes: d769041f8653 ("ring_buffer: implement new locking")
Signed-off-by: Tengda Wu <wutengda@huaweicloud.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
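The control-flow change is small; a sketch with hypothetical names standing in for the trace iterator helpers (not the actual kernel loop):

    struct trace_iter;
    extern struct trace_iter *find_next_entry_inc(struct trace_iter *it);
    extern void printk_seq(struct trace_iter *it);

    void dump_one(struct trace_iter *it)
    {
            /* Only print once the lookup proved there is an entry; before
             * the fix, printk_seq() also ran on the NULL path, where the
             * freshly reset seq (len == size == 0) trips the warning. */
            while (find_next_entry_inc(it))
                    printk_seq(it);
    }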
2025-08-22  fgraph: Copy args in intermediate storage with entry  (Steven Rostedt, 1 file, -6/+16)
The output of the function graph tracer has two ways to display its entries. One way for leaf functions with no events recorded within them, and the other is for functions with events recorded inside it. As function graph has an entry and exit event, to simplify the output of leaf functions it combines the two, whereas non leaf functions are separate:

  2)               |  invoke_rcu_core() {
  2)               |    raise_softirq() {
  2)   0.391 us    |      __raise_softirq_irqoff();
  2)   1.191 us    |    }
  2)   2.086 us    |  }

The __raise_softirq_irqoff() function above is really two events that were merged into one. Otherwise it would have looked like:

  2)               |  invoke_rcu_core() {
  2)               |    raise_softirq() {
  2)               |      __raise_softirq_irqoff() {
  2)   0.391 us    |      }
  2)   1.191 us    |    }
  2)   2.086 us    |  }

In order to do this merge, the reading of the trace output file needs to look at the next event before printing. But since the pointer to the event is on the ring buffer, it needs to save the entry event before it looks at the next event as the next event goes out of focus as soon as a new event is read from the ring buffer. After it reads the next event, it will print the entry event with either the '{' (non leaf) or ';' and timestamps (leaf). The iterator used to read the trace file has storage for this event.

The problem happens when the function graph tracer has arguments attached to the entry event as the entry now has a variable length "args" field. This field only gets set when the funcargs option is used. But the args are not recorded in this temp data and garbage could be printed. The entry field is copied via:

  data->ent = *curr;

Where "curr" is the entry field. But this method only saves the non variable length fields from the structure.

Add a helper structure to the iterator data that adds the max args size to the data storage in the iterator. Then simply copy the entire entry into this storage (with size protection).

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/20250820195522.51d4a268@gandalf.local.home
Reported-by: Sasha Levin <sashal@kernel.org>
Tested-by: Sasha Levin <sashal@kernel.org>
Closes: https://lore.kernel.org/all/aJaxRVKverIjF4a6@lappy/
Fixes: ff5c9c576e75 ("ftrace: Add support for function argument to graph tracer")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
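A sketch of the helper-structure idea: give the saved entry a tail big enough for the largest args payload and replace the struct assignment with a bounded memcpy(). The 128-byte cap and names are illustrative, not the kernel's:

    #include <string.h>

    #define MAX_ARGS_BYTES 128

    struct ent { unsigned long func; /* + variable-length args */ };

    struct iter_storage {
            struct ent ent;
            unsigned char args[MAX_ARGS_BYTES]; /* backing for the tail */
    };

    static void save_entry(struct iter_storage *d, const struct ent *curr,
                           size_t size)
    {
            if (size > sizeof(*d))
                    size = sizeof(*d);      /* size protection */
            memcpy(&d->ent, curr, size);    /* instead of d->ent = *curr,
                                             * which drops the args tail */
    }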
2025-08-22  mips: lantiq: xway: sysctrl: rename the etop node  (Aleksander Jan Bajkowski, 2 files, -6/+6)
The binding requires a node name matching '^ethernet@[0-9a-f]+$'. This patch changes the clock name from "etop" to "ethernet". This fixes the following warning:

  arch/mips/boot/dts/lantiq/danube_easy50712.dtb: etop@e180000 (lantiq,etop-xway): $nodename:0: 'etop@e180000' does not match '^ethernet@[0-9a-f]+$'
    from schema $id: http://devicetree.org/schemas/net/lantiq,etop-xway.yaml#

Fixes: dac0bad93741 ("dt-bindings: net: lantiq,etop-xway: Document Lantiq Xway ETOP bindings")
Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
Acked-by: Jakub Kicinski <kuba@kernel.org>
2025-08-22  mips: dts: lantiq: danube: add missing burst length property  (Aleksander Jan Bajkowski, 1 file, -0/+3)
The upstream dts lacks the lantiq,{rx/tx}-burst-length property. Other issues were also fixed:

  arch/mips/boot/dts/lantiq/danube_easy50712.dtb: etop@e180000 (lantiq,etop-xway): 'interrupt-names' is a required property
    from schema $id: http://devicetree.org/schemas/net/lantiq,etop-xway.yaml#
  arch/mips/boot/dts/lantiq/danube_easy50712.dtb: etop@e180000 (lantiq,etop-xway): 'lantiq,tx-burst-length' is a required property
    from schema $id: http://devicetree.org/schemas/net/lantiq,etop-xway.yaml#
  arch/mips/boot/dts/lantiq/danube_easy50712.dtb: etop@e180000 (lantiq,etop-xway): 'lantiq,rx-burst-length' is a required property
    from schema $id: http://devicetree.org/schemas/net/lantiq,etop-xway.yaml#

Fixes: 14d4e308e0aa ("net: lantiq: configure the burst length in ethernet drivers")
Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
Acked-by: Jakub Kicinski <kuba@kernel.org>
2025-08-22  iommu/riscv: prevent NULL deref in iova_to_phys  (XianLiang Huang, 1 file, -1/+1)
The riscv_iommu_pte_fetch() function returns either NULL for unmapped/never-mapped iova, or a valid leaf pte pointer that requires no further validation. riscv_iommu_iova_to_phys() failed to handle NULL returns. Prevent null pointer dereference in riscv_iommu_iova_to_phys(), and remove the pte validation. Fixes: 488ffbf18171 ("iommu/riscv: Paging domain support") Cc: Tomasz Jeznach <tjeznach@rivosinc.com> Signed-off-by: XianLiang Huang <huangxianliang@lanxincomputing.com> Link: https://lore.kernel.org/r/20250820072248.312-1-huangxianliang@lanxincomputing.com Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
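The fix itself is a classic guard clause; a simplified sketch (types reduced to bare integers, not the driver's actual definitions):

    #include <stdint.h>

    typedef uint64_t pte_t;

    /* Returns NULL for an unmapped/never-mapped iova, or a valid leaf pte. */
    extern pte_t *pte_fetch(uint64_t iova);

    uint64_t iova_to_phys(uint64_t iova)
    {
            pte_t *pte = pte_fetch(iova);

            if (!pte)                       /* previously dereferenced unchecked */
                    return 0;
            return *pte & ~0xfffULL;        /* leaf pte: no further validation */
    }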
2025-08-22  iommu/virtio: Make instance lookup robust  (Robin Murphy, 1 file, -6/+9)
Much like arm-smmu in commit 7d835134d4e1 ("iommu/arm-smmu: Make instance lookup robust"), virtio-iommu appears to have the same issue where iommu_device_register() makes the IOMMU instance visible to other API callers (including itself) straight away, but internally the instance isn't ready to recognise itself for viommu_probe_device() to work correctly until after viommu_probe() has returned. This matters a lot more now that bus_iommu_probe() has the DT/VIOT knowledge to probe client devices the way that was always intended. Tweak the lookup and initialisation in much the same way as for arm-smmu, to ensure that what we register is functional and ready to go. Cc: stable@vger.kernel.org Fixes: bcb81ac6ae3c ("iommu: Get DT/ACPI parsing into the proper probe path") Signed-off-by: Robin Murphy <robin.murphy@arm.com> Tested-by: Eric Auger <eric.auger@redhat.com> Link: https://lore.kernel.org/r/308911aaa1f5be32a3a709996c7bd6cf71d30f33.1755190036.git.robin.murphy@arm.com Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-08-22  iommu/arm-smmu-v3: Fix smmu_domain->nr_ats_masters decrement  (Nicolin Chen, 1 file, -1/+1)
The arm_smmu_attach_commit() updates master->ats_enabled before calling arm_smmu_remove_master_domain() that is supposed to clean up everything in the old domain, including the old domain's nr_ats_masters. So, it is supposed to use the old ats_enabled state of the device, not an updated state.

This isn't a problem if switching between two domains where:
 - old ats_enabled = false; new ats_enabled = false
 - old ats_enabled = true; new ats_enabled = true
but can fail cases where:
 - old ats_enabled = false; new ats_enabled = true
   (old domain should keep the counter but incorrectly decreased it)
 - old ats_enabled = true; new ats_enabled = false
   (old domain needed to decrease the counter but incorrectly missed it)

Update master->ats_enabled after arm_smmu_remove_master_domain() to fix this.

Fixes: 7497f4211f4f ("iommu/arm-smmu-v3: Make changing domains be hitless for ATS")
Cc: stable@vger.kernel.org
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Acked-by: Will Deacon <will@kernel.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Link: https://lore.kernel.org/r/20250801030127.2006979-1-nicolinc@nvidia.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
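The ordering bug generalizes beyond SMMU; a small model of why the old state must drive the cleanup (names simplified, not the driver's code):

    #include <stdbool.h>

    struct domain { int nr_ats_masters; };
    struct master { bool ats_enabled; struct domain *domain; };

    static void attach_commit(struct master *m, struct domain *new_dom,
                              bool new_ats)
    {
            /* Decrement with the OLD ats_enabled: that is the state the
             * old domain's counter was incremented under. */
            if (m->ats_enabled)
                    m->domain->nr_ats_masters--;

            m->domain = new_dom;
            m->ats_enabled = new_ats;       /* updated only after cleanup */
            if (m->ats_enabled)
                    new_dom->nr_ats_masters++;
    }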
2025-08-21  io_uring: clear ->async_data as part of normal init  (Jens Axboe, 1 file, -0/+1)
Opcode handlers like POLL_ADD will use ->async_data as the pointer for double poll handling, which is a bit different than the usual case where it's strictly gated by the REQ_F_ASYNC_DATA flag. Be a bit more proactive in handling ->async_data, and clear it to NULL as part of regular init. Init is touching that cacheline anyway, so might as well clear it. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-21  io_uring/futex: ensure io_futex_wait() cleans up properly on failure  (Jens Axboe, 1 file, -0/+3)
The io_futex_data is allocated upfront and assigned to the io_kiocb async_data field, but the request isn't marked with REQ_F_ASYNC_DATA at that point. Those two should always go together, as the flag tells io_uring whether the field is valid or not. Additionally, on failure cleanup, the futex handler frees the data but does not clear ->async_data. Clear the data and the flag in the error path as well. Thanks to Trend Micro Zero Day Initiative and particularly ReDress for reporting this. Cc: stable@vger.kernel.org Fixes: 194bb58c6090 ("io_uring: add support for futex wake and wait") Signed-off-by: Jens Axboe <axboe@kernel.dk>
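The invariant being restored ("flag and field travel together") looks like this in miniature; the bit position of REQ_F_ASYNC_DATA here is illustrative:

    #include <stdlib.h>

    #define REQ_F_ASYNC_DATA (1u << 0)      /* illustrative bit */

    struct req { unsigned int flags; void *async_data; };

    static void futex_fail_cleanup(struct req *r)
    {
            free(r->async_data);
            r->async_data = NULL;           /* no dangling pointer... */
            r->flags &= ~REQ_F_ASYNC_DATA;  /* ...and no stale flag */
    }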
2025-08-21  drm/xe: Fix vm_bind_ioctl double free bug  (Christoph Manszewski, 1 file, -1/+2)
If the argument check during an array bind fails, the bind_ops are freed twice as seen below. Fix this by setting bind_ops to NULL after freeing.

  ==================================================================
  BUG: KASAN: double-free in xe_vm_bind_ioctl+0x1b2/0x21f0 [xe]
  Free of addr ffff88813bb9b800 by task xe_vm/14198

  CPU: 5 UID: 0 PID: 14198 Comm: xe_vm Not tainted 6.16.0-xe-eudebug-cmanszew+ #520 PREEMPT(full)
  Hardware name: Intel Corporation Alder Lake Client Platform/AlderLake-P DDR5 RVP, BIOS ADLPFWI1.R00.2411.A02.2110081023 10/08/2021
  Call Trace:
   <TASK>
   dump_stack_lvl+0x82/0xd0
   print_report+0xcb/0x610
   ? __virt_addr_valid+0x19a/0x300
   ? xe_vm_bind_ioctl+0x1b2/0x21f0 [xe]
   kasan_report_invalid_free+0xc8/0xf0
   ? xe_vm_bind_ioctl+0x1b2/0x21f0 [xe]
   ? xe_vm_bind_ioctl+0x1b2/0x21f0 [xe]
   check_slab_allocation+0x102/0x130
   kfree+0x10d/0x440
   ? should_fail_ex+0x57/0x2f0
   ? xe_vm_bind_ioctl+0x1b2/0x21f0 [xe]
   xe_vm_bind_ioctl+0x1b2/0x21f0 [xe]
   ? __pfx_xe_vm_bind_ioctl+0x10/0x10 [xe]
   ? __lock_acquire+0xab9/0x27f0
   ? lock_acquire+0x165/0x300
   ? drm_dev_enter+0x53/0xe0 [drm]
   ? find_held_lock+0x2b/0x80
   ? drm_dev_exit+0x30/0x50 [drm]
   ? drm_ioctl_kernel+0x128/0x1c0 [drm]
   drm_ioctl_kernel+0x128/0x1c0 [drm]
   ? __pfx_xe_vm_bind_ioctl+0x10/0x10 [xe]
   ? find_held_lock+0x2b/0x80
   ? __pfx_drm_ioctl_kernel+0x10/0x10 [drm]
   ? should_fail_ex+0x57/0x2f0
   ? __pfx_xe_vm_bind_ioctl+0x10/0x10 [xe]
   drm_ioctl+0x352/0x620 [drm]
   ? __pfx_drm_ioctl+0x10/0x10 [drm]
   ? __pfx_rpm_resume+0x10/0x10
   ? do_raw_spin_lock+0x11a/0x1b0
   ? find_held_lock+0x2b/0x80
   ? __pm_runtime_resume+0x61/0xc0
   ? rcu_is_watching+0x20/0x50
   ? trace_irq_enable.constprop.0+0xac/0xe0
   xe_drm_ioctl+0x91/0xc0 [xe]
   __x64_sys_ioctl+0xb2/0x100
   ? rcu_is_watching+0x20/0x50
   do_syscall_64+0x68/0x2e0
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7fa9acb24ded

Fixes: b43e864af0d4 ("drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_CPU_ADDR_MIRROR")
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Christoph Manszewski <christoph.manszewski@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250813101231.196632-2-christoph.manszewski@intel.com
(cherry picked from commit a01b704527c28a2fd43a17a85f8996b75ec8492a)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
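The fix follows the standard free-then-NULL idiom, shown here generically (not the xe driver's actual code):

    #include <stdlib.h>

    struct bind_state { void *bind_ops; };

    static void free_bind_ops(struct bind_state *s)
    {
            free(s->bind_ops);
            s->bind_ops = NULL;     /* a later cleanup path's free(NULL)
                                     * is now a harmless no-op */
    }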
2025-08-21  drm/xe: Move ASID allocation and user PT BO tracking into xe_vm_create  (Piotr Piórkowski, 1 file, -19/+15)
Currently, ASID assignment for user VMs and page-table BO accounting for client memory tracking are performed in xe_vm_create_ioctl. To consolidate VM object initialization, move this logic to xe_vm_create.

v2:
 - removed unnecessary duplicate BO tracking code
 - using the local variable xef to verify whether the VM is being created by userspace

Fixes: 658a1c8e0a66 ("drm/xe: Assign ioctl xe file handler to vm in xe_vm_create")
Suggested-by: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://lore.kernel.org/r/20250811104358.2064150-3-piotr.piorkowski@intel.com
Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
(cherry picked from commit 30e0c3f43a414616e0b6ca76cf7f7b2cd387e1d4)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
[Rodrigo: Added fixes tag]
2025-08-21  netfilter: nf_reject: don't leak dst refcount for loopback packets  (Florian Westphal, 2 files, -7/+4)
Recent patches to add a WARN() when replacing the skb dst entry found an old bug:

  WARNING: include/linux/skbuff.h:1165 skb_dst_check_unset include/linux/skbuff.h:1164 [inline]
  WARNING: include/linux/skbuff.h:1165 skb_dst_set include/linux/skbuff.h:1210 [inline]
  WARNING: include/linux/skbuff.h:1165 nf_reject_fill_skb_dst+0x2a4/0x330 net/ipv4/netfilter/nf_reject_ipv4.c:234
  [..]
  Call Trace:
   nf_send_unreach+0x17b/0x6e0 net/ipv4/netfilter/nf_reject_ipv4.c:325
   nft_reject_inet_eval+0x4bc/0x690 net/netfilter/nft_reject_inet.c:27
   expr_call_ops_eval net/netfilter/nf_tables_core.c:237 [inline]
   ..

This is because the blamed commit forgot about loopback packets. Such packets already have a dst_entry attached, even at PRE_ROUTING stage. Instead of checking the hook, just check if the skb already has a route attached to it.

Fixes: f53b9b0bdc59 ("netfilter: introduce support for reject at prerouting stage")
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20250820123707.10671-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21  s390/hypfs: Enable limited access during lockdown  (Peter Oberparleiter, 1 file, -1/+2)
When kernel lockdown is active, debugfs_locked_down() blocks access to hypfs files that register ioctl callbacks, even if the ioctl interface is not required for a function. This unnecessarily breaks userspace tools that only rely on read operations. Resolve this by registering a minimal set of file operations during lockdown, avoiding ioctl registration and preserving access for affected tooling. Note that this change restores hypfs functionality when lockdown is active from early boot (e.g. via lockdown=integrity kernel parameter), but does not apply to scenarios where lockdown is enabled dynamically while Linux is running. Tested-by: Mete Durlu <meted@linux.ibm.com> Reviewed-by: Vasily Gorbik <gor@linux.ibm.com> Fixes: 5496197f9b08 ("debugfs: Restrict debugfs when the kernel is locked down") Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2025-08-21  s390/hypfs: Avoid unnecessary ioctl registration in debugfs  (Peter Oberparleiter, 1 file, -7/+11)
Currently, hypfs registers ioctl callbacks for all debugfs files, despite only one file requiring them. This leads to unintended exposure of unused interfaces to user space and can trigger side effects such as restricted access when kernel lockdown is enabled. Restrict ioctl registration to only those files that implement ioctl functionality to avoid interface clutter and unnecessary access restrictions. Tested-by: Mete Durlu <meted@linux.ibm.com> Reviewed-by: Vasily Gorbik <gor@linux.ibm.com> Fixes: 5496197f9b08 ("debugfs: Restrict debugfs when the kernel is locked down") Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2025-08-21  ALSA: usb-audio: Use correct sub-type for UAC3 feature unit validation  (Takashi Iwai, 1 file, -1/+1)
The entry of the validators table for UAC3 feature unit is defined with a wrong sub-type UAC_FEATURE (= 0x06) while it should have been UAC3_FEATURE (= 0x07). This patch corrects the entry value. Fixes: 57f8770620e9 ("ALSA: usb-audio: More validations of descriptor units") Link: https://patch.msgid.link/20250821150835.8894-1-tiwai@suse.de Signed-off-by: Takashi Iwai <tiwai@suse.de>
2025-08-21  net/mlx5e: Preserve shared buffer capacity during headroom updates  (Armen Ratner, 1 file, -10/+8)
When port buffer headroom changes, port_update_shared_buffer() recalculates the shared buffer size and splits it in a 3:1 ratio (lossy:lossless).

Currently, the calculation is:
  lossless = shared / 4;
  lossy = (shared / 4) * 3;

Meaning, the calculation dropped the remainder of shared % 4 due to integer division, unintentionally reducing the total shared buffer by up to three cells on each update. Over time, this could shrink the buffer below usable size.

Fix it by changing the calculation to:
  lossless = shared / 4;
  lossy = shared - lossless;

This retains all buffer cells while still approximating the intended 3:1 split, preventing capacity loss over time.

While at it, perform headroom calculations in units of cells rather than in bytes for more accurate calculations avoiding extra divisions.

Fixes: a440030d8946 ("net/mlx5e: Update shared buffer along with device buffer changes")
Signed-off-by: Armen Ratner <armeng@nvidia.com>
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Alexei Lazar <alazar@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/20250820133209.389065-9-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
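The arithmetic of the fix can be checked directly: with shared = 103 cells, the old split yields 25 + 75 = 100 (three cells silently lost), while the new one yields 25 + 78 = 103. A minimal sketch of the new split:

    static void split_shared(unsigned int shared,
                             unsigned int *lossless, unsigned int *lossy)
    {
            *lossless = shared / 4;
            *lossy = shared - *lossless;    /* absorbs the shared % 4
                                             * remainder into lossy */
    }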
2025-08-21  net/mlx5e: Query FW for buffer ownership  (Alexei Lazar, 4 files, -4/+31)
The SW currently saves local buffer ownership when setting the buffer. This means that the SW assumes it has ownership of the buffer after the command is set. If setting the buffer fails and we remain in FW ownership, the local buffer ownership state incorrectly remains as SW-owned. This leads to incorrect behavior in subsequent PFC commands, causing failures. Instead of saving local buffer ownership in SW, query the FW for buffer ownership when setting the buffer. This ensures that the buffer ownership state is accurately reflected, avoiding the issues caused by incorrect ownership states. Fixes: ecdf2dadee8e ("net/mlx5e: Receive buffer support for DCBX") Signed-off-by: Alexei Lazar <alazar@nvidia.com> Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250820133209.389065-8-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21  net/mlx5: Restore missing scheduling node cleanup on vport enable failure  (Carolina Jubran, 1 file, -0/+1)
Restore the __esw_qos_free_node() call removed by the offending commit. Fixes: 97733d1e00a0 ("net/mlx5: Add traffic class scheduling support for vport QoS") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250820133209.389065-7-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21  net/mlx5: Fix QoS reference leak in vport enable error path  (Carolina Jubran, 1 file, -1/+3)
Add missing esw_qos_put() call when __esw_qos_alloc_node() fails in mlx5_esw_qos_vport_enable(). Fixes: be034baba83e ("net/mlx5: Make vport QoS enablement more flexible for future extensions") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250820133209.389065-6-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21  net/mlx5: Destroy vport QoS element when no configuration remains  (Carolina Jubran, 1 file, -8/+49)
If a VF has been configured and the user later clears all QoS settings, the vport element remains in the firmware QoS tree. This leads to inconsistent behavior compared to VFs that were never configured, since the FW assumes that unconfigured VFs are outside the QoS hierarchy. As a result, the bandwidth share across VFs may differ, even though none of them appear to have any configuration. Align the driver behavior with the FW expectation by destroying the vport QoS element when all configurations are removed. Fixes: c9497c98901c ("net/mlx5: Add support for setting VF min rate") Fixes: cf7e73770d1b ("net/mlx5: Manage TC arbiter nodes and implement full support for tc-bw") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://patch.msgid.link/20250820133209.389065-5-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21  net/mlx5e: Preserve tc-bw during parent changes  (Carolina Jubran, 1 file, -12/+12)
When changing parent of a node/leaf with tc-bw configured, the code saves and restores tc-bw values. However, it was reading the converted hardware bw_share values (where 0 becomes 1) instead of the original user values, causing incorrect tc-bw calculations after parent change. Store original tc-bw values in the node structure and use them directly for save/restore operations. Fixes: cf7e73770d1b ("net/mlx5: Manage TC arbiter nodes and implement full support for tc-bw") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250820133209.389065-4-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21  net/mlx5: Remove default QoS group and attach vports directly to root TSAR  (Carolina Jubran, 2 files, -69/+33)
Currently, the driver creates a default group (`node0`) and attaches all vports to it unless the user explicitly sets a parent group. As a result, when a user configures tx_share on a group and tx_share on a VF, the expectation is for the group and the VF to share bandwidth relatively. However, since the VF is not connected to the same parent (but to the default node), the proportional share logic is not applied correctly. To fix this, remove the default group (`node0`) and instead connect vports directly to the root TSAR when no parent is specified. This ensures that vports and groups share the same root scheduler and their tx_share values are compared directly under the same hierarchy. Fixes: 0fe132eac38c ("net/mlx5: E-switch, Allow to add vports to rate groups") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250820133209.389065-3-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21  net/mlx5: Base ECVF devlink port attrs from 0  (Daniel Jurgens, 1 file, -1/+3)
Adjust the vport number by the base ECVF vport number so the port attributes start at 0. Previously the port attributes would start 1 after the maximum number of host VFs. Fixes: dc13180824b7 ("net/mlx5: Enable devlink port for embedded cpu VF vports") Signed-off-by: Daniel Jurgens <danielj@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250820133209.389065-2-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21  net: pse-pd: pd692x0: Skip power budget configuration when undefined  (Kory Maincent, 1 file, -0/+4)
If the power supply's power budget is not defined in the device tree, the current code still requests power and configures the PSE manager with a 0W power limit, which is undesirable behavior. Skip power budget configuration entirely when the budget is zero, avoiding unnecessary power requests and preventing invalid 0W limits from being set on the PSE manager. Fixes: 359754013e6a ("net: pse-pd: pd692x0: Add support for PSE PI priority feature") Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Acked-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/20250820133321.841054-1-kory.maincent@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21  net: pse-pd: pd692x0: Fix power budget leak in manager setup error path  (Kory Maincent, 1 file, -15/+44)
Fix a resource leak where manager power budgets were freed on both success and error paths during manager setup. Power budgets should only be freed on error paths after regulator registration or during driver removal. Refactor cleanup logic by extracting OF node cleanup and power budget freeing into separate helper functions for better maintainability. Fixes: 359754013e6a ("net: pse-pd: pd692x0: Add support for PSE PI priority feature") Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Link: https://patch.msgid.link/20250820132708.837255-1-kory.maincent@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21  Octeontx2-af: Skip overlap check for SPI field  (Hariprasad Kelam, 1 file, -2/+2)
Octeontx2/CN10K silicon supports generating a 256-bit key per packet. The specific fields to be extracted from a packet for key generation are configurable via a Key Extraction (MKEX) Profile. The AF driver scans the configured extraction profile to ensure that fields from upper layers do not overwrite fields from lower layers in the key.

Example packet field layout:
  LA: DMAC + SMAC
  LB: VLAN
  LC: IPv4/IPv6
  LD: TCP/UDP

Valid MKEX profile configuration:
  LA -> DMAC  -> key_offset[0-5]
  LC -> SIP   -> key_offset[20-23]
  LD -> SPORT -> key_offset[30-31]

Invalid MKEX profile configuration:
  LA -> DMAC  -> key_offset[0-5]
  LC -> SIP   -> key_offset[20-23]
  LD -> SPORT -> key_offset[2-3]   // Overlaps with DMAC field

In another scenario, if the MKEX profile is configured to extract the SPI field from both AH and ESP headers at the same key offset, the driver rejects this configuration. However, in regular traffic an IPsec packet carries either AH (LD) or ESP (LE), never both, so the two extractions can never actually collide. This patch relaxes the overlap check for this case.

Fixes: 12aa0a3b93f3 ("octeontx2-af: Harden rule validation.")
Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Link: https://patch.msgid.link/20250820063919.1463518-1-hkelam@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21  selftests: tls: add tests for zero-length records  (Jakub Kicinski, 1 file, -5/+295)
Test various combinations of zero-length records. Unfortunately, kernel cannot be coerced into producing those, so hardcode the ciphertext messages in the test. Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250820021952.143068-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21  tls: fix handling of zero-length records on the rx_list  (Jakub Kicinski, 1 file, -1/+6)
Each recvmsg() call must process either
 - only contiguous DATA records (any number of them)
 - one non-DATA record

If the next record has different type than what has already been processed we break out of the main processing loop. If the record has already been decrypted (which may be the case for TLS 1.3 where we don't know type until decryption) we queue the pending record to the rx_list. Next recvmsg() will pick it up from there.

Queuing the skb to rx_list after zero-copy decrypt is not possible, since in that case we decrypted directly to the user space buffer, and we don't have an skb to queue (darg.skb points to the ciphertext skb for access to metadata like length). Only data records are allowed zero-copy, and we break the processing loop after each non-data record. So we should never zero-copy and then find out that the record type has changed. The corner case we missed is when the initial record comes from rx_list, and it's zero length.

Reported-by: Muhammad Alifa Ramdhan <ramdhan@starlabs.sg>
Reported-by: Billy Jheng Bing-Jhong <billy@starlabs.sg>
Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser")
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250820021952.143068-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21  block: avoid cpu_hotplug_lock dependency on freeze_lock  (Nilay Shroff, 4 files, -28/+37)
A recent lockdep[1] splat observed while running blktest block/005 reveals a potential deadlock caused by the cpu_hotplug_lock dependency on ->freeze_lock. This dependency was introduced by commit 033b667a823e ("block: blk-rq-qos: guard rq-qos helpers by static key"). That change added a static key to avoid fetching q->rq_qos when neither blk-wbt nor blk-iolatency is configured. The static key dynamically patches kernel text to a NOP when disabled, eliminating overhead of fetching q->rq_qos in the I/O hot path. However, enabling a static key at runtime requires acquiring both cpu_hotplug_lock and jump_label_mutex. When this happens after the queue has already been frozen (i.e., while holding ->freeze_lock), it creates a locking dependency from cpu_hotplug_lock to ->freeze_lock, which leads to a potential deadlock reported by lockdep [1]. To resolve this, replace the static key mechanism with q->queue_flags: QUEUE_FLAG_QOS_ENABLED. This flag is evaluated in the fast path before accessing q->rq_qos. If the flag is set, we proceed to fetch q->rq_qos; otherwise, the access is skipped. Since q->queue_flags is commonly accessed in IO hotpath and resides in the first cacheline of struct request_queue, checking it imposes minimal overhead while eliminating the deadlock risk. This change avoids the lockdep splat without introducing performance regressions. [1] https://lore.kernel.org/linux-block/4fdm37so3o4xricdgfosgmohn63aa7wj3ua4e5vpihoamwg3ui@fq42f5q5t5ic/ Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Closes: https://lore.kernel.org/linux-block/4fdm37so3o4xricdgfosgmohn63aa7wj3ua4e5vpihoamwg3ui@fq42f5q5t5ic/ Fixes: 033b667a823e ("block: blk-rq-qos: guard rq-qos helpers by static key") Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250814082612.500845-4-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
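A sketch of the resulting fast path, using the flag name from the commit message (bit value illustrative, structures simplified):

    #include <stddef.h>

    #define QUEUE_FLAG_QOS_ENABLED (1UL << 0)       /* illustrative bit */

    struct request_queue { unsigned long queue_flags; void *rq_qos; };

    /* A plain flag test in an already-hot cacheline replaces the static
     * key: no runtime text patching, hence no cpu_hotplug_lock. */
    static void *rq_qos_fetch(struct request_queue *q)
    {
            if (!(q->queue_flags & QUEUE_FLAG_QOS_ENABLED))
                    return NULL;
            return q->rq_qos;
    }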
2025-08-21  block: decrement block_rq_qos static key in rq_qos_del()  (Nilay Shroff, 1 file, -0/+1)
rq_qos_add() increments the block_rq_qos static key when a QoS policy is attached. When a QoS policy is removed via rq_qos_del(), we must symmetrically decrement the static key. If this removal drops the last QoS policy from the queue (q->rq_qos becomes NULL), the static branch can be disabled and the jump label patched to a NOP, avoiding overhead on the hot path. This change ensures rq_qos_add()/rq_qos_del() keep the block_rq_qos static key balanced and prevents leaving the branch permanently enabled after the last policy is removed. Fixes: 033b667a823e ("block: blk-rq-qos: guard rq-qos helpers by static key") Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250814082612.500845-3-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-21  block: skip q->rq_qos check in rq_qos_done_bio()  (Nilay Shroff, 1 file, -2/+8)
If a bio has BIO_QOS_THROTTLED or BIO_QOS_MERGED set, it implicitly guarantees that q->rq_qos is present. Avoid re-checking q->rq_qos in this case and call __rq_qos_done_bio() directly as a minor optimization. Suggested-by: Yu Kuai <yukuai1@huaweicloud.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250814082612.500845-2-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-21  spi: st: fix PM macros to use CONFIG_PM instead of CONFIG_PM_SLEEP  (Raphael Gallais-Pou, 1 file, -5/+5)
pm_sleep_ptr() depends on CONFIG_PM_SLEEP while pm_ptr() depends on CONFIG_PM. Since ST SSC4 implements runtime PM, it makes sense to use pm_ptr() here. For the same reason, replace the PM macros with ones that use CONFIG_PM. Doing so avoids the need for the __maybe_unused attribute on the runtime PM functions. Link: https://lore.kernel.org/lkml/CAMuHMdX9nkROkAJJ5odv4qOWe0bFTmaFs=Rfxsfuc9+DT-bsEQ@mail.gmail.com Fixes: 6f8584a4826f ("spi: st: Switch from CONFIG_PM_SLEEP guards to pm_sleep_ptr()") Signed-off-by: Raphael Gallais-Pou <rgallaispou@gmail.com> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Link: https://patch.msgid.link/20250820180310.9605-1-rgallaispou@gmail.com Signed-off-by: Mark Brown <broonie@kernel.org>
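For reference, the difference between the two guards, in heavily simplified form (the real kernel definitions are more involved):

    /* Kept whenever CONFIG_PM is set: covers runtime PM too. */
    #ifdef CONFIG_PM
    #define pm_ptr(ops)             (ops)
    #else
    #define pm_ptr(ops)             ((void *)0)
    #endif

    /* Compiled out unless system sleep support is also enabled. */
    #ifdef CONFIG_PM_SLEEP
    #define pm_sleep_ptr(ops)       (ops)
    #else
    #define pm_sleep_ptr(ops)       ((void *)0)
    #endif

Because a driver's runtime PM callbacks stay referenced through pm_ptr() whenever CONFIG_PM is enabled, the compiler no longer sees them as unused, so the __maybe_unused annotations can be dropped.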