Age | Commit message | Author | Files | Lines
2025-09-30 | net/mlx5e: Prevent entering switchdev mode with inconsistent netns | Jianbo Liu | 1 | -0/+33
When a PF enters switchdev mode, its netdevice becomes the uplink representor but remains in its current network namespace. All other representors (VFs, SFs) are created in the netns of the devlink instance. If the PF's netns has been moved and differs from the devlink's netns, enabling switchdev mode would create a state where the OVS control plane (ovs-vsctl) cannot manage the switch because the PF uplink representor and the other representors are split across different namespaces. To prevent this inconsistent configuration, block the request to enter switchdev mode if the PF netdevice's netns does not match the netns of its devlink instance. As part of this change, the PF's netns is first marked as immutable. This prevents race conditions where the netns could be changed after the check is performed but before the mode transition is complete, and it aligns the PF's behavior with that of the final uplink representor. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1759094723-843774-3-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-30 | net/mlx5: HWS, Generalize complex matchers | Vlad Dogaru | 6 | -1199/+836
The existing solution of complex matchers splits the match parameters across two, and exactly two, matchers. For some rather extreme cases (e.g. IPv6-in-IPv6 tunnels), even two matchers are not enough. Generalize complex matchers to up to 4 submatchers, and allow easy extension to more if needed. This resulted in rewriting a large part of the high-level complex matchers logic, but the original concepts were rock solid and still hold. Key characteristics of the new implementation: * Rework complex matchers to include multiple submatchers. All submatchers but the first are isolated, in keeping with the existing paradigm of handing off to specialized matchers that are not otherwise reachable by regular rules. * Similarly, rework complex rules to allow splitting them into more than two simple rules. Rules continue to be refcounted to allow for multiple complex rules matching on identical parts of the match params. * Rely on the match tag, as opposed to the entire match_param, to hash subrules. This results in lower memory usage. * Prefer to split the original user-supplied match parameters rather than the internal field descriptors. This avoids the awkward transition back and forth between the two formats. * Allow splitting multi-dword fields across matchers. The only restrictions that the new implementation imposes are: a) any fragment of an IP address must be accompanied by a match on the IP version; and b) a single lower dword of an IPv6 address cannot be present in a submatcher as it would be interpreted as an IPv4 address. * Employ a greedy algorithm to split the match params, as opposed to complete search. The results are not optimal, but the algorithm is now linear compared to exponential. Consequently, we see complex matcher creation time drop by two orders of magnitude in our tests. 
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1759094723-843774-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
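The greedy split described in the commit message above can be illustrated with a small userspace sketch. This is not the mlx5 HWS code: the function name demo_greedy_split(), the per-submatcher capacity, and the idea of measuring fields in whole dwords are all assumptions made for illustration, and the sketch does not split a multi-dword field across submatchers as the real implementation may. It only shows why a single greedy pass is linear and why its result can be worse than what an exhaustive search would find.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical limit, mirroring the "up to 4 submatchers" in the commit text. */
#define DEMO_MAX_SUBMATCHERS 4

/*
 * Greedy, single-pass split (illustrative only).
 * field_dwords[i] is the size of field i in dwords, capacity is how many
 * dwords one submatcher can hold (an assumed parameter). Each field goes
 * into the current submatcher if it fits; otherwise a new one is opened.
 * assignment[i] receives the submatcher index chosen for field i.
 * Returns the number of submatchers used, or -1 if the fields do not fit.
 */
static int demo_greedy_split(const unsigned int *field_dwords, size_t nfields,
                             unsigned int capacity, int *assignment)
{
    unsigned int used = 0;
    int cur = 0;
    size_t i;

    assert(field_dwords && assignment && capacity > 0);

    for (i = 0; i < nfields; i++) {
        if (field_dwords[i] > capacity)
            return -1;              /* this sketch never splits a field */
        if (used + field_dwords[i] > capacity) {
            if (++cur >= DEMO_MAX_SUBMATCHERS)
                return -1;          /* would need more than 4 submatchers */
            used = 0;
        }
        assignment[i] = cur;
        used += field_dwords[i];
    }
    return nfields ? cur + 1 : 0;
}
```

With fields of 3, 3, 2 and 2 dwords and a capacity of 5, this pass uses three submatchers, although pairing each 3 with a 2 would need only two. That is the kind of non-optimal result the commit message accepts in exchange for linear running time.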
2025-10-01 | Drivers: hv: Make CONFIG_HYPERV bool | Mukesh Rathor | 3 | -3/+3
With CONFIG_HYPERV and CONFIG_HYPERV_VMBUS separated, change CONFIG_HYPERV to bool from tristate. CONFIG_HYPERV now becomes the core Hyper-V hypervisor support, such as hypercalls, clocks/timers, Confidential Computing setup, PCI passthru, etc. that doesn't involve VMBus or VMBus devices. Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-10-01 | Drivers: hv: Add CONFIG_HYPERV_VMBUS option | Mukesh Rathor | 12 | -16/+25
At present the VMBus driver hinges on CONFIG_HYPERV, which entails a lot of builtin code and encompasses too much. It's not always clear what depends on builtin hv code and what depends on VMBus. Setting CONFIG_HYPERV as a module and fudging the Makefile to switch to builtin adds even more confusion. VMBus is an independent module and should have its own config option. Also, there are scenarios like baremetal dom0/root where support is built in with CONFIG_HYPERV but without VMBus. Lastly, more features are coming that use CONFIG_HYPERV and add more dependencies on it. So, create a fine-grained HYPERV_VMBUS option and update Kconfigs for dependency on VMBus. Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> # drivers/pci Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-09-30 | Merge tag 'timers-vdso-2025-09-29' of ↵ | Linus Torvalds | 30 | -285/+91
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull VDSO updates from Thomas Gleixner: - Further consolidation of the VDSO infrastructure and the common data store - Simplification of the related Kconfig logic - Improve the VDSO selftest suite * tag 'timers-vdso-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: selftests: vDSO: Drop vdso_test_clock_getres selftests: vDSO: vdso_test_abi: Add tests for clock_gettime64() selftests: vDSO: vdso_test_abi: Test CPUTIME clocks selftests: vDSO: vdso_test_abi: Use explicit indices for name array selftests: vDSO: vdso_test_abi: Drop clock availability tests selftests: vDSO: vdso_test_abi: Use ksft_finished() selftests: vDSO: vdso_test_abi: Correctly skip whole test with missing vDSO selftests: vDSO: Fix -Wunitialized in powerpc VDSO_CALL() wrapper vdso: Add struct __kernel_old_timeval forward declaration to gettime.h vdso: Gate VDSO_GETRANDOM behind HAVE_GENERIC_VDSO vdso: Drop Kconfig GENERIC_VDSO_TIME_NS vdso: Drop Kconfig GENERIC_VDSO_DATA_STORE vdso: Drop kconfig GENERIC_COMPAT_VDSO vdso: Drop kconfig GENERIC_VDSO_32 riscv: vdso: Untangle Kconfig logic time: Build generic update_vsyscall() only with generic time vDSO vdso/gettimeofday: Remove !CONFIG_TIME_NS stubs vdso: Move ENABLE_COMPAT_VDSO from core to arm64 ARM: VDSO: Remove cntvct_ok global variable vdso/datastore: Gate time data behind CONFIG_GENERIC_GETTIMEOFDAY
2025-09-30 | Merge tag 'timers-clocksource-2025-09-29' of ↵ | Linus Torvalds | 32 | -957/+1386
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull clocksource updates from Thomas Gleixner: - Further preparations for modular clocksource/event drivers - The usual device tree updates to support new chip variants and the related changes to those drivers - Avoid a 64-bit division in the TEGRA186 driver, which caused a build failure on 32-bit machines. - Small fixes, improvements and cleanups all over the place * tag 'timers-clocksource-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits) dt-bindings: timer: exynos4210-mct: Add compatible for ARTPEC-9 SoC clocksource/drivers/sh_cmt: Split start/stop of clock source and events clocksource/drivers/clps711x: Fix resource leaks in error paths clocksource/drivers/arm_global_timer: Add auto-detection for initial prescaler values clocksource/drivers/ingenic-sysost: Convert from round_rate() to determine_rate() clocksource/drivers/timer-tegra186: Don't print superfluous errors clocksource/drivers/timer-rtl-otto: Simplify documentation clocksource/drivers/timer-rtl-otto: Do not interfere with interrupts clocksource/drivers/timer-rtl-otto: Drop set_counter function clocksource/drivers/timer-rtl-otto: Work around dying timers clocksource/drivers/timer-ti-dm : Capture functionality for OMAP DM timer clocksource/drivers/arm_arch_timer_mmio: Add MMIO clocksource clocksource/drivers/arm_arch_timer_mmio: Switch over to standalone driver clocksource/drivers/arm_arch_timer: Add standalone MMIO driver ACPI: GTDT: Generate platform devices for MMIO timers clocksource/drivers/nxp-pit: Add NXP Automotive s32g2 / s32g3 support dt: bindings: fsl,vf610-pit: Add compatible for s32g2 and s32g3 clocksource/drivers/vf-pit: Rename the VF PIT to NXP PIT clocksource/drivers/vf-pit: Unify the function name for irq ack clocksource/drivers/vf-pit: Consolidate calls to pit_*_disable/enable ...
2025-09-30 | Drivers: hv: vmbus: Fix typos in vmbus_drv.c | Alok Tiwari | 1 | -2/+2
Fix two minor typos in vmbus_drv.c: - Correct "reponsible" -> "responsible" in a comment. - Add missing newline in pr_err() message ("channeln" -> "channel\n"). These are cosmetic changes only and do not affect functionality. Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-09-30 | Drivers: hv: vmbus: Fix sysfs output format for ring buffer index | Alok Tiwari | 1 | -2/+2
The sysfs attributes out_read_index and out_write_index in vmbus_drv.c currently use %d to print outbound.current_read_index and outbound.current_write_index. These fields are u32 values, so printing them with %d (signed) is not logically correct. Update the format specifier to %u to correctly match their type. No functional change, only fixes the sysfs output format. Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
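The effect of the format specifier described above can be reproduced in userspace. The following is a minimal sketch, not the vmbus_drv.c code: the helper names are hypothetical, and snprintf() stands in for the kernel's sysfs formatting helper. It prints the same 32-bit value once through %d and once through %u; the result of converting an out-of-range value to int is implementation-defined, and the negative output assumes a typical two's-complement platform.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: formats a u32 index as a signed value, as %d would. */
static int demo_format_index_signed(char *buf, size_t len, uint32_t idx)
{
    assert(buf && len > 0);
    /* The conversion to int is implementation-defined above INT_MAX. */
    return snprintf(buf, len, "%d\n", (int)idx);
}

/* Hypothetical helper: formats a u32 index with the matching %u specifier. */
static int demo_format_index_unsigned(char *buf, size_t len, uint32_t idx)
{
    assert(buf && len > 0);
    return snprintf(buf, len, "%u\n", (unsigned int)idx);
}
```

For an index of 4000000000, the %u variant yields "4000000000" while the %d variant yields a negative number on common platforms. Ring buffer indices rarely get that large, which is consistent with the commit describing the change as a format fix rather than a functional one.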
2025-09-30 | Drivers: hv: vmbus: Clean up sscanf format specifier in target_cpu_store() | Alok Tiwari | 1 | -1/+1
The target_cpu_store() function parses the target CPU from the sysfs buffer using sscanf(). The format string currently uses "%uu", which is redundant. The compiler ignores the extra "u", so there is no incorrect parsing at runtime. Update the format string to use "%u" for clarity and consistency. Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
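The harmlessness of the extra "u" can be checked with the C library's sscanf(), which is assumed here to behave like the kernel's implementation for this case. In a scanf-style format, a character that follows a conversion is a literal to be matched against the input. When that match fails after "%u" has already stored its value, the call still reports one successful conversion. The helper names below are hypothetical and only wrap the two format strings.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical wrapper around the old format string with the stray "u". */
static int demo_parse_cpu_old(const char *buf, unsigned int *cpu)
{
    assert(buf && cpu);
    /* "%u" converts the number; the trailing "u" is a literal to match. */
    return sscanf(buf, "%uu", cpu);
}

/* Hypothetical wrapper around the cleaned-up format string. */
static int demo_parse_cpu_new(const char *buf, unsigned int *cpu)
{
    assert(buf && cpu);
    return sscanf(buf, "%u", cpu);
}
```

For an input such as "12\n", both wrappers return 1 and store 12, so a caller that only checks for one converted item sees no difference between the two formats.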
2025-09-30 | Merge tag 'timers-core-2025-09-29' of ↵ | Linus Torvalds | 19 | -62/+54
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer core updates from Thomas Gleixner: - Address the inconsistent shutdown sequence of per CPU clockevents on CPU hotplug, which only removed it from the core but failed to invoke the actual device driver shutdown callback. This kept the timer active, which prevented power savings and caused pointless noise in virtualization. - Encapsulate the open coded access to the hrtimer clock base, which is a private implementation detail, so that the implementation can be changed without breaking a lot of usage sites. - Enhance the debug output of the clocksource watchdog to provide better information for analysis. - The usual set of cleanups and enhancements all over the place * tag 'timers-core-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: time: Fix spelling mistakes in comments clocksource: Print durations for sync check unconditionally LoongArch: Remove clockevents shutdown call on offlining tick: Do not set device to detached state in tick_shutdown() hrtimer: Reorder branches in hrtimer_clockid_to_base() hrtimer: Remove hrtimer_clock_base::get_time hrtimer: Use hrtimer_cb_get_time() helper media: pwm-ir-tx: Avoid direct access to hrtimer clockbase ALSA: hrtimer: Avoid direct access to hrtimer clockbase lib: test_objpool: Avoid direct access to hrtimer clockbase sched/core: Avoid direct access to hrtimer clockbase timers/itimer: Avoid direct access to hrtimer clockbase posix-timers: Avoid direct access to hrtimer clockbase jiffies: Remove obsolete SHIFTED_HZ comment
2025-09-30 | Merge tag 'locking-futex-2025-09-29' of ↵ | Linus Torvalds | 20 | -1090/+575
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull futex updates from Thomas Gleixner: "A set of updates for futexes and related selftests: - Plug the ptrace_may_access() race against a concurrent exec() which allows to pass the check before the target's process transition in exec() by taking a read lock on signal->exec_update_lock. - A large set of cleanups and enhancements to the futex selftests. The bulk of changes is the conversion to the kselftest harness" * tag 'locking-futex-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits) selftest/futex: Fix spelling mistake "boundarie" -> "boundary" selftests/futex: Remove logging.h file selftests/futex: Drop logging.h include from futex_numa selftests/futex: Refactor futex_numa_mpol with kselftest_harness.h selftests/futex: Refactor futex_priv_hash with kselftest_harness.h selftests/futex: Refactor futex_waitv with kselftest_harness.h selftests/futex: Refactor futex_requeue with kselftest_harness.h selftests/futex: Refactor futex_wait with kselftest_harness.h selftests/futex: Refactor futex_wait_private_mapped_file with kselftest_harness.h selftests/futex: Refactor futex_wait_unitialized_heap with kselftest_harness.h selftests/futex: Refactor futex_wait_wouldblock with kselftest_harness.h selftests/futex: Refactor futex_wait_timeout with kselftest_harness.h selftests/futex: Refactor futex_requeue_pi_signal_restart with kselftest_harness.h selftests/futex: Refactor futex_requeue_pi_mismatched_ops with kselftest_harness.h selftests/futex: Refactor futex_requeue_pi with kselftest_harness.h selftests: kselftest: Create ksft_print_dbg_msg() futex: Don't leak robust_list pointer on exec race selftest/futex: Compile also with libnuma < 2.0.16 selftest/futex: Reintroduce "Memory out of range" numa_mpol's subtest selftest/futex: Make the error check more precise for futex_numa_mpol ...
2025-09-30 | Merge tag 'smp-core-2025-09-29' of ↵ | Linus Torvalds | 1 | -6/+5
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull smp doc fixlet from Thomas Gleixner: "An update of the stale smp_call_function_many() documentation to bring it back in sync with the actual implementation" * tag 'smp-core-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: smp: Fix up and expand the smp_call_function_many() kerneldoc
2025-09-30 | Merge tag 'irq-drivers-2025-09-29' of ↵ | Linus Torvalds | 20 | -156/+399
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq chip driver updates from Thomas Gleixner: - Use the startup/shutdown callbacks for the PCI/MSI per device interrupt domains. This allows us to initialize the RISCV PLIC interrupt hierarchy correctly and provides a mechanism to decouple the masking and unmasking during run-time from the expensive PCI mask and unmask when the underlying MSI provider implementation allows the interrupt to be masked. - Initialize the RISCV PLIC MSI interrupt hierarchy correctly so that the affinity assignment works correctly by switching it over to the startup/shutdown scheme - Allow MSI providers to opt out from masking a PCI/MSI interrupt at the PCI device during operation when the provider can mask the interrupt at the underlying interrupt chip. This reduces the overhead in scenarios where disable_irq()/enable_irq() is utilized frequently by a driver. The PCI/MSI device level [un]masking is only required on startup and shutdown in this case. - Remove the conditional mask/unmask logic in the PCI/MSI layer as this is now handled unconditionally. - Replace the hardcoded interrupt routing in the Loongson EIOINTC interrupt driver to respect the firmware settings and spread them out to different CPU interrupt inputs so that the demultiplexing handler needs to read only a single 64-bit status register instead of four, which significantly reduces the overhead in VMs as the status register access causes a VM exit. - Add support for the new AST2700 SCU interrupt controllers - Use the legacy interrupt domain setup for the Loongson PCH-LPC interrupt controller, which resembles the x86 legacy PIC setup and has the same hardcoded legacy requirements. 
- The usual set of cleanups, fixes and improvements all over the place * tag 'irq-drivers-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits) irqchip/loongson-pch-lpc: Use legacy domain for PCH-LPC IRQ controller PCI/MSI: Remove the conditional parent [un]mask logic irqchip/msi-lib: Honor the MSI_FLAG_PCI_MSI_MASK_PARENT flag irqchip/aspeed-scu-ic: Add support for AST2700 SCU interrupt controllers dt-bindings: interrupt-controller: aspeed: Add AST2700 SCU IC compatibles dt-bindings: mfd: aspeed: Add AST2700 SCU compatibles irqchip/aspeed-scu-ic: Refactor driver to support variant-based initialization irqchip/gic-v5: Fix error handling in gicv5_its_irq_domain_alloc() irqchip/gic-v5: Fix loop in gicv5_its_create_itt_two_level() cleanup path irqchip/gic-v5: Delete a stray tab irqchip/sg2042-msi: Set irq type according to DT configuration riscv: sophgo: dts: sg2044: Change msi irq type to IRQ_TYPE_EDGE_RISING riscv: sophgo: dts: sg2042: Change msi irq type to IRQ_TYPE_EDGE_RISING irqchip/gic-v2m: Handle Multiple MSI base IRQ Alignment irqchip/renesas-rzg2l: Remove dev_err_probe() if error is -ENOMEM irqchip: Use int type to store negative error codes irqchip/gic-v5: Remove the redundant ITS cache invalidation PCI/MSI: Check MSI_FLAG_PCI_MSI_MASK_PARENT in cond_[startup|shutdown]_parent() irqchip/loongson-eiointc: Add multiple interrupt pin routing support irqchip/loongson-eiointc: Route interrupt parsed from bios table ...
2025-09-30 | Merge tag 'irq-core-2025-09-29' of ↵ | Linus Torvalds | 10 | -106/+344
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq core updates from Thomas Gleixner: "A set of updates for the interrupt core subsystem: - Introduce irq_chip_[startup|shutdown]_parent() to prepare for addressing a few shortcomings in the PCI/MSI interrupt subsystem. It allows to utilize the interrupt chip startup/shutdown callbacks for initializing the interrupt chip hierarchy properly on certain RISCV implementations and provides a mechanism to reduce the overhead of masking and unmasking PCI/MSI interrupts during operation when the underlying MSI provider can mask the interrupt. The actual usage comes with the interrupt driver pull request. - Add generic error handling for devm_request_*_irq() This allows to remove the zoo of random error printk's all over the usage sites. - Add a mechanism to warn about long-running interrupt handlers Long running interrupt handlers can introduce latencies and tracking them down is a tedious task. The tracking has to be enabled with a threshold on the kernel command line and utilizes a static branch to remove the overhead when disabled. - Update and extend the selftests which validate the CPU hotplug interrupt migration logic - Allow dropping the per CPU softirq lock on PREEMPT_RT kernels, which causes contention and latencies all over the place. The serialization requirements have been pushed down into the actual affected usage sites already. 
- The usual small cleanups and improvements" * tag 'irq-core-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT softirq: Provide a handshake for canceling tasklets via polling genirq/test: Ensure CPU 1 is online for hotplug test genirq/test: Drop CONFIG_GENERIC_IRQ_MIGRATION assumptions genirq/test: Depend on SPARSE_IRQ genirq/test: Fail early if interrupt request fails genirq/test: Factor out fake-virq setup genirq/test: Select IRQ_DOMAIN genirq/test: Fix depth tests on architectures with NOREQUEST by default. genirq: Add support for warning on long-running interrupt handlers genirq/devres: Add error handling in devm_request_*_irq() genirq: Add irq_chip_(startup/shutdown)_parent() genirq: Remove GENERIC_IRQ_LEGACY
2025-09-30 | x86/hyperv: Switch to msi_create_parent_irq_domain() | Nam Cao | 2 | -35/+77
Move away from the legacy MSI domain setup, switch to use msi_create_parent_irq_domain(). While doing the conversion, I noticed that hv_irq_compose_msi_msg() is doing more than it is supposed to (composing message content). The interrupt allocation bits should be moved into hv_msi_domain_alloc(). However, I have no hardware to test this change, therefore I leave a TODO note. Signed-off-by: Nam Cao <namcao@linutronix.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-09-30 | mshv: Use common "entry virt" APIs to do work in root before running guest | Sean Christopherson | 4 | -50/+7
Use the kernel's common "entry virt" APIs to handle pending work prior to (re)entering guest mode, now that the virt APIs don't have a superfluous dependency on KVM. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-09-30 | entry: Rename "kvm" entry code assets to "virt" to genericize APIs | Sean Christopherson | 12 | -19/+19
Rename the "kvm" entry code files and Kconfigs to use generic "virt" nomenclature so that the code can be reused by other hypervisors (or rather, their root/dom0 partition drivers), without incorrectly suggesting the code somehow relies on and/or involves KVM. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-09-30 | entry/kvm: KVM: Move KVM details related to signal/-EINTR into KVM proper | Sean Christopherson | 8 | -26/+24
Move KVM's morphing of pending signals into userspace exits into KVM proper, and drop the @vcpu param from xfer_to_guest_mode_handle_work(). How KVM responds to -EINTR is a detail that really belongs in KVM itself, and invoking kvm_handle_signal_exit() from kernel code creates an inverted module dependency. E.g. attempting to move kvm_handle_signal_exit() into kvm_main.c would generate a linker error when building kvm.ko as a module. Dropping KVM details will also help convert the KVM "entry" code into a more generic virtualization framework so that it can be used when running as a Hyper-V root partition. Lastly, eliminating usage of "struct kvm_vcpu" outside of KVM is also nice to have for KVM x86 developers, as keeping the details of kvm_vcpu purely within KVM allows changing the layout of the structure without having to boot into a new kernel, e.g. allows rebuilding and reloading kvm.ko with a modified kvm_vcpu structure as part of debug/development. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-09-30 | mshv: Handle NEED_RESCHED_LAZY before transferring to guest | Sean Christopherson | 2 | -2/+3
Check for NEED_RESCHED_LAZY, not just NEED_RESCHED, prior to transferring control to a guest. Failure to check for lazy resched can unnecessarily delay rescheduling until the next tick when using a lazy preemption model. Note, ideally both the checking and processing of TIF bits would be handled in common code, to avoid having to keep three separate paths synchronized, but defer such cleanups to the future to keep the fix as standalone as possible. Cc: Nuno Das Neves <nunodasneves@linux.microsoft.com> Cc: Mukesh R <mrathor@linux.microsoft.com> Fixes: 621191d709b1 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs") Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-09-30 | x86/hyperv: Add kexec/kdump support on Azure CVMs | Vitaly Kuznetsov | 1 | -1/+210
Azure CVM instance types featuring a paravisor hang upon kdump. The investigation shows that makedumpfile causes a hang when it steps on a page which was previously shared with the host (HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY). The new kernel has no knowledge of these 'special' regions (which are VMBus connection pages, GPADL buffers, ...). There are several ways to approach the issue: - Convey the knowledge about these regions to the new kernel somehow. - Unshare these regions before accessing in the new kernel (it is unclear if there's a way to query the status for a given GPA range). - Unshare these regions before jumping to the new kernel (which this patch implements). To make the procedure as robust as possible, store PFN ranges of shared regions in a linked list instead of storing GVAs and re-using hv_vtom_set_host_visibility(). This also avoids memory allocation on the kdump/kexec path. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Tianyu Lan <tiala@microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-09-30 | net/mlx5: Improve write-combining test reliability for ARM64 Grace CPUs | Patrisious Haddad | 1 | -2/+26
Write combining is an optimization feature in CPUs that is frequently used by modern devices to generate 32 or 64 byte TLPs at the PCIe level. These large TLPs allow certain optimizations in the driver to HW communication that improve performance. As WC is unpredictable and optional the HW designs all tolerate cases where combining doesn't happen and simply experience a performance degradation. Unfortunately many virtualization environments on all architectures have done things that completely disable WC inside the VM with no generic way to detect this. For example WC was fully blocked in ARM64 KVM until commit 8c47ce3e1d2c ("KVM: arm64: Set io memory s2 pte as normalnc for vfio pci device"). Trying to use WC when it is known not to work has a measurable performance cost (~5%). Long ago mlx5 developed a boot time algorithm to test if WC is available or not by using unique mlx5 HW features to measure how many large TLPs the device is receiving. The SW generates a large number of combining opportunities and if any succeed then WC is declared working. In mlx5 the WC optimization feature is never used by the kernel except for the boot time test. The WC is only used by userspace in rdma-core. Sadly modern ARM CPUs, especially NVIDIA Grace, have a combining implementation that is very unreliable compared to pretty much everything prior. This is being fixed architecturally in new CPUs with a new ST64B instruction, but current shipping devices suffer this problem. Unreliable means the SW can present thousands of combining opportunities and the HW will not combine for any of them, which creates a performance degradation, and critically fails the mlx5 boot test. However, the CPU is very sensitive to the instruction sequence used, with the better options being sufficiently good that the performance loss from the unreliable CPU is not measurable. Broadly there are several options, from worst to best: 1) A C loop doing a u64 memcpy. 
This was used prior to commit ef302283ddfc ("IB/mlx5: Use __iowrite64_copy() for write combining stores") and failed almost all the time on Grace CPUs. 2) ARM64 assembly with consecutive 8 byte stores. This was implemented as an arch-generic __iowriteXX_copy() family of functions suitable for performance use in drivers for WC. commit ead79118dae6 ("arm64/io: Provide a WC friendly __iowriteXX_copy()") provided the ARM implementation. 3) ARM64 assembly with consecutive 16 byte stores. This was rejected from kernel use over fears of virtualization failures. Common ARM VMMs will crash if STP is used against emulated memory. 4) A single NEON store instruction. Userspace has used this option for a very long time, it performs well. 5) For future silicon the new ST64B instruction is guaranteed to generate a 64 byte TLP 100% of the time. The past upgrade from #1 to #2 was thought to be sufficient to solve this problem. However, more testing on more systems shows that #2 is still problematic at a low frequency and the kernel test fails. Thus, make the mlx5 use the same instructions as userspace during the boot time WC self test. This way the WC test matches the userspace and will properly detect the ability of HW to support the WC workload that userspace will generate. While #4 still has imperfect combining performance, it is substantially better than #2, and does actually give a performance win to applications. Self-test failures with #2 occur in roughly 3 of 10 boots on some systems, while #4 has never seen a boot failure. There is no real general use case for a NEON based WC flow in the kernel. This is not suitable for any performance path work as getting into/out of a NEON context is fairly expensive compared to the gain of WC. Future CPUs are going to fix this issue by using a new ARM instruction and __iowriteXX_copy() will be updated to use that automatically, probably using the ALTERNATIVES mechanism. 
Since this problem is constrained to mlx5's unique situation of needing a non-performance code path to duplicate what mlx5 userspace is doing as a matter of self-testing, implement it as a one line inline assembly in the driver directly. Lastly, this was concluded from the discussion with ARM maintainers which confirms that this is the best approach for the solution: https://lore.kernel.org/r/aHqN_hpJl84T1Usi@arm.com Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1759093688-841357-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-30 | Merge tag 'core-rseq-2025-09-29' of ↵ | Linus Torvalds | 3 | -12/+17
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull rseq updates from Thomas Gleixner: "Two fixes for RSEQ: - Protect the event mask modification against the membarrier() IPI as otherwise the RmW operation is unprotected and events might be lost - Fix the weak symbol reference in rseq selftests The current weak RSEQ symbols definitions which were added to allow static linkage are not working correctly as they effectively re-define the glibc symbols leading to multiple versions of the symbols when compiled with -fno-common. Mark them as 'extern' to convert them from weak symbol definitions to weak symbol references. That works with static and dynamic linkage independent of -fcommon and -fno-common" * tag 'core-rseq-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rseq/selftests: Use weak symbol reference, not definition, to link with glibc rseq: Protect event mask against membarrier IPI
2025-09-30 | Merge tag 'core-core-2025-09-29' of ↵ | Linus Torvalds | 10 | -139/+150
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull TIF bit unification updates from Thomas Gleixner: "A set of changes to consolidate the generic TIF (thread info flag) bits across architectures. All architectures define the same set of generic TIF bits. This makes it pointlessly hard to add a new generic TIF bit or to change an existing one. Provide a generic variant and convert the architectures which utilize the generic entry code over to use it. The TIF space is divided into 16 generic bits and 16 architecture specific bits, which turned out to provide enough space on both sides" * tag 'core-core-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: LoongArch: Fix bitflag conflict for TIF_FIXADE riscv: Use generic TIF bits loongarch: Use generic TIF bits s390/entry: Remove unused TIF flags s390: Use generic TIF bits x86: Use generic TIF bits asm-generic: Provide generic TIF infrastructure
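The 16/16 division of the TIF space mentioned above can be pictured as two halves of a 32-bit flags word. The sketch below is an illustration only: the DEMO_* names and helpers are hypothetical, and placing the generic bits in the low half is an assumption that the pull message itself does not state.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: low 16 bits generic, high 16 bits arch specific. */
#define DEMO_TIF_GENERIC_BITS 16u
#define DEMO_TIF_ARCH_BITS    16u
#define DEMO_TIF_GENERIC_MASK 0x0000ffffu
#define DEMO_TIF_ARCH_MASK    0xffff0000u

/* Flag mask for generic bit number 0..15. */
static uint32_t demo_generic_flag(unsigned int bit)
{
    assert(bit < DEMO_TIF_GENERIC_BITS);
    return (uint32_t)1 << bit;
}

/* Flag mask for architecture specific bit number 0..15. */
static uint32_t demo_arch_flag(unsigned int bit)
{
    assert(bit < DEMO_TIF_ARCH_BITS);
    return (uint32_t)1 << (DEMO_TIF_GENERIC_BITS + bit);
}
```

Because the two ranges never overlap, a new generic flag can be allocated from the generic half without auditing every architecture's private bits, which is the maintenance benefit the pull message describes.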
2025-09-30 | tracing: Ensure optimized hashing works | Michal Koutný | 1 | -0/+2
If ever PID_MAX_DEFAULT changes, it must be compatible with tracing hashmaps assumptions. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250924113810.2433478-1-mkoutny@suse.com Link: https://lore.kernel.org/r/20240409110126.651e94cb@gandalf.local.home/ Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-30ftrace: Fix softlockup in ftrace_module_enableVladimir Riabchun1-0/+2
A soft lockup was observed when loading the amdgpu module. If a module has a lot of traceable functions, multiple calls to kallsyms_lookup can spend too much time in an RCU critical section with preemption disabled, causing a kernel panic. This is the same issue that was fixed in commit d0b24b4e91fc ("ftrace: Prevent RCU stall on PREEMPT_VOLUNTARY kernels") and commit 42ea22e754ba ("ftrace: Add cond_resched() to ftrace_graph_set_hash()"). Fix it the same way by adding cond_resched() in ftrace_module_enable. Link: https://lore.kernel.org/aMQD9_lxYmphT-up@vova-pc Signed-off-by: Vladimir Riabchun <ferr.lambarginio@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-30fbdev: simplefb: Fix use after free in simplefb_detach_genpds()Janne Grunau1-8/+23
The pm_domain cleanup can not be devres managed as it uses struct simplefb_par which is allocated within struct fb_info by framebuffer_alloc(). This allocation is explicitly freed by unregister_framebuffer() in simplefb_remove(). Devres managed cleanup runs after the device remove call and thus can no longer access struct simplefb_par. Call simplefb_detach_genpds() explicitly from simplefb_destroy() like the cleanup functions for clocks and regulators. Fixes an use after free on M2 Mac mini during aperture_remove_conflicting_devices() using the downstream asahi kernel with Debian's kernel config. For unknown reasons this started to consistently dereference an invalid pointer in v6.16.3 based kernels. [ 6.736134] BUG: KASAN: slab-use-after-free in simplefb_detach_genpds+0x58/0x220 [ 6.743545] Read of size 4 at addr ffff8000304743f0 by task (udev-worker)/227 [ 6.750697] [ 6.752182] CPU: 6 UID: 0 PID: 227 Comm: (udev-worker) Tainted: G S 6.16.3-asahi+ #16 PREEMPTLAZY [ 6.752186] Tainted: [S]=CPU_OUT_OF_SPEC [ 6.752187] Hardware name: Apple Mac mini (M2, 2023) (DT) [ 6.752189] Call trace: [ 6.752190] show_stack+0x34/0x98 (C) [ 6.752194] dump_stack_lvl+0x60/0x80 [ 6.752197] print_report+0x17c/0x4d8 [ 6.752201] kasan_report+0xb4/0x100 [ 6.752206] __asan_report_load4_noabort+0x20/0x30 [ 6.752209] simplefb_detach_genpds+0x58/0x220 [ 6.752213] devm_action_release+0x50/0x98 [ 6.752216] release_nodes+0xd0/0x2c8 [ 6.752219] devres_release_all+0xfc/0x178 [ 6.752221] device_unbind_cleanup+0x28/0x168 [ 6.752224] device_release_driver_internal+0x34c/0x470 [ 6.752228] device_release_driver+0x20/0x38 [ 6.752231] bus_remove_device+0x1b0/0x380 [ 6.752234] device_del+0x314/0x820 [ 6.752238] platform_device_del+0x3c/0x1e8 [ 6.752242] platform_device_unregister+0x20/0x50 [ 6.752246] aperture_detach_platform_device+0x1c/0x30 [ 6.752250] aperture_detach_devices+0x16c/0x290 [ 6.752253] aperture_remove_conflicting_devices+0x34/0x50 ... 
[ 6.752343] [ 6.967409] Allocated by task 62: [ 6.970724] kasan_save_stack+0x3c/0x70 [ 6.974560] kasan_save_track+0x20/0x40 [ 6.978397] kasan_save_alloc_info+0x40/0x58 [ 6.982670] __kasan_kmalloc+0xd4/0xd8 [ 6.986420] __kmalloc_noprof+0x194/0x540 [ 6.990432] framebuffer_alloc+0xc8/0x130 [ 6.994444] simplefb_probe+0x258/0x2378 ... [ 7.054356] [ 7.055838] Freed by task 227: [ 7.058891] kasan_save_stack+0x3c/0x70 [ 7.062727] kasan_save_track+0x20/0x40 [ 7.066565] kasan_save_free_info+0x4c/0x80 [ 7.070751] __kasan_slab_free+0x6c/0xa0 [ 7.074675] kfree+0x10c/0x380 [ 7.077727] framebuffer_release+0x5c/0x90 [ 7.081826] simplefb_destroy+0x1b4/0x2c0 [ 7.085837] put_fb_info+0x98/0x100 [ 7.089326] unregister_framebuffer+0x178/0x320 [ 7.093861] simplefb_remove+0x3c/0x60 [ 7.097611] platform_remove+0x60/0x98 [ 7.101361] device_remove+0xb8/0x160 [ 7.105024] device_release_driver_internal+0x2fc/0x470 [ 7.110256] device_release_driver+0x20/0x38 [ 7.114529] bus_remove_device+0x1b0/0x380 [ 7.118628] device_del+0x314/0x820 [ 7.122116] platform_device_del+0x3c/0x1e8 [ 7.126302] platform_device_unregister+0x20/0x50 [ 7.131012] aperture_detach_platform_device+0x1c/0x30 [ 7.136157] aperture_detach_devices+0x16c/0x290 [ 7.140779] aperture_remove_conflicting_devices+0x34/0x50 ... Reported-by: Daniel Huhardeaux <tech@tootai.net> Cc: stable@vger.kernel.org Fixes: 92a511a568e44 ("fbdev/simplefb: Add support for generic power-domains") Signed-off-by: Janne Grunau <j@jannau.net> Reviewed-by: Hans de Goede <hansg@kernel.org> Signed-off-by: Helge Deller <deller@gmx.de>
2025-09-30fbdev: s3fb: Revert mclk stop in suspendZsolt Kajtar1-2/+2
There are systems which want to wait for as long as it takes for the stopped video memory to answer. Mapping it out helps to avoid that while the system is running but standby still hangs somehow. So just leave the memory on in standby same as it was before my change. Signed-off-by: Zsolt Kajtar <soci@c64.rulez.org> Signed-off-by: Helge Deller <deller@gmx.de>
2025-09-30fbdev: mb862xxfb: Use int type to store negative error codesQianfeng Rong1-1/+1
Change the 'ret' variable in of_platform_mb862xx_probe() from unsigned long to int, as it needs to store either negative error codes or zero. Storing negative error codes in an unsigned type doesn't cause an issue at runtime but can be confusing. Additionally, assigning negative error codes to an unsigned type may trigger a GCC warning when the -Wsign-conversion flag is enabled. No effect on runtime. Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com> Signed-off-by: Helge Deller <deller@gmx.de>
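A minimal userspace sketch of why the unsigned holder is confusing yet harmless at runtime. The helper names are hypothetical and are not taken from the mb862xxfb driver; the conversion back to int is implementation-defined in ISO C, and the sketch assumes the wrapping behaviour that GCC and Clang document.

```c
#include <errno.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for a call that returns 0 or a negative errno. */
static inline int example_step(bool fail)
{
	return fail ? -ENOMEM : 0;
}

/* Old pattern: the result is parked in an unsigned long. */
static inline int probe_with_unsigned(bool fail)
{
	/*
	 * The explicit cast keeps this sketch warning-free; the implicit
	 * form of this assignment is what -Wsign-conversion reports.
	 */
	unsigned long ret = (unsigned long)example_step(fail);

	/*
	 * While stored, an error looks like a huge positive number.
	 * Converting back to int recovers the negative code on GCC and
	 * Clang (assumption: modulo wrapping on narrowing conversion).
	 */
	return (int)ret;
}

/* New pattern: an int holds the code directly, with no conversions. */
static inline int probe_with_int(bool fail)
{
	int ret = example_step(fail);

	return ret;
}

/* Shows what the error code looks like while held as unsigned long. */
static inline bool looks_huge_when_unsigned(int err)
{
	return (unsigned long)err > (unsigned long)INT_MAX;
}
```

Both variants hand the same value to the caller, which matches the "no effect on runtime" note; the int version simply removes the round trip through an unsigned type.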
2025-09-30fbdev: Use string choices helpersChelsy Ratnawat3-5/+8
Use string_choices.h helpers instead of hard-coded strings. Signed-off-by: Chelsy Ratnawat <chelsyratnawat2001@gmail.com> Signed-off-by: Helge Deller <deller@gmx.de>
2025-09-30fbdev: core: Fix ubsan warning in pixel_to_patZsolt Kajtar1-2/+1
It could be triggered on 32 bit big endian machines at 32 bpp in the pattern realignment. In this case just return early as the result is an identity. Signed-off-by: Zsolt Kajtar <soci@c64.rulez.org> Signed-off-by: Helge Deller <deller@gmx.de>
2025-09-30fbdev: s3fb: Implement 1 and 2 BPP modes, improve 4 BPPZsolt Kajtar2-36/+105
With the right setup S3 cards can display 1 and 2 BPP packed pixel modes, even in high resolutions. So this patch makes them available. The 4 BPP packed pixel mode had one pixel column of garbage on the left side due to how the shift register works; this is fixed now. There was a limitation that only 8 pixel wide fonts could be used at 4 BPP. Since the CFB routines were updated to handle reverse pixel ordering correctly, that limitation no longer exists and has been removed. In 4 BPP interleaved planes mode, font widths that are multiples of 8 are accepted now, not just 8 pixels. The horizontal screen position will not move as much between modes as it used to. That was caused by the varying amount of pipeline delay, which is now compensated as much as possible. While adjusting the code, direct port access of PEL registers was corrected. It should now work on systems where these are memory mapped. I've noticed that when the console is used with Unicode fonts in 1 BPP mode, erasing might be done with non-blanks. That's a bug in the VT code and so not part of this patch. Signed-off-by: Zsolt Kajtar <soci@c64.rulez.org> Signed-off-by: Helge Deller <deller@gmx.de>
2025-09-30fbdev: s3fb: Implement powersave for S3 FBZsolt Kajtar1-18/+19
This patch implements power saving for S3 cards by powering down the RAMDAC and stopping MCLK and DCLK while the card is supposed to be suspended. The RAMDAC is also disabled while the screen is blanked and the DCLK is stopped while in DPMS power off. The practical difference is that on a machine with such a card the display will be placed in DPMS power off while standby is activated (due to the stopped DCLK), the same as with other cards that implement power saving. Without it, on my setup the connected display powers up and stays that way showing VT63 while in standby. That is sort of annoying, as before standby it's specifically placed into DPMS off in Xorg for a while. The functionality used should exist on Trio32 to Aurora64V (according to the documentation), so I think it's generally applicable. I'm using this on S3 Trio 3D and S3 Virge DX. Signed-off-by: Zsolt Kajtar <soci@c64.rulez.org> Signed-off-by: Helge Deller <deller@gmx.de>
2025-09-30fbdev: xenfb: Use vmalloc_array to simplify codeQianfeng Rong1-1/+1
Use vmalloc_array() instead of vmalloc() to simplify the function xenfb_probe(). Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com> Signed-off-by: Helge Deller <deller@gmx.de>
2025-09-30Merge tag 'x86_apic_for_v6.18_rc1' of ↵Linus Torvalds61-708/+2042
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 SEV and apic updates from Borislav Petkov: - Add functionality to provide runtime firmware updates for the non-x86 parts of an AMD platform like the security processor (ASP) firmware, modules etc, for example. The intent being that these updates are interim, live fixups before a proper BIOS update can be attempted - Add guest support for AMD's Secure AVIC feature which gives encrypted guests the needed protection against a malicious hypervisor generating unexpected interrupts and injecting them into such guest, thus interfering with its operation in an unexpected and negative manner. The advantage of this scheme is that the guest determines which interrupts and when to accept them vs leaving that to the benevolence (or not) of the hypervisor - Strictly separate the startup code from the rest of the kernel where former is executed from the initial 1:1 mapping of memory. The problem was that the toolchain-generated version of the code was being executed from a different mapping of memory than what was "assumed" during code generation, needing an ever-growing pile of fixups for absolute memory references which are invalid in the early, 1:1 memory mapping during boot. The major advantage of this is that there's no need to check the 1:1 mapping portion of the code for absolute relocations anymore and get rid of the RIP_REL_REF() macro sprinkling all over the place. 
For more info, see Ard's very detailed writeup on this [1] - The usual cleanups and fixes Link: https://lore.kernel.org/r/CAMj1kXEzKEuePEiHB%2BHxvfQbFz0sTiHdn4B%2B%2BzVBJ2mhkPkQ4Q@mail.gmail.com [1] * tag 'x86_apic_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (49 commits) x86/boot: Drop erroneous __init annotation from early_set_pages_state() crypto: ccp - Add AMD Seamless Firmware Servicing (SFS) driver crypto: ccp - Add new HV-Fixed page allocation/free API x86/sev: Add new dump_rmp parameter to snp_leak_pages() API x86/startup/sev: Document the CPUID flow in the boot #VC handler objtool: Ignore __pi___cfi_ prefixed symbols x86/sev: Zap snp_abort() x86/apic/savic: Do not use snp_abort() x86/boot: Get rid of the .head.text section x86/boot: Move startup code out of __head section efistub/x86: Remap inittext read-execute when needed x86/boot: Create a confined code area for startup code x86/kbuild: Incorporate boot/startup/ via Kbuild makefile x86/boot: Revert "Reject absolute references in .head.text" x86/boot: Check startup code for absence of absolute relocations objtool: Add action to check for absence of absolute relocations x86/sev: Export startup routines for later use x86/sev: Move __sev_[get|put]_ghcb() into separate noinstr object x86/sev: Provide PIC aliases for SEV related data objects x86/boot: Provide PIC aliases for 5-level paging related constants ...
2025-09-30Merge tag 'x86_cache_for_v6.18_rc1' of ↵Linus Torvalds16-229/+2021
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 resource control updates from Borislav Petkov: "Add support on AMD for assigning QoS bandwidth counters to resources (RMIDs) with the ability for those resources to be tracked by the counters as long as they're assigned to them. Previously, due to hw limitations, bandwidth counts from untracked resources would get lost when those resources are not tracked. Refactor the code and user interfaces to be able to also support other, similar features on ARM, for example" * tag 'x86_cache_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits) fs/resctrl: Fix counter auto-assignment on mkdir with mbm_event enabled MAINTAINERS: resctrl: Add myself as reviewer x86/resctrl: Configure mbm_event mode if supported fs/resctrl: Introduce the interface to switch between monitor modes fs/resctrl: Disable BMEC event configuration when mbm_event mode is enabled fs/resctrl: Introduce the interface to modify assignments in a group fs/resctrl: Introduce mbm_L3_assignments to list assignments in a group fs/resctrl: Auto assign counters on mkdir and clean up on group removal fs/resctrl: Introduce mbm_assign_on_mkdir to enable assignments on mkdir fs/resctrl: Provide interface to update the event configurations fs/resctrl: Add event configuration directory under info/L3_MON/ fs/resctrl: Support counter read/reset with mbm_event assignment mode x86/resctrl: Implement resctrl_arch_reset_cntr() and resctrl_arch_cntr_read() x86/resctrl: Refactor resctrl_arch_rmid_read() fs/resctrl: Introduce counter ID read, reset calls in mbm_event mode fs/resctrl: Pass struct rdtgroup instead of individual members fs/resctrl: Add the functionality to unassign MBM events fs/resctrl: Add the functionality to assign MBM events x86,fs/resctrl: Implement resctrl_arch_config_cntr() to assign a counter with ABMC fs/resctrl: Introduce event configuration field in struct mon_evt ...
2025-09-30Input: aw86927 - fix error code in probe()Dan Carpenter1-2/+1
Fix this copy and paste bug. Return "err" instead of PTR_ERR(haptics->regmap). Fixes: 52e06d564ce6 ("Input: aw86927 - add driver for Awinic AW86927") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/aNvMPTnOovdBitdP@stanley.mountain Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2025-09-30Merge tag 'x86_cpu_for_v6.18_rc1' of ↵Linus Torvalds14-50/+329
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpuid updates from Borislav Petkov: - Make UMIP instruction detection more robust - Correct and cleanup AMD CPU topology detection; document the relevant CPUID leaves topology parsing precedence on AMD - Add support for running the kernel as guest on FreeBSD's Bhyve hypervisor - Cleanups and improvements * tag 'x86_cpu_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/umip: Fix decoding of register forms of 0F 01 (SGDT and SIDT aliases) x86/umip: Check that the instruction opcode is at least two bytes Documentation/x86/topology: Detail CPUID leaves used for topology enumeration x86/cpu/topology: Define AMD64_CPUID_EXT_FEAT MSR x86/cpu/topology: Check for X86_FEATURE_XTOPOLOGY instead of passing has_xtopology x86/cpu/cacheinfo: Simplify cacheinfo_amd_init_llc_id() using _cpuid4_info x86/cpu: Rename and move CPU model entry for Diamond Rapids x86/cpu: Detect FreeBSD Bhyve hypervisor
2025-09-30NFS: add basic STATX_DIOALIGN and STATX_DIO_READ_ALIGN supportMike Snitzer1-0/+15
NFS doesn't have DIO alignment constraints, so have NFS respond with accommodating DIO alignment attributes (rather than plumb in GETATTR support for STATX_DIOALIGN and STATX_DIO_READ_ALIGN). The most coarse-grained dio_offset_align is the most accommodating (e.g. PAGE_SIZE, in future larger may be supported). Now that NFS has support, NFS reexport will now handle unaligned DIO (NFSD's NFSD_IO_DIRECT support requires the underlying filesystem support STATX_DIOALIGN and/or STATX_DIO_READ_ALIGN). Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-09-30nfs/localio: add tracepoints for misaligned DIO READ and WRITE supportMike Snitzer5-13/+90
Add nfs_local_dio_class and use it to create nfs_local_dio_read, nfs_local_dio_write and nfs_local_dio_misaligned trace events. These trace events show how NFS LOCALIO splits a given misaligned IO into a mix of misaligned head and/or tail extents and a DIO-aligned middle extent. The misaligned head and/or tail extents are issued using buffered IO and the DIO-aligned middle is issued using O_DIRECT. This combination of trace events is useful for LOCALIO DIO READs: echo 1 > /sys/kernel/tracing/events/nfs/nfs_local_dio_read/enable echo 1 > /sys/kernel/tracing/events/nfs/nfs_local_dio_misaligned/enable echo 1 > /sys/kernel/tracing/events/nfs/nfs_initiate_read/enable echo 1 > /sys/kernel/tracing/events/nfs/nfs_readpage_done/enable echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable This combination of trace events is useful for LOCALIO DIO WRITEs: echo 1 > /sys/kernel/tracing/events/nfs/nfs_local_dio_write/enable echo 1 > /sys/kernel/tracing/events/nfs/nfs_local_dio_misaligned/enable echo 1 > /sys/kernel/tracing/events/nfs/nfs_initiate_write/enable echo 1 > /sys/kernel/tracing/events/nfs/nfs_writeback_done/enable echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-09-30nfs/localio: add proper O_DIRECT support for READ and WRITEMike Snitzer1-47/+202
Because the NFS client will already happily handle misaligned O_DIRECT IO (by sending it out to NFSD via RPC) this commit's new capabilities are for the benefit of LOCALIO. LOCALIO will make a best effort to transform misaligned IO to DIO-aligned extents when possible. LOCALIO's READ and WRITE DIO that is misaligned will be split into as many as 3 component IOs (@start, @middle and @end) as needed -- IFF the @middle extent is verified to be DIO-aligned, and then the @start and/or @end are misaligned (due to each being a partial page). Otherwise if the @middle isn't DIO-aligned the code will fall back to issuing only a single contiguous buffered IO. The @middle is only DIO-aligned if both the memory and on-disk offsets for the IO are aligned relative to the underlying local filesystem's block device limits (@dma_alignment and @logical_block_size respectively). The misaligned @start and/or @end extents are issued using buffered IO and the DIO-aligned @middle is issued using O_DIRECT. The @start and @end IOs are issued first using buffered IO with IOCB_SYNC and then the @middle is issued last using direct IO with async completion (AIO). This out of order IO completion means that LOCALIO's IO completion code (nfs_local_read_done and nfs_local_write_done) is only called for the IO's last associated iov_iter completion. And in the case of DIO-aligned @middle it completes last using AIO. nfs_local_pgio_done() is updated to handle piece-wise partial completion of each iov_iter. This implementation for LOCALIO's misaligned DIO handling uses 3 iov_iter that share the same backing pages in their bio_vecs (so unfortunately 'struct nfs_local_kiocb' has 3 instead of only 1). [Reducing LOCALIO's per-IO (struct nfs_local_kiocb) memory use can be explored in the future.
One logical progression to improve this code, and eliminate explicit loops over up to 3 iov_iter, is by extending 'struct iov_iter' to support iov_iter_clone() and iov_iter_chain() interfaces that are comparable to what 'struct bio' is able to support in the block layer. But even that wouldn't avoid the need to allocate/use up to 3 iov_iter] Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
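The @start/@middle/@end split described in this commit can be sketched in plain userspace C. This is a simplified illustration under stated assumptions, not the LOCALIO code: it checks only the file offset against a single power-of-two alignment, it omits the memory (@dma_alignment) check, and the struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type; lengths are in bytes. */
struct ex_dio_split {
	uint64_t start_len;  /* misaligned head, would use buffered IO */
	uint64_t middle_off; /* file offset of the aligned middle */
	uint64_t middle_len; /* aligned middle, would use O_DIRECT */
	uint64_t end_len;    /* misaligned tail, would use buffered IO */
	bool use_dio;        /* false: one contiguous buffered IO */
};

/*
 * Split [offset, offset + len) around boundaries that are multiples of
 * align. Assumes offset + len + align does not overflow 64 bits.
 */
static inline struct ex_dio_split
ex_split_misaligned_io(uint64_t offset, uint64_t len, uint64_t align)
{
	struct ex_dio_split s = { 0 };
	uint64_t end = offset + len;
	uint64_t mid_start, mid_end;

	/* Fallback result: the whole request as a single buffered IO. */
	s.start_len = len;

	/* Reject zero or non-power-of-two alignments. */
	if (align == 0 || (align & (align - 1)) != 0)
		return s;

	mid_start = (offset + align - 1) & ~(align - 1); /* round up */
	mid_end = end & ~(align - 1);                    /* round down */

	/* No aligned middle extent exists: keep the buffered fallback. */
	if (mid_start >= mid_end)
		return s;

	s.start_len = mid_start - offset;
	s.middle_off = mid_start;
	s.middle_len = mid_end - mid_start;
	s.end_len = end - mid_end;
	s.use_dio = true;
	return s;
}
```

For example, a 10000 byte request at offset 1000 with a 4096 byte alignment yields a 3096 byte head, a 4096 byte middle at offset 4096, and a 2808 byte tail, while a 200 byte request at offset 100 has no aligned middle and stays a single buffered IO.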
2025-09-30nfs/localio: refactor iocb initializationMike Snitzer1-38/+55
The goal of this commit's various refactoring is to have LOCALIO's per IO initialization occur in process context so that we don't get into a situation where IO fails to be issued from workqueue (e.g. due to lack of memory, etc). Better to have LOCALIO's iocb initialization fail early. There isn't an immediate need but this commit makes it possible for LOCALIO to fall back to NFS pagelist code in process context to allow for immediate retry over RPC. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-09-30nfs/localio: refactor iocb and iov_iter_bvec initializationMike Snitzer1-37/+33
nfs_local_iter_init() is updated to follow the same pattern for initializing LOCALIO's iov_iter_bvec as was established by nfsd_iter_read(). Other LOCALIO iocb initialization refactoring in this commit offers incremental cleanup that will be taken further by the next commit. No functional change. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-09-30scsi: ufs: core: Include UTP error in INT_FATAL_ERRORSHoyoung Seo1-1/+3
When a UTP error occurs in isolation, UFS is not currently recoverable. This is because the UTP error is not considered fatal in the error handling code, leading to either an I/O timeout or an OCS error. Add the UTP error flag to INT_FATAL_ERRORS so the controller will be reset in this situation. sd 0:0:0:0: [sda] tag#38 UNKNOWN(0x2003) Result: hostbyte=0x07 driverbyte=DRIVER_OK cmd_age=0s sd 0:0:0:0: [sda] tag#38 CDB: opcode=0x28 28 00 00 51 24 e2 00 00 08 00 I/O error, dev sda, sector 42542864 op 0x0:(READ) flags 0x80700 phys_seg 8 prio class 2 OCS error from controller = 9 for tag 39 pa_err[1] = 0x80000010 at 2667224756 us pa_err: total cnt=2 dl_err[0] = 0x80000002 at 2667148060 us dl_err[1] = 0x80002000 at 2667282844 us No record of nl_err No record of tl_err No record of dme_err No record of auto_hibern8_err fatal_err[0] = 0x804 at 2667282836 us --------------------------------------------------- REGISTER --------------------------------------------------- NAME OFFSET VALUE STD HCI SFR 0xfffffff0 0x0 AHIT 0x18 0x814 INTERRUPT STATUS 0x20 0x1000 INTERRUPT ENABLE 0x24 0x70ef5 [mkp: commit desc] Signed-off-by: Hoyoung Seo <hy50.seo@samsung.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Message-Id: <20250930061428.617955-1-hy50.seo@samsung.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-09-30nfs/localio: avoid issuing misaligned IO using O_DIRECTMike Snitzer3-10/+68
Add nfsd_file_dio_alignment and use it to avoid issuing misaligned IO using O_DIRECT. Any misaligned DIO falls back to using buffered IO. Because misaligned DIO is now handled safely, remove the nfs modparam 'localio_O_DIRECT_semantics' that was added to require users to opt in to the requirement that all O_DIRECT be properly DIO-aligned. Also, introduce nfs_iov_iter_aligned_bvec() which is a variant of iov_iter_aligned_bvec() that also verifies the offset associated with an iov_iter is DIO-aligned. NOTE: in a parallel effort, iov_iter_aligned_bvec() is being removed along with iov_iter_is_aligned(). Lastly, add pr_info_ratelimited if the underlying filesystem returns -EINVAL because it was made to try O_DIRECT for IO that is not DIO-aligned (shouldn't happen, so it's best to be louder if it does). Fixes: 3feec68563d ("nfs/localio: add direct IO enablement with sync and async IO support") Signed-off-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-09-30nfs/localio: make trace_nfs_local_open_fh more usefulMike Snitzer2-5/+6
Always trigger trace event when LOCALIO opens a file. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-09-30NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN supportMike Snitzer5-0/+91
Use STATX_DIOALIGN and STATX_DIO_READ_ALIGN to get DIO alignment attributes from the underlying filesystem and store them in the associated nfsd_file. This is done when the nfsd_file is first opened for each regular file. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neil@brown.name> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-09-30sunrpc: unexport rpc_malloc() and rpc_free()Jeff Layton1-2/+0
These are not used outside of sunrpc code. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-09-30Merge tag 'x86_bugs_for_v6.18_rc1' of ↵Linus Torvalds4-275/+214
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mitigation updates from Borislav Petkov: - Add VMSCAPE to the attack vector controls infrastructure - A bunch of the usual cleanups and fixlets, some of them resulting from fuzzing the different mitigation options * tag 'x86_bugs_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/bugs: Report correct retbleed mitigation status x86/bugs: Fix reporting of LFENCE retpoline x86/bugs: Fix spectre_v2 forcing x86/bugs: Remove uses of cpu_mitigations_off() x86/bugs: Simplify SSB cmdline parsing x86/bugs: Use early_param() for spectre_v2 x86/bugs: Use early_param() for spectre_v2_user x86/bugs: Add attack vector controls for VMSCAPE x86/its: Move ITS indirect branch thunks to .text..__x86.indirect_thunk
2025-09-30scsi: ufs: sysfs: Make HID attributes visibleDaniel Lee3-1/+4
Call sysfs_update_group() after reading the device descriptor to ensure the HID sysfs attributes are visible when the feature is supported. Fixes: ae7795a8c258 ("scsi: ufs: core: Add HID support") Signed-off-by: Daniel Lee <chullee@google.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-09-30Merge tag 'ras_core_for_v6.18_rc1' of ↵Linus Torvalds6-281/+236
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 RAS updates from Borislav Petkov: - Unify and refactor the MCA arch side and better separate code - Cleanup and simplify the AMD RAS side, unify code, drop unused stuff * tag 'ras_core_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Add a clear_bank() helper x86/mce: Move machine_check_poll() status checks to helper functions x86/mce: Separate global and per-CPU quirks x86/mce: Do 'UNKNOWN' vendor check early x86/mce: Define BSP-only SMCA init x86/mce: Define BSP-only init x86/mce: Set CR4.MCE last during init x86/mce: Remove __mcheck_cpu_init_early() x86/mce: Cleanup bank processing on init x86/mce/amd: Put list_head in threshold_bank x86/mce/amd: Remove smca_banks_map x86/mce/amd: Remove return value for mce_threshold_{create,remove}_device() x86/mce/amd: Rename threshold restart function