summaryrefslogtreecommitdiffstats
path: root/arch/s390/include/asm
AgeCommit message (Collapse)AuthorLines
2026-03-03s390/stackleak: Fix __stackleak_poison() inline assembly constraintHeiko Carstens-1/+1
The __stackleak_poison() inline assembly comes with a "count" operand where the "d" constraint is used. "count" is used with the exrl instruction and "d" means that the compiler may allocate any register from 0 to 15. If the compiler would allocate register 0 then the exrl instruction would not or the value of "count" into the executed instruction - resulting in a stackframe which is only partially poisoned. Use the correct "a" constraint, which excludes register 0 from register allocation. Fixes: 2a405f6bb3a5 ("s390/stackleak: provide fast __stackleak_poison() implementation") Cc: stable@vger.kernel.org Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Vasily Gorbik <gor@linux.ibm.com> Link: https://lore.kernel.org/r/20260302133500.1560531-4-hca@linux.ibm.com Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2026-02-25s390/idle: Remove psw_idle() prototypeHeiko Carstens-2/+0
psw_idle() does not exist anymore. Remove its prototype. Reviewed-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2026-02-25s390/idle: Inline update_timer_idle()Heiko Carstens-1/+36
Inline update_timer_idle() again to avoid an extra function call. This way the generated code is close to old assembler version again. Reviewed-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2026-02-25s390/idle: Fix cpu idle exit cpu time accountingHeiko Carstens-0/+1
With the conversion to generic entry [1] cpu idle exit cpu time accounting was converted from assembly to C. This introduced an reversed order of cpu time accounting. On cpu idle exit the current accounting happens with the following call chain: -> do_io_irq()/do_ext_irq() -> irq_enter_rcu() -> account_hardirq_enter() -> vtime_account_irq() -> vtime_account_kernel() vtime_account_kernel() accounts the passed cpu time since last_update_timer as system time, and updates last_update_timer to the current cpu timer value. However the subsequent call of -> account_idle_time_irq() will incorrectly subtract passed cpu time from timer_idle_enter to the updated last_update_timer value from system_timer. Then last_update_timer is updated to a sys_enter_timer, which means that last_update_timer goes back in time. Subsequently account_hardirq_exit() will account too much cpu time as hardirq time. The sum of all accounted cpu times is still correct, however some cpu time which was previously accounted as system time is now accounted as hardirq time, plus there is the oddity that last_update_timer goes back in time. Restore previous behavior by extracting cpu time accounting code from account_idle_time_irq() into a new update_timer_idle() function and call it before irq_enter_rcu(). Fixes: 56e62a737028 ("s390: convert to generic entry") [1] Reviewed-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2026-02-21Convert 'alloc_obj' family to use the new default GFP_KERNEL argumentLinus Torvalds-1/+1
This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21treewide: Replace kmalloc with kmalloc_obj for non-scalar typesKees Cook-3/+3
This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-20Merge tag 's390-7.0-2' of ↵Linus Torvalds-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Heiko Carstens: - Make KEXEC_SIG available again for CONFIG_MODULES=n - The s390 topology code used to call rebuild_sched_domains() before common code scheduling domains were setup. This was silently ignored by common code, but now results in a warning. Address by avoiding the early call - Convert debug area lock from spinlock to raw spinlock to address lockdep warnings - The recent 3490 tape device driver rework resulted in a different device driver name, which is visible via sysfs for user space. This breaks at least one user space application. Change the device driver name back to its old name to fix this * tag 's390-7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/tape: Fix device driver name s390/debug: Convert debug area lock from a spinlock to a raw spinlock s390/smp: Avoid calling rebuild_sched_domains() early s390/kexec: Make KEXEC_SIG available when CONFIG_MODULES=n
2026-02-18s390/debug: Convert debug area lock from a spinlock to a raw spinlockBenjamin Block-2/+2
With PREEMPT_RT as potential configuration option, spinlock_t is now considered as a sleeping lock, and thus might cause issues when used in an atomic context. But even with PREEMPT_RT as potential configuration option, raw_spinlock_t remains as a true spinning lock/atomic context. This creates potential issues with the s390 debug/tracing feature. The functions to trace errors are called in various contexts, including under lock of raw_spinlock_t, and thus the used spinlock_t in each debug area is in violation of the locking semantics. Here are two examples involving failing PCI Read accesses that are traced while holding `pci_lock` in `drivers/pci/access.c`: ============================= [ BUG: Invalid wait context ] 6.19.0-devel #18 Not tainted ----------------------------- bash/3833 is trying to lock: 0000027790baee30 (&rc->lock){-.-.}-{3:3}, at: debug_event_common+0xfc/0x300 other info that might help us debug this: context-{5:5} 5 locks held by bash/3833: #0: 0000027efbb29450 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x7c/0xf0 #1: 00000277f0504a90 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x13e/0x260 #2: 00000277beed8c18 (kn->active#339){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x164/0x260 #3: 00000277e9859190 (&dev->mutex){....}-{4:4}, at: pci_dev_lock+0x2e/0x40 #4: 00000383068a7708 (pci_lock){....}-{2:2}, at: pci_bus_read_config_dword+0x4a/0xb0 stack backtrace: CPU: 6 UID: 0 PID: 3833 Comm: bash Kdump: loaded Not tainted 6.19.0-devel #18 PREEMPTLAZY Hardware name: IBM 9175 ME1 701 (LPAR) Call Trace: [<00000383048afec2>] dump_stack_lvl+0xa2/0xe8 [<00000383049ba166>] __lock_acquire+0x816/0x1660 [<00000383049bb1fa>] lock_acquire+0x24a/0x370 [<00000383059e3860>] _raw_spin_lock_irqsave+0x70/0xc0 [<00000383048bbb6c>] debug_event_common+0xfc/0x300 [<0000038304900b0a>] __zpci_load+0x17a/0x1f0 [<00000383048fad88>] pci_read+0x88/0xd0 [<00000383054cbce0>] pci_bus_read_config_dword+0x70/0xb0 [<00000383054d55e4>] pci_dev_wait+0x174/0x290 [<00000383054d5a3e>] __pci_reset_function_locked+0xfe/0x170 [<00000383054d9b30>] pci_reset_function+0xd0/0x100 [<00000383054ee21a>] reset_store+0x5a/0x80 [<0000038304e98758>] kernfs_fop_write_iter+0x1e8/0x260 [<0000038304d995da>] new_sync_write+0x13a/0x180 [<0000038304d9c5d0>] vfs_write+0x200/0x330 [<0000038304d9c88c>] ksys_write+0x7c/0xf0 [<00000383059cfa80>] __do_syscall+0x210/0x500 [<00000383059e4c06>] system_call+0x6e/0x90 INFO: lockdep is turned off. ============================= [ BUG: Invalid wait context ] 6.19.0-devel #3 Not tainted ----------------------------- bash/6861 is trying to lock: 0000009da05c7430 (&rc->lock){-.-.}-{3:3}, at: debug_event_common+0xfc/0x300 other info that might help us debug this: context-{5:5} 5 locks held by bash/6861: #0: 000000acff404450 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x7c/0xf0 #1: 000000acff41c490 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x13e/0x260 #2: 0000009da36937d8 (kn->active#75){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x164/0x260 #3: 0000009dd15250d0 (&zdev->state_lock){+.+.}-{4:4}, at: enable_slot+0x2e/0xc0 #4: 000001a19682f708 (pci_lock){....}-{2:2}, at: pci_bus_read_config_byte+0x42/0xa0 stack backtrace: CPU: 16 UID: 0 PID: 6861 Comm: bash Kdump: loaded Not tainted 6.19.0-devel #3 PREEMPTLAZY Hardware name: IBM 9175 ME1 701 (LPAR) Call Trace: [<000001a194837ec2>] dump_stack_lvl+0xa2/0xe8 [<000001a194942166>] __lock_acquire+0x816/0x1660 [<000001a1949431fa>] lock_acquire+0x24a/0x370 [<000001a19596b810>] _raw_spin_lock_irqsave+0x70/0xc0 [<000001a194843b6c>] debug_event_common+0xfc/0x300 [<000001a194888b0a>] __zpci_load+0x17a/0x1f0 [<000001a194882d88>] pci_read+0x88/0xd0 [<000001a195453b88>] pci_bus_read_config_byte+0x68/0xa0 [<000001a195457bc2>] pci_setup_device+0x62/0xad0 [<000001a195458e70>] pci_scan_single_device+0x90/0xe0 [<000001a19488a0f6>] zpci_bus_scan_device+0x46/0x80 [<000001a19547f958>] enable_slot+0x98/0xc0 [<000001a19547f134>] power_write_file+0xc4/0x110 [<000001a194e20758>] kernfs_fop_write_iter+0x1e8/0x260 [<000001a194d215da>] new_sync_write+0x13a/0x180 [<000001a194d245d0>] vfs_write+0x200/0x330 [<000001a194d2488c>] ksys_write+0x7c/0xf0 [<000001a195957a30>] __do_syscall+0x210/0x500 [<000001a19596cbb6>] system_call+0x6e/0x90 INFO: lockdep is turned off. Since it is desired to keep it possible to create trace records in most situations, including this particular case (failing PCI config space accesses are relevant), convert the used spinlock_t in `struct debug_info` to raw_spinlock_t. The impact is small, as the debug area lock only protects bounded memory access without external dependencies, apart from one function debug_set_size() where kfree() is implicitly called with the lock held. Move debug_info_free() out of this lock, to keep remove this external dependency. Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Benjamin Block <bblock@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2026-02-13Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds-422/+72
Pull KVM updates from Paolo Bonzini: "Loongarch: - Add more CPUCFG mask bits - Improve feature detection - Add lazy load support for FPU and binary translation (LBT) register state - Fix return value for memory reads from and writes to in-kernel devices - Add support for detecting preemption from within a guest - Add KVM steal time test case to tools/selftests ARM: - Add support for FEAT_IDST, allowing ID registers that are not implemented to be reported as a normal trap rather than as an UNDEF exception - Add sanitisation of the VTCR_EL2 register, fixing a number of UXN/PXN/XN bugs in the process - Full handling of RESx bits, instead of only RES0, and resulting in SCTLR_EL2 being added to the list of sanitised registers - More pKVM fixes for features that are not supposed to be exposed to guests - Make sure that MTE being disabled on the pKVM host doesn't give it the ability to attack the hypervisor - Allow pKVM's host stage-2 mappings to use the Force Write Back version of the memory attributes by using the "pass-through' encoding - Fix trapping of ICC_DIR_EL1 on GICv5 hosts emulating GICv3 for the guest - Preliminary work for guest GICv5 support - A bunch of debugfs fixes, removing pointless custom iterators stored in guest data structures - A small set of FPSIMD cleanups - Selftest fixes addressing the incorrect alignment of page allocation - Other assorted low-impact fixes and spelling fixes RISC-V: - Fixes for issues discoverd by KVM API fuzzing in kvm_riscv_aia_imsic_has_attr(), kvm_riscv_aia_imsic_rw_attr(), and kvm_riscv_vcpu_aia_imsic_update() - Allow Zalasr, Zilsd and Zclsd extensions for Guest/VM - Transparent huge page support for hypervisor page tables - Adjust the number of available guest irq files based on MMIO register sizes found in the device tree or the ACPI tables - Add RISC-V specific paging modes to KVM selftests - Detect paging mode at runtime for selftests s390: - Performance improvement for vSIE (aka nested virtualization) - Completely new memory management. s390 was a special snowflake that enlisted help from the architecture's page table management to build hypervisor page tables, in particular enabling sharing the last level of page tables. This however was a lot of code (~3K lines) in order to support KVM, and also blocked several features. The biggest advantages is that the page size of userspace is completely independent of the page size used by the guest: userspace can mix normal pages, THPs and hugetlbfs as it sees fit, and in fact transparent hugepages were not possible before. It's also now possible to have nested guests and guests with huge pages running on the same host - Maintainership change for s390 vfio-pci - Small quality of life improvement for protected guests x86: - Add support for giving the guest full ownership of PMU hardware (contexted switched around the fastpath run loop) and allowing direct access to data MSRs and PMCs (restricted by the vPMU model). KVM still intercepts access to control registers, e.g. to enforce event filtering and to prevent the guest from profiling sensitive host state. This is more accurate, since it has no risk of contention and thus dropped events, and also has significantly less overhead. For more information, see the commit message for merge commit bf2c3138ae36 ("Merge tag 'kvm-x86-pmu-6.20' ...") - Disallow changing the virtual CPU model if L2 is active, for all the same reasons KVM disallows change the model after the first KVM_RUN - Fix a bug where KVM would incorrectly reject host accesses to PV MSRs when running with KVM_CAP_ENFORCE_PV_FEATURE_CPUID enabled, even if those were advertised as supported to userspace, - Fix a bug with protected guest state (SEV-ES/SNP and TDX) VMs, where KVM would attempt to read CR3 configuring an async #PF entry - Fail the build if EXPORT_SYMBOL_GPL or EXPORT_SYMBOL is used in KVM (for x86 only) to enforce usage of EXPORT_SYMBOL_FOR_KVM_INTERNAL. Only a few exports that are intended for external usage, and those are allowed explicitly - When checking nested events after a vCPU is unblocked, ignore -EBUSY instead of WARNing. Userspace can sometimes put the vCPU into what should be an impossible state, and spurious exit to userspace on -EBUSY does not really do anything to solve the issue - Also throw in the towel and drop the WARN on INIT/SIPI being blocked when vCPU is in Wait-For-SIPI, which also resulted in playing whack-a-mole with syzkaller stuffing architecturally impossible states into KVM - Add support for new Intel instructions that don't require anything beyond enumerating feature flags to userspace - Grab SRCU when reading PDPTRs in KVM_GET_SREGS2 - Add WARNs to guard against modifying KVM's CPU caps outside of the intended setup flow, as nested VMX in particular is sensitive to unexpected changes in KVM's golden configuration - Add a quirk to allow userspace to opt-in to actually suppress EOI broadcasts when the suppression feature is enabled by the guest (currently limited to split IRQCHIP, i.e. userspace I/O APIC). Sadly, simply fixing KVM to honor Suppress EOI Broadcasts isn't an option as some userspaces have come to rely on KVM's buggy behavior (KVM advertises Supress EOI Broadcast irrespective of whether or not userspace I/O APIC supports Directed EOIs) - Clean up KVM's handling of marking mapped vCPU pages dirty - Drop a pile of *ancient* sanity checks hidden behind in KVM's unused ASSERT() macro, most of which could be trivially triggered by the guest and/or user, and all of which were useless - Fold "struct dest_map" into its sole user, "struct rtc_status", to make it more obvious what the weird parameter is used for, and to allow fropping these RTC shenanigans if CONFIG_KVM_IOAPIC=n - Bury all of ioapic.h, i8254.h and related ioctls (including KVM_CREATE_IRQCHIP) behind CONFIG_KVM_IOAPIC=y - Add a regression test for recent APICv update fixes - Handle "hardware APIC ISR", a.k.a. SVI, updates in kvm_apic_update_apicv() to consolidate the updates, and to co-locate SVI updates with the updates for KVM's own cache of ISR information - Drop a dead function declaration - Minor cleanups x86 (Intel): - Rework KVM's handling of VMCS updates while L2 is active to temporarily switch to vmcs01 instead of deferring the update until the next nested VM-Exit. The deferred updates approach directly contributed to several bugs, was proving to be a maintenance burden due to the difficulty in auditing the correctness of deferred updates, and was polluting "struct nested_vmx" with a growing pile of booleans - Fix an SGX bug where KVM would incorrectly try to handle EPCM page faults, and instead always reflect them into the guest. Since KVM doesn't shadow EPCM entries, EPCM violations cannot be due to KVM interference and can't be resolved by KVM - Fix a bug where KVM would register its posted interrupt wakeup handler even if loading kvm-intel.ko ultimately failed - Disallow access to vmcb12 fields that aren't fully supported, mostly to avoid weirdness and complexity for FRED and other features, where KVM wants enable VMCS shadowing for fields that conditionally exist - Print out the "bad" offsets and values if kvm-intel.ko refuses to load (or refuses to online a CPU) due to a VMCS config mismatch x86 (AMD): - Drop a user-triggerable WARN on nested_svm_load_cr3() failure - Add support for virtualizing ERAPS. Note, correct virtualization of ERAPS relies on an upcoming, publicly announced change in the APM to reduce the set of conditions where hardware (i.e. KVM) *must* flush the RAP - Ignore nSVM intercepts for instructions that are not supported according to L1's virtual CPU model - Add support for expedited writes to the fast MMIO bus, a la VMX's fastpath for EPT Misconfig - Don't set GIF when clearing EFER.SVME, as GIF exists independently of SVM, and allow userspace to restore nested state with GIF=0 - Treat exit_code as an unsigned 64-bit value through all of KVM - Add support for fetching SNP certificates from userspace - Fix a bug where KVM would use vmcb02 instead of vmcb01 when emulating VMLOAD or VMSAVE on behalf of L2 - Misc fixes and cleanups x86 selftests: - Add a regression test for TPR<=>CR8 synchronization and IRQ masking - Overhaul selftest's MMU infrastructure to genericize stage-2 MMU support, and extend x86's infrastructure to support EPT and NPT (for L2 guests) - Extend several nested VMX tests to also cover nested SVM - Add a selftest for nested VMLOAD/VMSAVE - Rework the nested dirty log test, originally added as a regression test for PML where KVM logged L2 GPAs instead of L1 GPAs, to improve test coverage and to hopefully make the test easier to understand and maintain guest_memfd: - Remove kvm_gmem_populate()'s preparation tracking and half-baked hugepage handling. SEV/SNP was the only user of the tracking and it can do it via the RMP - Retroactively document and enforce (for SNP) that KVM_SEV_SNP_LAUNCH_UPDATE and KVM_TDX_INIT_MEM_REGION require the source page to be 4KiB aligned, to avoid non-trivial complexity for something that no known VMM seems to be doing and to avoid an API special case for in-place conversion, which simply can't support unaligned sources - When populating guest_memfd memory, GUP the source page in common code and pass the refcounted page to the vendor callback, instead of letting vendor code do the heavy lifting. Doing so avoids a looming deadlock bug with in-place due an AB-BA conflict betwee mmap_lock and guest_memfd's filemap invalidate lock Generic: - Fix a bug where KVM would ignore the vCPU's selected address space when creating a vCPU-specific mapping of guest memory. Actually this bug could not be hit even on x86, the only architecture with multiple address spaces, but it's a bug nevertheless" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (267 commits) KVM: s390: Increase permitted SE header size to 1 MiB MAINTAINERS: Replace backup for s390 vfio-pci KVM: s390: vsie: Fix race in acquire_gmap_shadow() KVM: s390: vsie: Fix race in walk_guest_tables() KVM: s390: Use guest address to mark guest page dirty irqchip/riscv-imsic: Adjust the number of available guest irq files RISC-V: KVM: Transparent huge page support RISC-V: KVM: selftests: Add Zalasr extensions to get-reg-list test RISC-V: KVM: Allow Zalasr extensions for Guest/VM KVM: riscv: selftests: Add riscv vm satp modes KVM: riscv: selftests: add Zilsd and Zclsd extension to get-reg-list test riscv: KVM: allow Zilsd and Zclsd extensions for Guest/VM RISC-V: KVM: Skip IMSIC update if vCPU IMSIC state is not initialized RISC-V: KVM: Fix null pointer dereference in kvm_riscv_aia_imsic_rw_attr() RISC-V: KVM: Fix null pointer dereference in kvm_riscv_aia_imsic_has_attr() RISC-V: KVM: Remove unnecessary 'ret' assignment KVM: s390: Add explicit padding to struct kvm_s390_keyop KVM: LoongArch: selftests: Add steal time test case LoongArch: KVM: Add paravirt vcpu_is_preempted() support in guest side LoongArch: KVM: Add paravirt preempt feature in hypervisor side ...
2026-02-12Merge tag 'mm-nonmm-stable-2026-02-12-10-48' of ↵Linus Torvalds-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" saves disk space by teaching ocfs2 to reclaim suballocator block group space (Heming Zhao) - "Add ARRAY_END(), and use it to fix off-by-one bugs" adds the ARRAY_END() macro and uses it in various places (Alejandro Colomar) - "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" makes the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the page size (Pnina Feder) - "kallsyms: Prevent invalid access when showing module buildid" cleans up kallsyms code related to module buildid and fixes an invalid access crash when printing backtraces (Petr Mladek) - "Address page fault in ima_restore_measurement_list()" fixes a kexec-related crash that can occur when booting the second-stage kernel on x86 (Harshit Mogalapalli) - "kho: ABI headers and Documentation updates" updates the kexec handover ABI documentation (Mike Rapoport) - "Align atomic storage" adds the __aligned attribute to atomic_t and atomic64_t definitions to get natural alignment of both types on csky, m68k, microblaze, nios2, openrisc and sh (Finn Thain) - "kho: clean up page initialization logic" simplifies the page initialization logic in kho_restore_page() (Pratyush Yadav) - "Unload linux/kernel.h" moves several things out of kernel.h and into more appropriate places (Yury Norov) - "don't abuse task_struct.group_leader" removes the usage of ->group_leader when it is "obviously unnecessary" (Oleg Nesterov) - "list private v2 & luo flb" adds some infrastructure improvements to the live update orchestrator (Pasha Tatashin) * tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (107 commits) watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency procfs: fix missing RCU protection when reading real_parent in do_task_stat() watchdog/softlockup: fix sample ring index wrap in need_counting_irqs() kcsan, compiler_types: avoid duplicate type issues in BPF Type Format kho: fix doc for kho_restore_pages() tests/liveupdate: add in-kernel liveupdate test liveupdate: luo_flb: introduce File-Lifecycle-Bound global state liveupdate: luo_file: Use private list list: add kunit test for private list primitives list: add primitives for private list manipulations delayacct: fix uapi timespec64 definition panic: add panic_force_cpu= parameter to redirect panic to a specific CPU netclassid: use thread_group_leader(p) in update_classid_task() RDMA/umem: don't abuse current->group_leader drm/pan*: don't abuse current->group_leader drm/amd: kill the outdated "Only the pthreads threading model is supported" checks drm/amdgpu: don't abuse current->group_leader android/binder: use same_thread_group(proc->tsk, current) in binder_mmap() android/binder: don't abuse current->group_leader kho: skip memoryless NUMA nodes when reserving scratch areas ...
2026-02-12Merge tag 'mm-stable-2026-02-11-19-22' of ↵Linus Torvalds-5/+2
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "powerpc/64s: do not re-activate batched TLB flush" makes arch_{enter|leave}_lazy_mmu_mode() nest properly (Alexander Gordeev) It adds a generic enter/leave layer and switches architectures to use it. Various hacks were removed in the process. - "zram: introduce compressed data writeback" implements data compression for zram writeback (Richard Chang and Sergey Senozhatsky) - "mm: folio_zero_user: clear page ranges" adds clearing of contiguous page ranges for hugepages. Large improvements during demand faulting are demonstrated (David Hildenbrand) - "memcg cleanups" tidies up some memcg code (Chen Ridong) - "mm/damon: introduce {,max_}nr_snapshots and tracepoint for damos stats" improves DAMOS stat's provided information, deterministic control, and readability (SeongJae Park) - "selftests/mm: hugetlb cgroup charging: robustness fixes" fixes a few issues in the hugetlb cgroup charging selftests (Li Wang) - "Fix va_high_addr_switch.sh test failure - again" addresses several issues in the va_high_addr_switch test (Chunyu Hu) - "mm/damon/tests/core-kunit: extend existing test scenarios" improves the KUnit test coverage for DAMON (Shu Anzai) - "mm/khugepaged: fix dirty page handling for MADV_COLLAPSE" fixes a glitch in khugepaged which was causing madvise(MADV_COLLAPSE) to transiently return -EAGAIN (Shivank Garg) - "arch, mm: consolidate hugetlb early reservation" reworks and consolidates a pile of straggly code related to reservation of hugetlb memory from bootmem and creation of CMA areas for hugetlb (Mike Rapoport) - "mm: clean up anon_vma implementation" cleans up the anon_vma implementation in various ways (Lorenzo Stoakes) - "tweaks for __alloc_pages_slowpath()" does a little streamlining of the page allocator's slowpath code (Vlastimil Babka) - "memcg: separate private and public ID namespaces" cleans up the memcg ID code and prevents the internal-only private IDs from being exposed to userspace (Shakeel Butt) - "mm: hugetlb: allocate frozen gigantic folio" cleans up the allocation of frozen folios and avoids some atomic refcount operations (Kefeng Wang) - "mm/damon: advance DAMOS-based LRU sorting" improves DAMOS's movement of memory betewwn the active and inactive LRUs and adds auto-tuning of the ratio-based quotas and of monitoring intervals (SeongJae Park) - "Support page table check on PowerPC" makes CONFIG_PAGE_TABLE_CHECK_ENFORCED work on powerpc (Andrew Donnellan) - "nodemask: align nodes_and{,not} with underlying bitmap ops" makes nodes_and() and nodes_andnot() propagate the return values from the underlying bit operations, enabling some cleanup in calling code (Yury Norov) - "mm/damon: hide kdamond and kdamond_lock from API callers" cleans up some DAMON internal interfaces (SeongJae Park) - "mm/khugepaged: cleanups and scan limit fix" does some cleanup work in khupaged and fixes a scan limit accounting issue (Shivank Garg) - "mm: balloon infrastructure cleanups" goes to town on the balloon infrastructure and its page migration function. Mainly cleanups, also some locking simplification (David Hildenbrand) - "mm/vmscan: add tracepoint and reason for kswapd_failures reset" adds additional tracepoints to the page reclaim code (Jiayuan Chen) - "Replace wq users and add WQ_PERCPU to alloc_workqueue() users" is part of Marco's kernel-wide migration from the legacy workqueue APIs over to the preferred unbound workqueues (Marco Crivellari) - "Various mm kselftests improvements/fixes" provides various unrelated improvements/fixes for the mm kselftests (Kevin Brodsky) - "mm: accelerate gigantic folio allocation" greatly speeds up gigantic folio allocation, mainly by avoiding unnecessary work in pfn_range_valid_contig() (Kefeng Wang) - "selftests/damon: improve leak detection and wss estimation reliability" improves the reliability of two of the DAMON selftests (SeongJae Park) - "mm/damon: cleanup kdamond, damon_call(), damos filter and DAMON_MIN_REGION" does some cleanup work in the core DAMON code (SeongJae Park) - "Docs/mm/damon: update intro, modules, maintainer profile, and misc" performs maintenance work on the DAMON documentation (SeongJae Park) - "mm: add and use vma_assert_stabilised() helper" refactors and cleans up the core VMA code. The main aim here is to be able to use the mmap write lock's lockdep state to perform various assertions regarding the locking which the VMA code requires (Lorenzo Stoakes) - "mm, swap: swap table phase II: unify swapin use" removes some old swap code (swap cache bypassing and swap synchronization) which wasn't working very well. Various other cleanups and simplifications were made. The end result is a 20% speedup in one benchmark (Kairui Song) - "enable PT_RECLAIM on more 64-bit architectures" makes PT_RECLAIM available on 64-bit alpha, loongarch, mips, parisc, and um. Various cleanups were performed along the way (Qi Zheng) * tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (325 commits) mm/memory: handle non-split locks correctly in zap_empty_pte_table() mm: move pte table reclaim code to memory.c mm: make PT_RECLAIM depends on MMU_GATHER_RCU_TABLE_FREE mm: convert __HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config um: mm: enable MMU_GATHER_RCU_TABLE_FREE parisc: mm: enable MMU_GATHER_RCU_TABLE_FREE mips: mm: enable MMU_GATHER_RCU_TABLE_FREE LoongArch: mm: enable MMU_GATHER_RCU_TABLE_FREE alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE mm: change mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h mm/damon/stat: remove __read_mostly from memory_idle_ms_percentiles zsmalloc: make common caches global mm: add SPDX id lines to some mm source files mm/zswap: use %pe to print error pointers mm/vmscan: use %pe to print error pointers mm/readahead: fix typo in comment mm: khugepaged: fix NR_FILE_PAGES and NR_SHMEM in collapse_file() mm: refactor vma_map_pages to use vm_insert_pages mm/damon: unify address range representation with damon_addr_range mm/cma: replace snprintf with strscpy in cma_new_area ...
2026-02-10Merge tag 'perf-core-2026-02-09' of ↵Linus Torvalds-1/+0
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull performance event updates from Ingo Molnar: "x86 PMU driver updates: - Add support for the core PMU for Intel Diamond Rapids (DMR) CPUs (Dapeng Mi) Compared to previous iterations of the Intel PMU code, there's been a lot of changes, which center around three main areas: - Introduce the OFF-MODULE RESPONSE (OMR) facility to replace the Off-Core Response (OCR) facility - New PEBS data source encoding layout - Support the new "RDPMC user disable" feature - Likewise, a large series adds uncore PMU support for Intel Diamond Rapids (DMR) CPUs (Zide Chen) This centers around these four main areas: - DMR may have two Integrated I/O and Memory Hub (IMH) dies, separate from the compute tile (CBB) dies. Each CBB and each IMH die has its own discovery domain. - Unlike prior CPUs that retrieve the global discovery table portal exclusively via PCI or MSR, DMR uses PCI for IMH PMON discovery and MSR for CBB PMON discovery. - DMR introduces several new PMON types: SCA, HAMVF, D2D_ULA, UBR, PCIE4, CRS, CPC, ITC, OTC, CMS, and PCIE6. - IIO free-running counters in DMR are MMIO-based, unlike SPR. - Also add support for Add missing PMON units for Intel Panther Lake, and support Nova Lake (NVL), which largely maps to Panther Lake. (Zide Chen) - KVM integration: Add support for mediated vPMUs (by Kan Liang and Sean Christopherson, with fixes and cleanups by Peter Zijlstra, Sandipan Das and Mingwei Zhang) - Add Intel cstate driver to support for Wildcat Lake (WCL) CPUs, which are a low-power variant of Panther Lake (Zide Chen) - Add core, cstate and MSR PMU support for the Airmont NP Intel CPU (aka MaxLinear Lightning Mountain), which maps to the existing Airmont code (Martin Schiller) Performance enhancements: - Speed up kexec shutdown by avoiding unnecessary cross CPU calls (Jan H. Schönherr) - Fix slow perf_event_task_exit() with LBR callstacks (Namhyung Kim) User-space stack unwinding support: - Various cleanups and refactorings in preparation to generalize the unwinding code for other architectures (Jens Remus) Uprobes updates: - Transition from kmap_atomic to kmap_local_page (Keke Ming) - Fix incorrect lockdep condition in filter_chain() (Breno Leitao) - Fix XOL allocation failure for 32-bit tasks (Oleg Nesterov) Misc fixes and cleanups: - s390: Remove kvm_types.h from Kbuild (Randy Dunlap) - x86/intel/uncore: Convert comma to semicolon (Chen Ni) - x86/uncore: Clean up const mismatch (Greg Kroah-Hartman) - x86/ibs: Fix typo in dc_l2tlb_miss comment (Xiang-Bin Shi)" * tag 'perf-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits) s390: remove kvm_types.h from Kbuild uprobes: Fix incorrect lockdep condition in filter_chain() x86/ibs: Fix typo in dc_l2tlb_miss comment x86/uprobes: Fix XOL allocation failure for 32-bit tasks perf/x86/intel/uncore: Convert comma to semicolon perf/x86/intel: Add support for rdpmc user disable feature perf/x86: Use macros to replace magic numbers in attr_rdpmc perf/x86/intel: Add core PMU support for Novalake perf/x86/intel: Add support for PEBS memory auxiliary info field in NVL perf/x86/intel: Add core PMU support for DMR perf/x86/intel: Add support for PEBS memory auxiliary info field in DMR perf/x86/intel: Support the 4 new OMR MSRs introduced in DMR and NVL perf/core: Fix slow perf_event_task_exit() with LBR callstacks perf/core: Speed up kexec shutdown by avoiding unnecessary cross CPU calls uprobes: use kmap_local_page() for temporary page mappings arm/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol() mips/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol() arm64/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol() riscv/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol() perf/x86/intel/uncore: Add Nova Lake support ...
2026-02-10Merge tag 'v7.0-p1' of ↵Linus Torvalds-1/+7
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto update from Herbert Xu: "API: - Fix race condition in hwrng core by using RCU Algorithms: - Allow authenc(sha224,rfc3686) in fips mode - Add test vectors for authenc(hmac(sha384),cbc(aes)) - Add test vectors for authenc(hmac(sha224),cbc(aes)) - Add test vectors for authenc(hmac(md5),cbc(des3_ede)) - Add lz4 support in hisi_zip - Only allow clear key use during self-test in s390/{phmac,paes} Drivers: - Set rng quality to 900 in airoha - Add gcm(aes) support for AMD/Xilinx Versal device - Allow tfms to share device in hisilicon/trng" * tag 'v7.0-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (100 commits) crypto: img-hash - Use unregister_ahashes in img_{un}register_algs crypto: testmgr - Add test vectors for authenc(hmac(md5),cbc(des3_ede)) crypto: cesa - Simplify return statement in mv_cesa_dequeue_req_locked crypto: testmgr - Add test vectors for authenc(hmac(sha224),cbc(aes)) crypto: testmgr - Add test vectors for authenc(hmac(sha384),cbc(aes)) hwrng: core - use RCU and work_struct to fix race condition crypto: starfive - Fix memory leak in starfive_aes_aead_do_one_req() crypto: xilinx - Fix inconsistant indentation crypto: rng - Use unregister_rngs in register_rngs crypto: atmel - Use unregister_{aeads,ahashes,skciphers} hwrng: optee - simplify OP-TEE context match crypto: ccp - Add sysfs attribute for boot integrity dt-bindings: crypto: atmel,at91sam9g46-sha: add microchip,lan9691-sha dt-bindings: crypto: atmel,at91sam9g46-aes: add microchip,lan9691-aes dt-bindings: crypto: qcom,inline-crypto-engine: document the Milos ICE crypto: caam - fix netdev memory leak in dpaa2_caam_probe crypto: hisilicon/qm - increase wait time for mailbox crypto: hisilicon/qm - obtain the mailbox configuration at one time crypto: hisilicon/qm - remove unnecessary code in qm_mb_write() crypto: hisilicon/qm - move the barrier before writing to the mailbox register ...
2026-02-05s390: remove kvm_types.h from KbuildRandy Dunlap-1/+0
kvm_types.h is mandatory in include/asm-generic/Kbuild so having it in another Kbuild file causes a warning. Remove it from the arch/ Kbuild file to fix the warning. ../scripts/Makefile.asm-headers:39: redundant generic-y found in ../arch/s390/include/asm/Kbuild: kvm_types.h Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260203184204.1329414-1-rdunlap@infradead.org
2026-02-04KVM: S390: Remove PGSTE code from linux/s390 mmClaudio Imbrenda-141/+7
Remove the PGSTE config option. Remove all code from linux/s390 mm that involves PGSTEs. Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
2026-02-04KVM: s390: Remove gmap from s390/mmClaudio Imbrenda-182/+0
Remove the now unused include/asm/gmap.h and mm/gmap.c files. Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
2026-02-04KVM: s390: Switch to new gmapClaudio Imbrenda-69/+14
Switch KVM/s390 to use the new gmap code. Remove includes to <gmap.h> and include "gmap.h" instead; fix all the existing users of the old gmap functions to use the new ones instead. Fix guest storage key access functions to work with the new gmap. Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
2026-02-04KVM: s390: Stop using CONFIG_PGSTEClaudio Imbrenda-3/+3
Switch to using IS_ENABLED(CONFIG_KVM) instead of CONFIG_PGSTE, since the latter will be removed soon. Many CONFIG_PGSTE are left behind, because they will be removed completely in upcoming patches. The ones replaced here are mostly the ones that will stay. Reviewed-by: Steffen Eiden <seiden@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
2026-02-04KVM: s390: Add helper functions for fault handlingClaudio Imbrenda-0/+1
Add some helper functions for handling multiple guest faults at the same time. This will be needed for VSIE, where a nested guest access also needs to access all the page tables that map it. Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
2026-02-04KVM: s390: Enable KVM_GENERIC_MMU_NOTIFIERClaudio Imbrenda-0/+1
Enable KVM_GENERIC_MMU_NOTIFIER, for now with empty placeholder callbacks. Also enable KVM_MMU_LOCKLESS_AGING and define KVM_HAVE_MMU_RWLOCK. Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com> Reviewed-by: Steffen Eiden <seiden@linux.ibm.com> Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
2026-02-04s390/mm: Warn if uv_convert_from_secure_pte() failsClaudio Imbrenda-4/+5
If uv_convert_from_secure_pte() fails, the page becomes unusable by the host. The failure can only occour in case of hardware malfunction or a serious KVM bug. When the unusable page is reused, the system can have issues and hang. Print a warning to aid debugging such unlikely scenarios. Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
2026-02-04KVM: s390: Export two functionsClaudio Imbrenda-0/+2
Export __make_folio_secure() and s390_wiggle_split_folio(), as they will be needed to be used by KVM. Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
2026-02-04KVM: s390: Introduce import_lockClaudio Imbrenda-0/+2
Introduce import_lock to avoid future races when converting pages to secure. Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
2026-02-04KVM: s390: Add gmap_helper_set_unused()Claudio Imbrenda-0/+1
Add gmap_helper_set_unused() to mark userspace ptes as unused. Core mm code will use that information to discard unused pages instead of attempting to swap them. Reviewed-by: Nico Boehr <nrb@linux.ibm.com> Tested-by: Nico Boehr <nrb@linux.ibm.com> Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
2026-02-04s390: Move sske_frame() to a headerClaudio Imbrenda-0/+7
Move the sske_frame() function to asm/pgtable.h, so it can be used in other modules too. Opportunistically convert the .insn opcode specification to the appropriate mnemonic. Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com> Reviewed-by: Steffen Eiden <seiden@linux.ibm.com> Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Reviewed-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
2026-02-04KVM: s390: Add P bit in table entry bitfields, move union vaddressClaudio Imbrenda-2/+30
Add P bit in hardware definition of region 3 and segment table entries. Move union vaddress from kvm/gaccess.c to asm/dat_bits.h Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com> Reviewed-by: Steffen Eiden <seiden@linux.ibm.com> Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
2026-02-04KVM: s390: Refactor pgste lock and unlock functionsClaudio Imbrenda-22/+0
Move the pgste lock and unlock functions back into mm/pgtable.c and duplicate them in mm/gmap_helpers.c to avoid function name collisions later on. Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
2026-01-31kernel.h: include linux/instruction_pointer.h explicitlyYury Norov-0/+1
In preparation for decoupling linux/instruction_pointer.h and linux/kernel.h, include instruction_pointer.h explicitly where needed. Link: https://lkml.kernel.org/r/20260116042510.241009-5-ynorov@nvidia.com Signed-off-by: Yury Norov <ynorov@nvidia.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Andi Shyti <andi.shyti@linux.intel.com> Cc: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jani Nikula <jani.nikula@intel.com> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31s390/pkey: Support new xflag PKEY_XFLAG_NOCLEARKEYHarald Freudenberger-1/+7
Introduce a new xflag PKEY_XFLAG_NOCLEARKEY which when given refuses the conversion of "clear key tokens" to protected key material. Some algorithms (PAES, PHMAC) have the need to construct "clear key tokens" to be used during selftest. But in general these algorithms should only support clear key material for testing purpose. So now the algorithm implementation can signal via xflag PKEY_XFLAG_NOCLEARKEY that a conversion of clear key material to protected key is not acceptable and thus the pkey layer (usually one of the handler modules) refuses clear key material with -EINVAL. Signed-off-by: Harald Freudenberger <freude@linux.ibm.com> Reviewed-by: Holger Dengler <dengler@linux.ibm.com> Reviewed-by: Ingo Franzki <ifranzki@linux.ibm.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2026-01-27s390/Kconfig: Define non-zero ILLEGAL_POINTER_VALUEGerd Bayer-0/+1
Define CONFIG_ILLEGAL_POINTER_VALUE to the eye-catching non-zero value of 0xdead000000000000, consistent with other architectures. Assert at compile-time that the poison pointers that include/linux/poison.h defines based on this illegal pointer are beyond the largest useful virtual addresses. Also, assert at compile-time that the range of poison pointers per include/linux/poison.h (currently a range of less than 0x10000 addresses) does not overlap with the range used for address handles for s390's non-MIO PCI instructions. This enables s390 to track the DMA mappings by the network stack's page_pool that was introduced with [0]. Other functional changes are not intended. Other archictectures have introduced this for various other reasons with commit 5c178472af24 ("riscv: define ILLEGAL_POINTER_VALUE for 64bit") commit f6853eb561fb ("powerpc/64: Define ILLEGAL_POINTER_VALUE for 64-bit") commit bf0c4e047324 ("arm64: kconfig: Move LIST_POISON to a safe value") commit a29815a333c6 ("core, x86: make LIST_POISON less deadly") [0] https://lore.kernel.org/all/20250409-page-pool-track-dma-v9-0-6a9ef2e0cba8@redhat.com/ Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2026-01-27s390/bug: Prevent tail-call optimizationHeiko Carstens-0/+2
For the exception based __WARN_trap() implementation it is technically not necessary to prevent tail-call optimization, however it may be confusing to see warning messages like: WARNING: arch/s390/kernel/setup.c:1017 at foobar+0x2c/0x50, CPU#0: swapper/0/0 together with a disassembly of a different function caused by tail-call optimization for the __WARN_trap() call. Prevent that by adding an empty asm statement. This generates slightly worse code, but should hopefully avoid confusion. With this the output looks like: WARNING: arch/s390/kernel/setup.c:1017 at foobar+0x2c/0x50, CPU#0: swapper/0/0 ... Krnl PSW : 0704c00180000000 000003ffe0119788 (foobar+0x38/0x50) ... Krnl Code: 000003ffe0119776: e3e0f0980024 stg %r14,152(%r15) 000003ffe011977c: c02000b8992a larl %r2,000003ffe182c9d0 *000003ffe0119782: c0e5007270b7 brasl %r14,000003ffe0f678f0 >000003ffe0119788: ebeff0a00004 lmg %r14,%r15,160(%r15) 000003ffe011978e: 07fe bcr 15,%r14 000003ffe0119790: 47000700 bc 0,1792 000003ffe0119794: 0707 bcr 0,%r7 000003ffe0119796: 0707 bcr 0,%r7 Call Trace: [<000003ffe0119788>] foobar+0x38/0x50 [<000003ffe185bc2e>] arch_cpu_finalize_init+0x26/0x60 [<000003ffe185654c>] start_kernel+0x53c/0x5d8 [<000003ffe010002e>] startup_continue+0x2e/0x40 A better solution would be to replace or patch the branch instruction to __WARN_trap() with the monitor call instruction, similar to what is done for x86 [1]. However s390 does not support static_cond_calls(). Therefore use the simple approach for the time being. [1] commit 860238af7a33 ("x86_64/bug: Inline the UD1") Reviewed-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2026-01-27s390/bug: Implement WARN_ONCE()Heiko Carstens-0/+11
This is the s390 variant of commit 11bb4944f014 ("x86/bug: Implement WARN_ONCE()"). Reviewed-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2026-01-27s390/bug: Implement __WARN_printf()Heiko Carstens-6/+57
This is the s390 variant of commit 5b472b6e5bd9 ("x86_64/bug: Implement __WARN_printf()"). See the x86 commit for the general idea; there are only implementation details which are different. With the new exception based __WARN_printf() implementation the generated code for a simple WARN() is simplified. For example: void foo(int a) { WARN(a, "bar"); } Before this change the generated code looks like this: 0000000000000210 <foo>: 210: c0 04 00 00 00 00 jgnop 210 <foo> 216: ec 26 00 06 00 7c cgijne %r2,0,222 <foo+0x12> 21c: c0 f4 00 00 00 00 jg 21c <foo+0xc> 21e: R_390_PC32DBL __s390_indirect_jump_r14+0x2 222: eb ef f0 88 00 24 stmg %r14,%r15,136(%r15) 228: b9 04 00 ef lgr %r14,%r15 22c: e3 f0 ff e8 ff 71 lay %r15,-24(%r15) 232: e3 e0 f0 98 00 24 stg %r14,152(%r15) 238: c0 20 00 00 00 00 larl %r2,238 <foo+0x28> 23a: R_390_PC32DBL .LC48+0x2 23e: c0 e5 00 00 00 00 brasl %r14,23e <foo+0x2e> 240: R_390_PLT32DBL __warn_printk+0x2 244: af 00 00 00 mc 0,0 248: eb ef f0 a0 00 04 lmg %r14,%r15,160(%r15) 24e: c0 f4 00 00 00 00 jg 24e <foo+0x3e> 250: R_390_PC32DBL __s390_indirect_jump_r14+0x2 With this change the generated code looks like this: 0000000000000210 <foo>: 210: c0 04 00 00 00 00 jgnop 210 <foo> 216: ec 26 00 06 00 7c cgijne %r2,0,222 <foo+0x12> 21c: c0 f4 00 00 00 00 jg 21c <foo+0xc> 21e: R_390_PC32DBL __s390_indirect_jump_r14+0x2 222: c0 20 00 00 00 00 larl %r2,222 <foobar+0x12> 224: R_390_PC32DBL __bug_table+0x2 228: c0 f4 00 00 00 00 jg 228 <foobar+0x18> 22a: R_390_PLT32DBL __WARN_trap+0x2 Downside is that the call trace now starts at __WARN_trap(): ------------[ cut here ]------------ bar WARNING: arch/s390/kernel/setup.c:1017 at 0x0, CPU#0: swapper/0/0 ... Krnl PSW : 0704c00180000000 000003ffe0f6a3b4 (__WARN_trap+0x4/0x10) ... Krnl Code: 000003ffe0f6a3ac: 0707 bcr 0,%r7 000003ffe0f6a3ae: 0707 bcr 0,%r7 *000003ffe0f6a3b0: af000001 mc 1,0 >000003ffe0f6a3b4: 07fe bcr 15,%r14 000003ffe0f6a3b6: 47000700 bc 0,1792 000003ffe0f6a3ba: 0707 bcr 0,%r7 000003ffe0f6a3bc: 0707 bcr 0,%r7 000003ffe0f6a3be: 0707 bcr 0,%r7 Call Trace: [<000003ffe0f6a3b4>] __WARN_trap+0x4/0x10 ([<000003ffe185a54c>] start_kernel+0x53c/0x5d8) [<000003ffe010002e>] startup_continue+0x2e/0x40 Which isn't too helpful. This can be addressed by just skipping __WARN_trap(), which will be addressed in a later patch. Reviewed-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2026-01-27s390/traps: Copy monitor code to pt_regsHeiko Carstens-1/+4
In case of a monitor call program check the CPU stores the monitor code to lowcore. Let the program check handler copy it to the pt_regs structure so it can be used by the monitor call exception handler. Instead of increasing the pt_regs size add a union which contains both orig_gpr2 and monitor_code, since orig_gpr2 is not used in case of a program check. Reviewed-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2026-01-27s390/bug: Introduce and use monitor code macroHeiko Carstens-2/+8
The first operand address of the monitor call (mc) instruction is the monitor code. Currently the monitor code is ignored, but this will change. Therefore add and use MONCODE_BUG instead of a hardcoded zero. Reviewed-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2026-01-27s390/bug: Use BUG_FORMAT for DEBUG_BUGVERBOSE_DETAILEDHeiko Carstens-4/+13
This is just the s390 variant of commit 4f1b701f24be ("x86/bug: Use BUG_FORMAT for DEBUG_BUGVERBOSE_DETAILED"). Reviewed-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2026-01-27s390/bug: Convert to inline assembly with input operandsHeiko Carstens-42/+31
Rewrite the bug inline assembly so it uses input operands again instead of pure macro replacements. This more or less reverts the conversion done when 'cond_str' support was added [1]. Reason for this is that the upcoming __WARN_printf() implementation requires an inline assembly with an output operand. At the same time input strings (format specifier and condition string) may contain the special '%' character. As soon as an inline assembly is specified to have input/output operands the '%' has a special meaning: e.g. converting the existing #define __BUG_FLAGS(cond_str, flags) \ asm_inline volatile(__stringify(ASM_BUG_FLAGS(cond_str, flags))); to #define __BUG_FLAGS(cond_str, flags) \ asm_inline volatile(__stringify(ASM_BUG_FLAGS(cond_str, flags))::); will result in a compile error as soon as 'cond_str' contains a '%' character: net/core/neighbour.c: In function ‘neigh_table_init’: ././include/linux/compiler_types.h:546:20: error: invalid 'asm': invalid %-code ... net/core/neighbour.c:1838:17: note: in expansion of macro ‘WARN_ON’ 1838 | WARN_ON(tbl->entry_size % NEIGH_PRIV_ALIGN); | ^~~~~~~ Convert the code, use immediate operands, and also add comments similar to x86 which are emitted to the generated assembly file, which makes debugging much easier. Note: since gcc-8 does not support strings as immediate input operands, guard the new implementation with CC_HAS_ASM_IMMEDIATE_STRINGS and fallback to the generic non-exception based warning implementation for incompatible compilers. [1] 6584ff203aec ("bugs/s390: Use 'cond_str' in __EMIT_BUG()") Reviewed-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2026-01-20mm/mmu_gather: remove @delay_remap of __tlb_remove_page_size()Wei Yang-4/+2
__tlb_remove_page_size() is only used in tlb_remove_page_size() with @delay_remap set to false and it is passed directly to __tlb_remove_folio_pages_size(). Remove @delay_remap of __tlb_remove_page_size() and call __tlb_remove_folio_pages_size() with false @delay_remap. Link: https://lkml.kernel.org/r/20251231030026.15938-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: SeongJae Park <sj@kernel.org> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Acked-by: Will Deacon <will@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390 Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20treewide: provide a generic clear_user_page() variantDavid Hildenbrand-1/+0
Patch series "mm: folio_zero_user: clear page ranges", v11. This series adds clearing of contiguous page ranges for hugepages. The series improves on the current discontiguous clearing approach in two ways: - clear pages in a contiguous fashion. - use batched clearing via clear_pages() wherever exposed. The first is useful because it allows us to make much better use of hardware prefetchers. The second, enables advertising the real extent to the processor. Where specific instructions support it (ex. string instructions on x86; "mops" on arm64 etc), a processor can optimize based on this because, instead of seeing a sequence of 8-byte stores, or a sequence of 4KB pages, it sees a larger unit being operated on. For instance, AMD Zen uarchs (for extents larger than LLC-size) switch to a mode where they start eliding cacheline allocation. This is helpful not just because it results in higher bandwidth, but also because now the cache is not evicting useful cachelines and replacing them with zeroes. Demand faulting a 64GB region shows performance improvement: $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5 baseline +series (GBps +- %stdev) (GBps +- %stdev) pg-sz=2MB 11.76 +- 1.10% 25.34 +- 1.18% [*] +115.47% preempt=* pg-sz=1GB 24.85 +- 2.41% 39.22 +- 2.32% + 57.82% preempt=none|voluntary pg-sz=1GB (similar) 52.73 +- 0.20% [#] +112.19% preempt=full|lazy [*] This improvement is because switching to sequential clearing allows the hardware prefetchers to do a much better job. [#] For pg-sz=1GB a large part of the improvement is because of the cacheline elision mentioned above. preempt=full|lazy improves upon that because, not needing explicit invocations of cond_resched() to ensure reasonable preemption latency, it can clear the full extent as a single unit. In comparison the maximum extent used for preempt=none|voluntary is PROCESS_PAGES_NON_PREEMPT_BATCH (32MB). When provided the full extent the processor forgoes allocating cachelines on this path almost entirely. (The hope is that eventually, in the fullness of time, the lazy preemption model will be able to do the same job that none or voluntary models are used for, allowing us to do away with cond_resched().) Raghavendra also tested previous version of the series on AMD Genoa and sees similar improvement [1] with preempt=lazy. $ perf bench mem map -p $page-size -f populate -s 64GB -l 10 base patched change pg-sz=2MB 12.731939 GB/sec 26.304263 GB/sec 106.6% pg-sz=1GB 26.232423 GB/sec 61.174836 GB/sec 133.2% This patch (of 8): Let's drop all variants that effectively map to clear_page() and provide it in a generic variant instead. We'll use the macro clear_user_page to indicate whether an architecture provides it's own variant. Also, clear_user_page() is only called from the generic variant of clear_user_highpage(), so define it only if the architecture does not provide a clear_user_highpage(). And, for simplicity define it in linux/highmem.h. Note that for parisc, clear_page() and clear_user_page() map to clear_page_asm(), so we can just get rid of the custom clear_user_page() implementation. There is a clear_user_page_asm() function on parisc, that seems to be unused. Not sure what's up with that. Link: https://lkml.kernel.org/r/20260107072009.1615991-1-ankur.a.arora@oracle.com Link: https://lkml.kernel.org/r/20260107072009.1615991-2-ankur.a.arora@oracle.com Signed-off-by: David Hildenbrand <david@redhat.com> Co-developed-by: Ankur Arora <ankur.a.arora@oracle.com> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ankur Arora <ankur.a.arora@oracle.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: David Hildenbrand <david@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Konrad Rzessutek Wilk <konrad.wilk@oracle.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Li Zhe <lizhe.67@bytedance.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@amd.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-17s390/preempt: Optimize __preempt_count_dec_and_test()Heiko Carstens-0/+15
Provide an inline assembly using alternatives to avoid the need of a base register due to relocatable lowcore when adding or subtracting small constants from preempt_count. Main user is preempt_enable(), which subtracts one from preempt_count and tests if the result is zero. With this the generated code changes from 1000b8: a7 19 00 00 lghi %r1,0 1000bc: eb ff 13 a8 00 6e alsi 936(%r1),-1 1000c2: a7 54 00 05 jnhe 1000cc <__rcu_read_unlock+0x14> to something like this: 1000b8: eb ff 03 a8 00 6e alsi 936,-1 1000be: a7 54 00 05 jnhe 1000c8 <__rcu_read_unlock+0x10> Kernel image size is reduced by 45kb (bloat-o-meter -t, defconfig, gcc15). Reviewed-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2026-01-17s390/asm: Let __HAVE_ASM_FLAG_OUTPUTS__ define 1Heiko Carstens-1/+1
With the empty define __is_enabled(__HAVE_ASM_FLAG_OUTPUTS__) evaluates to false. Therefore let __HAVE_ASM_FLAG_OUTPUTS__ define 1 if it is defined. This allows to make use of __is_defined(__HAVE_ASM_FLAG_OUTPUTS__) like expected. Reviewed-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2026-01-17s390/preempt: Optimize __preemp_count_add()/__preempt_count_sub()Heiko Carstens-1/+11
Provide an inline assembly using alternatives to avoid the need of a base register due to relocatable lowcore when adding or subtracting small constants from preempt_count. Main user is preempt_disable(), which subtracts one from preempt_count. With this the generated code changes from 10012c: a7 b9 00 00 lghi %r11,0 100130: eb 01 b3 a8 00 6a asi 936(%r11),1 to something like this: 10012c: eb 01 03 a8 00 6a asi 936,1 Kernel image size is reduced by 13kb (bloat-o-meter -t, defconfig, gcc15). Reviewed-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2026-01-17s390/preempt: Optimize preempt_count()Heiko Carstens-2/+18
Provide an inline assembly using alternatives to avoid the need of a base register when reading preempt_count() from lowcore. Use the LLGT instruction, which reads only the least significant 31 bits of preempt_count. This masks out the encoded PREEMPT_NEED_RESCHED bit. Generated code is changed from 000000000046e5d0 <vfree>: 46e5d0: c0 04 00 00 00 00 jgnop 46e5d0 <vfree> 46e5d6: a7 39 00 00 lghi %r3,0 46e5da: 58 10 33 a8 l %r1,936(%r3) 46e5de: c0 1b 00 ff ff 00 nilf %r1,16776960 46e5e4: a7 74 00 11 jne 46e606 <vfree+0x36> to something like this: 000000000046e5d0 <vfree>: 46e5d0: c0 04 00 00 00 00 jgnop 46e5d0 <vfree> 46e5d6: e3 10 03 a8 00 17 llgt %r1,936 46e5dc: ec 41 28 b7 00 55 risbgz %r4,%r1,40,55 46e5e2: a7 74 00 0f jne 46e600 <vfree+0x30> Overall savings are only 82 bytes according to bloat-o-meter. This is because of different inlining decisions, and there aren't many preempt_count() users in the kernel. Reviewed-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2026-01-02s390/ap: Fix typo in function name referenceJulia Lawall-1/+1
Add missing s into ap_intructions_available. Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Reviewed-by: Jimmy Brisson <jbrisson@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-12-22s390/ptrace: Convert function macros to inline functionsJens Remus-11/+26
Convert the function macros user_mode(), instruction_pointer(), and user_stack_pointer() to inline functions, to align their definition with x86 and arm64. Use const qualifier on struct pt_regs pointer parameters to prevent compiler warnings: arch/s390/kernel/stacktrace.c: In function ‘arch_stack_walk_user_common’: arch/s390/kernel/stacktrace.c:114:34: warning: passing argument 1 of ‘instruction_pointer’ discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers] ... arch/s390/kernel/stacktrace.c:117:48: warning: passing argument 1 of ‘user_stack_pointer’ discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers] ... While at it add const qualifier to all struct pt_regs pointer parameters that are accessed read-only and use __always_inline instead of inline to harmonize the helper functions. [hca@linux.ibm.com: Use psw_bits() helper] Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Jens Remus <jremus@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-12-08s390/bug: Add missing alignmentHeiko Carstens-0/+1
All objects are supposed to have a minimal alignment of two, since a couple of instructions only work with even addresses. Add the missing align statement for the file string. Fixes: 6584ff203aec ("bugs/s390: Use 'cond_str' in __EMIT_BUG()") Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-12-08s390/bug: Add missing CONFIG_BUG ifdef againHeiko Carstens-0/+4
Fallback to generic BUG implementation in case CONFIG_BUG is disabled. This restores the old behaviour before 'cond_str' support was added. It probably doesn't matter, since nobody should disable CONFIG_BUG, but at least this is consistent to before. Fixes: 6584ff203aec ("bugs/s390: Use 'cond_str' in __EMIT_BUG()") Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-12-07s390/pci: Migrate s390 IRQ logic to IRQ domain APITobias Schumacher-0/+5
s390 is one of the last architectures using the legacy API for setup and teardown of PCI MSI IRQs. Migrate the s390 IRQ allocation and teardown to the MSI parent domain API. For details, see: https://lore.kernel.org/lkml/20221111120501.026511281@linutronix.de In detail, create an MSI parent domain for each PCI domain. When a PCI device sets up MSI or MSI-X IRQs, the library creates a per-device IRQ domain for this device, which is used by the device for allocating and freeing IRQs. The per-device domain delegates this allocation and freeing to the parent-domain. In the end, the corresponding callbacks of the parent domain are responsible for allocating and freeing the IRQs. The allocation is split into two parts: - zpci_msi_prepare() is called once for each device and allocates the required resources. On s390, each PCI function has its own airq vector and a summary bit, which must be configured once per function. This is done in prepare(). - zpci_msi_alloc() can be called multiple times for allocating one or more MSI/MSI-X IRQs. This creates a mapping between the virtual IRQ number in the kernel and the hardware IRQ number. Freeing is split into two counterparts: - zpci_msi_free() reverts the effects of zpci_msi_alloc() and - zpci_msi_teardown() reverts the effects of zpci_msi_prepare(). This is called once when all IRQs are freed before a device is removed. Since the parent domain in the end allocates the IRQs, the hwirq encoding must be unambiguous for all IRQs of all devices. This is achieved by encoding the hwirq using the devfn and the MSI index. Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com> Reviewed-by: Farhan Ali <alifm@linux.ibm.com> Signed-off-by: Tobias Schumacher <ts@linux.ibm.com> Reviewed-by: Gerd Bayer <gbayer@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-12-07s390/vmem: Support 2G page splitting for KASAN shadow freeingVasily Gorbik-0/+2
Export split_pud_page() so it can be used from the vmem code and teach modify_pud_table() to split PUD-sized mappings when only a subrange needs to be removed. If the range to be removed covers a full PUD-sized mapping, keep the existing behavior: clear the PUD entry and free the backing large page (for non-direct mappings). Otherwise, split the PUD-mapped page into PMD mappings and let the walker handle the smaller ranges. This is needed for KASAN early shadow removal support: memory hotplug freeing the KASAN early shadow is the only expected caller that will try to free 2G PUD-mapped regions of non-direct mappings. Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-12-05Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds-4/+5
Pull KVM updates from Paolo Bonzini: "ARM: - Support for userspace handling of synchronous external aborts (SEAs), allowing the VMM to potentially handle the abort in a non-fatal manner - Large rework of the VGIC's list register handling with the goal of supporting more active/pending IRQs than available list registers in hardware. In addition, the VGIC now supports EOImode==1 style deactivations for IRQs which may occur on a separate vCPU than the one that acked the IRQ - Support for FEAT_XNX (user / privileged execute permissions) and FEAT_HAF (hardware update to the Access Flag) in the software page table walkers and shadow MMU - Allow page table destruction to reschedule, fixing long need_resched latencies observed when destroying a large VM - Minor fixes to KVM and selftests Loongarch: - Get VM PMU capability from HW GCFG register - Add AVEC basic support - Use 64-bit register definition for EIOINTC - Add KVM timer test cases for tools/selftests RISC/V: - SBI message passing (MPXY) support for KVM guest - Give a new, more specific error subcode for the case when in-kernel AIA virtualization fails to allocate IMSIC VS-file - Support KVM_DIRTY_LOG_INITIALLY_SET, enabling dirty log gradually in small chunks - Fix guest page fault within HLV* instructions - Flush VS-stage TLB after VCPU migration for Andes cores s390: - Always allocate ESCA (Extended System Control Area), instead of starting with the basic SCA and converting to ESCA with the addition of the 65th vCPU. The price is increased number of exits (and worse performance) on z10 and earlier processor; ESCA was introduced by z114/z196 in 2010 - VIRT_XFER_TO_GUEST_WORK support - Operation exception forwarding support - Cleanups x86: - Skip the costly "zap all SPTEs" on an MMIO generation wrap if MMIO SPTE caching is disabled, as there can't be any relevant SPTEs to zap - Relocate a misplaced export - Fix an async #PF bug where KVM would clear the completion queue when the guest transitioned in and out of paging mode, e.g. when handling an SMI and then returning to paged mode via RSM - Leave KVM's user-return notifier registered even when disabling virtualization, as long as kvm.ko is loaded. On reboot/shutdown, keeping the notifier registered is ok; the kernel does not use the MSRs and the callback will run cleanly and restore host MSRs if the CPU manages to return to userspace before the system goes down - Use the checked version of {get,put}_user() - Fix a long-lurking bug where KVM's lack of catch-up logic for periodic APIC timers can result in a hard lockup in the host - Revert the periodic kvmclock sync logic now that KVM doesn't use a clocksource that's subject to NTP corrections - Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the latter behind CONFIG_CPU_MITIGATIONS - Context switch XCR0, XSS, and PKRU outside of the entry/exit fast path; the only reason they were handled in the fast path was to paper of a bug in the core #MC code, and that has long since been fixed - Add emulator support for AVX MOV instructions, to play nice with emulated devices whose guest drivers like to access PCI BARs with large multi-byte instructions x86 (AMD): - Fix a few missing "VMCB dirty" bugs - Fix the worst of KVM's lack of EFER.LMSLE emulation - Add AVIC support for addressing 4k vCPUs in x2AVIC mode - Fix incorrect handling of selective CR0 writes when checking intercepts during emulation of L2 instructions - Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32] on VMRUN and #VMEXIT - Fix a bug where KVM corrupt the guest code stream when re-injecting a soft interrupt if the guest patched the underlying code after the VM-Exit, e.g. when Linux patches code with a temporary INT3 - Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits to userspace, and extend KVM "support" to all policy bits that don't require any actual support from KVM x86 (Intel): - Use the root role from kvm_mmu_page to construct EPTPs instead of the current vCPU state, partly as worthwhile cleanup, but mostly to pave the way for tracking per-root TLB flushes, and elide EPT flushes on pCPU migration if the root is clean from a previous flush - Add a few missing nested consistency checks - Rip out support for doing "early" consistency checks via hardware as the functionality hasn't been used in years and is no longer useful in general; replace it with an off-by-default module param to WARN if hardware fails a check that KVM does not perform - Fix a currently-benign bug where KVM would drop the guest's SPEC_CTRL[63:32] on VM-Enter - Misc cleanups - Overhaul the TDX code to address systemic races where KVM (acting on behalf of userspace) could inadvertantly trigger lock contention in the TDX-Module; KVM was either working around these in weird, ugly ways, or was simply oblivious to them (though even Yan's devilish selftests could only break individual VMs, not the host kernel) - Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a TDX vCPU, if creating said vCPU failed partway through - Fix a few sparse warnings (bad annotation, 0 != NULL) - Use struct_size() to simplify copying TDX capabilities to userspace - Fix a bug where TDX would effectively corrupt user-return MSR values if the TDX Module rejects VP.ENTER and thus doesn't clobber host MSRs as expected Selftests: - Fix a math goof in mmu_stress_test when running on a single-CPU system/VM - Forcefully override ARCH from x86_64 to x86 to play nice with specifying ARCH=x86_64 on the command line - Extend a bunch of nested VMX to validate nested SVM as well - Add support for LA57 in the core VM_MODE_xxx macro, and add a test to verify KVM can save/restore nested VMX state when L1 is using 5-level paging, but L2 is not - Clean up the guest paging code in anticipation of sharing the core logic for nested EPT and nested NPT guest_memfd: - Add NUMA mempolicy support for guest_memfd, and clean up a variety of rough edges in guest_memfd along the way - Define a CLASS to automatically handle get+put when grabbing a guest_memfd from a memslot to make it harder to leak references - Enhance KVM selftests to make it easer to develop and debug selftests like those added for guest_memfd NUMA support, e.g. where test and/or KVM bugs often result in hard-to-debug SIGBUS errors - Misc cleanups Generic: - Use the recently-added WQ_PERCPU when creating the per-CPU workqueue for irqfd cleanup - Fix a goof in the dirty ring documentation - Fix choice of target for directed yield across different calls to kvm_vcpu_on_spin(); the function was always starting from the first vCPU instead of continuing the round-robin search" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (260 commits) KVM: arm64: at: Update AF on software walk only if VM has FEAT_HAFDBS KVM: arm64: at: Use correct HA bit in TCR_EL2 when regime is EL2 KVM: arm64: Document KVM_PGTABLE_PROT_{UX,PX} KVM: arm64: Fix spelling mistake "Unexpeced" -> "Unexpected" KVM: arm64: Add break to default case in kvm_pgtable_stage2_pte_prot() KVM: arm64: Add endian casting to kvm_swap_s[12]_desc() KVM: arm64: Fix compilation when CONFIG_ARM64_USE_LSE_ATOMICS=n KVM: arm64: selftests: Add test for AT emulation KVM: arm64: nv: Expose hardware access flag management to NV guests KVM: arm64: nv: Implement HW access flag management in stage-2 SW PTW KVM: arm64: Implement HW access flag management in stage-1 SW PTW KVM: arm64: Propagate PTW errors up to AT emulation KVM: arm64: Add helper for swapping guest descriptor KVM: arm64: nv: Use pgtable definitions in stage-2 walk KVM: arm64: Handle endianness in read helper for emulated PTW KVM: arm64: nv: Stop passing vCPU through void ptr in S2 PTW KVM: arm64: Call helper for reading descriptors directly KVM: arm64: nv: Advertise support for FEAT_XNX KVM: arm64: Teach ptdump about FEAT_XNX permissions KVM: s390: Use generic VIRT_XFER_TO_GUEST_WORK functions ...