| Age | Commit message (Collapse) | Author | Lines |
|
https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
- gmap rewrite: completely new memory management for kvm/s390
- vSIE improvement
- maintainership change for s390 vfio-pci
- small quality of life improvement for protected guests
|
|
KVM x86 APIC-ish changes for 6.20
- Fix a benign bug where KVM could use the wrong memslots (ignored SMM) when
creating a vCPU-specific mapping of guest memory.
- Clean up KVM's handling of marking mapped vCPU pages dirty.
- Drop a pile of *ancient* sanity checks hidden behind in KVM's unused
ASSERT() macro, most of which could be trivially triggered by the guest
and/or user, and all of which were useless.
- Fold "struct dest_map" into its sole user, "struct rtc_status", to make it
more obvious what the weird parameter is used for, and to allow burying the
RTC shenanigans behind CONFIG_KVM_IOAPIC=y.
- Bury all of ioapic.h and KVM_IRQCHIP_KERNEL behind CONFIG_KVM_IOAPIC=y.
- Add a regression test for recent APICv update fixes.
- Rework KVM's handling of VMCS updates while L2 is active to temporarily
switch to vmcs01 instead of deferring the update until the next nested
VM-Exit. The deferred updates approach directly contributed to several
bugs, was proving to be a maintenance burden due to the difficulty in
auditing the correctness of deferred updates, and was polluting
"struct nested_vmx" with a growing pile of booleans.
- Handle "hardware APIC ISR", a.k.a. SVI, updates in kvm_apic_update_apicv()
to consolidate the updates, and to co-locate SVI updates with the updates
for KVM's own cache of ISR information.
- Drop a dead function declaration.
|
|
KVM/riscv changes for 6.20
- Fixes for issues discoverd by KVM API fuzzing in
kvm_riscv_aia_imsic_has_attr(), kvm_riscv_aia_imsic_rw_attr(),
and kvm_riscv_vcpu_aia_imsic_update()
- Allow Zalasr, Zilsd and Zclsd extensions for Guest/VM
- Add riscv vm satp modes in KVM selftests
- Transparent huge page support for G-stage
- Adjust the number of available guest irq files based on
MMIO register sizes in DeviceTree or ACPI
|
|
KVM SVM changes for 6.20
- Drop a user-triggerable WARN on nested_svm_load_cr3() failure.
- Add support for virtualizing ERAPS. Note, correct virtualization of ERAPS
relies on an upcoming, publicly announced change in the APM to reduce the
set of conditions where hardware (i.e. KVM) *must* flush the RAP.
- Ignore nSVM intercepts for instructions that are not supported according to
L1's virtual CPU model.
- Add support for expedited writes to the fast MMIO bus, a la VMX's fastpath
for EPT Misconfig.
- Don't set GIF when clearing EFER.SVME, as GIF exists independently of SVM,
and allow userspace to restore nested state with GIF=0.
- Treat exit_code as an unsigned 64-bit value through all of KVM.
- Add support for fetching SNP certificates from userspace.
- Fix a bug where KVM would use vmcb02 instead of vmcb01 when emulating VMLOAD
or VMSAVE on behalf of L2.
- Misc fixes and cleanups.
|
|
KVM selftests changes for 6.20
- Add a regression test for TPR<=>CR8 synchronization and IRQ masking.
- Overhaul selftest's MMU infrastructure to genericize stage-2 MMU support,
and extend x86's infrastructure to support EPT and NPT (for L2 guests).
- Extend several nested VMX tests to also cover nested SVM.
- Add a selftest for nested VMLOAD/VMSAVE.
- Rework the nested dirty log test, originally added as a regression test for
PML where KVM logged L2 GPAs instead of L1 GPAs, to improve test coverage
and to hopefully make the test easier to understand and maintain.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 7.0
- Add support for FEAT_IDST, allowing ID registers that are not
implemented to be reported as a normal trap rather than as an UNDEF
exception.
- Add sanitisation of the VTCR_EL2 register, fixing a number of
UXN/PXN/XN bugs in the process.
- Full handling of RESx bits, instead of only RES0, and resulting in
SCTLR_EL2 being added to the list of sanitised registers.
- More pKVM fixes for features that are not supposed to be exposed to
guests.
- Make sure that MTE being disabled on the pKVM host doesn't give it
the ability to attack the hypervisor.
- Allow pKVM's host stage-2 mappings to use the Force Write Back
version of the memory attributes by using the "pass-through'
encoding.
- Fix trapping of ICC_DIR_EL1 on GICv5 hosts emulating GICv3 for the
guest.
- Preliminary work for guest GICv5 support.
- A bunch of debugfs fixes, removing pointless custom iterators stored
in guest data structures.
- A small set of FPSIMD cleanups.
- Selftest fixes addressing the incorrect alignment of page
allocation.
- Other assorted low-impact fixes and spelling fixes.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD
LoongArch KVM changes for v6.20
1. Add more CPUCFG mask bits.
2. Improve feature detection.
3. Add FPU/LBT delay load support.
4. Set default return value in KVM IO bus ops.
5. Add paravirt preempt feature support.
6. Add KVM steal time test case for tools/selftests.
|
|
The KVM RISC-V allows Zalasr extensions for Guest/VM so add this
extension to get-reg-list test.
Signed-off-by: Xu Lu <luxu.kernel@bytedance.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20251020042904.32096-1-luxu.kernel@bytedance.com
Signed-off-by: Anup Patel <anup@brainfault.org>
|
|
Current vm modes cannot represent riscv guest modes precisely, here add
all 9 combinations of P(56,40,41) x V(57,48,39). Also the default vm
mode is detected on runtime instead of hardcoded one, which might not be
supported on specific machine.
Signed-off-by: Wu Fei <wu.fei9@sanechips.com.cn>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Nutty Liu <nutty.liu@hotmail.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20251105151442.28767-1-wu.fei9@sanechips.com.cn
Signed-off-by: Anup Patel <anup@brainfault.org>
|
|
The KVM RISC-V allows Zilsd and Zclsd extensions for Guest/VM so add
this extension to get-reg-list test.
Signed-off-by: Pincheng Wang <pincheng.plct@isrc.iscas.ac.cn>
Reviewed-by: Nutty Liu <nutty.liu@hotmail.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20250826162939.1494021-6-pincheng.plct@isrc.iscas.ac.cn
Signed-off-by: Anup Patel <anup@brainfault.org>
|
|
LoongArch KVM supports steal time accounting now, here add steal time
test case on LoongArch.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
* kvm-arm64/resx:
: .
: Add infrastructure to deal with the full gamut of RESx bits
: for NV. As a result, it is now possible to have the expected
: semantics for some bits such as SCTLR_EL2.SPAN.
: .
KVM: arm64: Add debugfs file dumping computed RESx values
KVM: arm64: Add sanitisation to SCTLR_EL2
KVM: arm64: Remove all traces of HCR_EL2.MIOCNCE
KVM: arm64: Remove all traces of FEAT_TME
KVM: arm64: Simplify handling of full register invalid constraint
KVM: arm64: Get rid of FIXED_VALUE altogether
KVM: arm64: Simplify handling of HCR_EL2.E2H RESx
KVM: arm64: Move RESx into individual register descriptors
KVM: arm64: Add RES1_WHEN_E2Hx constraints as configuration flags
KVM: arm64: Add REQUIRES_E2H1 constraint as configuration flags
KVM: arm64: Simplify FIXED_VALUE handling
KVM: arm64: Convert HCR_EL2.RW to AS_RES1
KVM: arm64: Correctly handle SCTLR_EL1 RES1 bits for unsupported features
KVM: arm64: Allow RES1 bits to be inferred from configuration
KVM: arm64: Inherit RESx bits from FGT register descriptors
KVM: arm64: Extend unified RESx handling to runtime sanitisation
KVM: arm64: Introduce data structure tracking both RES0 and RES1 bits
KVM: arm64: Introduce standalone FGU computing primitive
KVM: arm64: Remove duplicate configuration for SCTLR_EL1.{EE,E0E}
arm64: Convert SCTLR_EL2 to sysreg infrastructure
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
FEAT_TME has been dropped from the architecture. Retrospectively.
I'm sure someone is crying somewhere, but most of us won't.
Clean-up time.
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260202184329.2724080-18-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
This test allows to test the various storage key handling functions.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
|
|
Some distributions (such as Ubuntu) configure GCC so that
_FORTIFY_SOURCE is automatically enabled at -O1 or above. This results
in some fortified version of definitions of standard library functions
are included. While linker resolves the symbols, the fortified versions
might override the definitions in lib/string_override.c and reference to
those PLT entries in GLIBC. This is not a problem for the code in host,
but it is a disaster for the guest code. E.g., if build and run
x86/nested_emulation_test on Ubuntu 24.04 will encounter a L1 #PF due to
memset() reference to __memset_chk@plt.
The option -fno-builtin-memset is not helpful here, because those
fortified versions are not built-in but some definitions which are
included by header, they are for different intentions.
In order to eliminate the unpredictable behaviors may vary depending on
the linker and platform, add the "-U_FORTIFY_SOURCE" into CFLAGS to
prevent from introducing the fortified definitions.
Signed-off-by: Zhiquan Li <zhiquan_li@163.com>
Link: https://patch.msgid.link/20260122053551.548229-1-zhiquan_li@163.com
Fixes: 6b6f71484bf4 ("KVM: selftests: Implement memcmp(), memcpy(), and memset() for guest use")
Cc: stable@vger.kernel.org
[sean: tag for stable]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
* kvm-arm64/feat_idst:
: .
: Add support for FEAT_IDST, allowing ID registers that are not implemented
: to be reported as a normal trap rather than as an UNDEF exception.
: .
KVM: arm64: selftests: Add a test for FEAT_IDST
KVM: arm64: pkvm: Report optional ID register traps with a 0x18 syndrome
KVM: arm64: pkvm: Add a generic synchronous exception injection primitive
KVM: arm64: Force trap of GMID_EL1 when the guest doesn't have MTE
KVM: arm64: Handle CSSIDR2_EL1 and SMIDR_EL1 in a generic way
KVM: arm64: Handle FEAT_IDST for sysregs without specific handlers
KVM: arm64: Add a generic synchronous exception injection primitive
KVM: arm64: Add trap routing for GMID_EL1
arm64: Repaint ID_AA64MMFR2_EL1.IDS description
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Update the nested dirty log test to validate KVM's handling of READ faults
when dirty logging is enabled. Specifically, set the Dirty bit in the
guest PTEs used to map L2 GPAs, so that KVM will create writable SPTEs
when handling L2 read faults. When handling read faults in the shadow MMU,
KVM opportunistically creates a writable SPTE if the mapping can be
writable *and* the gPTE is dirty (or doesn't support the Dirty bit), i.e.
if KVM doesn't need to intercept writes in order to emulate Dirty-bit
updates.
To actually test the L2 READ=>WRITE sequence, e.g. without masking a false
pass by other test activity, route the READ=>WRITE and WRITE=>WRITE
sequences to separate L1 pages, and differentiate between "marked dirty
due to a WRITE access/fault" and "marked dirty due to creating a writable
SPTE for a READ access/fault". The updated sequence exposes the bug fixed
by KVM commit 1f4e5fc83a42 ("KVM: x86: fix nested guest live migration
with PML") when the guest performs a READ=>WRITE sequence with dirty guest
PTEs.
Opportunistically tweak and rename the address macros, and add comments,
to make it more obvious what the test is doing. E.g. NESTED_TEST_MEM1
vs. GUEST_TEST_MEM doesn't make it all that obvious that the test is
creating aliases in both the L2 GPA and GVA address spaces, but only when
L1 is using TDP to run L2.
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20260115172154.709024-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Fix minor documentation errors in `kvm_util.h` and `kvm_util.c`.
- Correct the argument description for `vcpu_args_set` in `kvm_util.h`,
which incorrectly listed `vm` instead of `vcpu`.
- Fix a typo in the comment for `kvm_selftest_arch_init` ("exeucting" ->
"executing").
- Correct the return value description for `vm_vaddr_unused_gap` in
`kvm_util.c` to match the implementation, which returns an address "at
or above" `vaddr_min`, not "at or below".
No functional change intended.
Reviewed-by: Andrew Jones <andrew.jones@linux.dev>
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260109082218.3236580-6-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
To avoid code duplication, move page_align() to the shared `kvm_util.h`
header file. Rename it to vm_page_align(), to make it clear that the
alignment is done with respect to the guest's base page size.
No functional change intended.
Reviewed-by: Andrew Jones <andrew.jones@linux.dev>
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260109082218.3236580-5-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The implementation of `page_align()` in `processor.c` calculates
alignment incorrectly for values that are already aligned. Specifically,
`(v + vm->page_size) & ~(vm->page_size - 1)` aligns to the *next* page
boundary even if `v` is already page-aligned, potentially wasting a page
of memory.
Fix the calculation to use standard alignment logic: `(v + vm->page_size
- 1) & ~(vm->page_size - 1)`.
Fixes: 3e06cdf10520 ("KVM: selftests: Add initial support for RISC-V 64-bit")
Reviewed-by: Andrew Jones <andrew.jones@linux.dev>
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260109082218.3236580-4-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The implementation of `page_align()` in `processor.c` calculates
alignment incorrectly for values that are already aligned. Specifically,
`(v + vm->page_size) & ~(vm->page_size - 1)` aligns to the *next* page
boundary even if `v` is already page-aligned, potentially wasting a page
of memory.
Fix the calculation to use standard alignment logic: `(v + vm->page_size
- 1) & ~(vm->page_size - 1)`.
Fixes: 7a6629ef746d ("kvm: selftests: add virt mem support for aarch64")
Reviewed-by: Andrew Jones <andrew.jones@linux.dev>
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260109082218.3236580-3-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
KVM selftests map all guest code and data into the lower virtual address
range (0x0000...) managed by TTBR0_EL1. The upper range (0xFFFF...)
managed by TTBR1_EL1 is unused and uninitialized.
If a guest accesses the upper range, the MMU attempts a translation
table walk using uninitialized registers, leading to unpredictable
behavior.
Set `TCR_EL1.EPD1` to disable translation table walks for TTBR1_EL1,
ensuring that any access to the upper range generates an immediate
Translation Fault. Additionally, set `TCR_EL1.TBI1` (Top Byte Ignore) to
ensure that tagged pointers in the upper range also deterministically
trigger a Translation Fault via EPD1.
Define `TCR_EPD1_MASK`, `TCR_EPD1_SHIFT`, and `TCR_TBI1` in
`processor.h` to support this configuration. These are based on their
definitions in `arch/arm64/include/asm/pgtable-hwdef.h`.
Suggested-by: Will Deacon <will@kernel.org>
Reviewed-by: Itaru Kitayama <itaru.kitayama@fujitsu.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260109082218.3236580-2-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Add a very basic test checking that FEAT_IDST actually works for
the {GMID,SMIDR,CSSIDR2}_EL1 registers.
Link: https://patch.msgid.link/20260108173233.2911955-10-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Add a test for VMLOAD/VMSAVE in an L2 guest. The test verifies that L1
intercepts for VMSAVE/VMLOAD always work regardless of
VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK.
Then, more interestingly, it makes sure that when L1 does not intercept
VMLOAD/VMSAVE, they work as intended in L2. When
VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK is enabled by L1, VMSAVE/VMLOAD from
L2 should interpret the GPA as an L2 GPA and translate it through the
NPT. When VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK is disabled by L1,
VMSAVE/VMLOAD from L2 should interpret the GPA as an L1 GPA.
To test this, put two VMCBs (0 and 1) in L1's physical address space,
and have a single L2 GPA where:
- L2 VMCB GPA == L1 VMCB(0) GPA
- L2 VMCB GPA maps to L1 VMCB(1) via the NPT in L1.
This setup allows detecting how the GPA is interpreted based on which L1
VMCB is actually accessed.
In both cases, L2 sets KERNEL_GS_BASE (one of the fields handled by
VMSAVE/VMLOAD), and executes VMSAVE to write its value to the VMCB. The
test userspace code then checks that the write was made to the correct
VMCB (based on whether VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK is set by L1),
and writes a new value to that VMCB. L2 then executes VMLOAD to load the
new value and makes sure it's reflected correctly in KERNERL_GS_BASE.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20260110004821.3411245-4-yosry.ahmed@linux.dev
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Instead of calling memstress_setup_ept_mappings() only in the first
iteration in the loop, move it before the loop.
The call needed to happen within the loop before commit e40e72fec0de
("KVM: selftests: Stop passing VMX metadata to TDP mapping functions"),
as memstress_setup_ept_mappings() used to take in a pointer to vmx_pages
and pass it into tdp_identity_map_1g() (to get the EPT root GPA). This
is no longer the case, as tdp_identity_map_1g() gets the EPT root
through stage2 MMU.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20260113171456.2097312-1-yosry.ahmed@linux.dev
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Fix KVM's long-standing buggy handling of SVM's exit_code as a 32-bit
value. Per the APM and Xen commit d1bd157fbc ("Big merge the HVM
full-virtualisation abstractions.") (which is arguably more trustworthy
than KVM), offset 0x70 is a single 64-bit value:
070h 63:0 EXITCODE
Track exit_code as a single u64 to prevent reintroducing bugs where KVM
neglects to correctly set bits 63:32.
Fixes: 6aa8b732ca01 ("[PATCH] kvm: userspace interface")
Cc: Jim Mattson <jmattson@google.com>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251230211347.4099600-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a test to verify KVM correctly handles a variety of edge cases related
to APICv updates, and in particular updates that are triggered while L2 is
actively running.
Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://patch.msgid.link/20260109034532.1012993-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Rework the AMX test's #NM handling to use kvm_asm_safe() to verify an #NM
actually occurs. As is, a completely missing #NM could go unnoticed.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The host is allowed to set FPU state that includes a disabled
xstate component. Check that this does not cause bad effects.
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Rework the guest=>host syncs in the AMX test to use named actions instead
of arbitrary, incrementing numbers. The "stage" of the test has no real
meaning, what matters is what action the test wants the host to perform.
The incrementing numbers are somewhat helpful for triaging failures, but
fully debugging failures almost always requires a much deeper dive into
the test (and KVM).
Using named actions not only makes it easier to extend the test without
having to shift all sync point numbers, it makes the code easier to read.
[Commit message by Sean Christopherson]
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add test cases for the validation checks in svm_set_nested_state(), and
allow the test to run with SVM as well as VMX. The SVM test also makes
sure that KVM_SET_NESTED_STATE accepts GIF being set or cleared if
EFER.SVME is cleared, verifying a recently fixed bug where GIF was
incorrectly expected to always be set when EFER.SVME is cleared.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251121204803.991707-5-yosry.ahmed@linux.dev
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
The assert messages do not add much value, so use TEST_ASSERT_EQ(),
which also nicely displays the addresses in hex. While at it, also
assert the values of state->flags.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251121204803.991707-4-yosry.ahmed@linux.dev
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Shorten the API to get a PTE as the "PTE" acronym is ubiquitous, and the
"page table entry" makes it unnecessarily difficult to quickly understand
what callers are doing.
No functional change intended.
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251230230150.4150236-21-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add L1 SVM code and generalize the setup code to work for both VMX and
SVM. This allows running 'dirty_log_perf_test -n' on AMD CPUs.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251230230150.4150236-20-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Generalize the code in vmx_dirty_log_test.c by adding SVM-specific L1
code, doing some renaming (e.g. EPT -> TDP), and having setup code for
both SVM and VMX in test_dirty_log().
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251230230150.4150236-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
According to the APM, NPT walks are treated as user accesses. In
preparation for supporting NPT mappings, set the 'user' bit on NPTs by
adding a mask of bits to always be set on PTEs in kvm_mmu.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251230230150.4150236-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Implement nCR3 and NPT initialization functions, similar to the EPT
equivalents, and create common TDP helpers for enablement checking and
initialization. Enable NPT for nested guests by default if the TDP MMU
was initialized, similar to VMX.
Reuse the PTE masks from the main MMU in the NPT MMU, except for the C
and S bits related to confidential VMs.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251230230150.4150236-17-seanjc@google.com
[sean: apply Yosry's fixup for ncr3_gpa]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
In preparation for generalizing the nested dirty logging test, checking
if either EPT or NPT is enabled will be needed. To avoid needing to gate
the kvm_cpu_has_ept() call by the CPU type, make sure the function
returns false if VMX is not available instead of trying to read VMX-only
MSRs.
No functional change intended.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251230230150.4150236-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Now that the functions are no longer VMX-specific, move them to
processor.c. Do a minor comment tweak replacing 'EPT' with 'TDP'.
No functional change intended.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251230230150.4150236-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Rework tdp_map() and friends to use __virt_pg_map() and drop the custom
EPT code in __tdp_pg_map() and tdp_create_pte(). The EPT code and
__virt_pg_map() are practically identical, the main differences are:
- EPT uses the EPT struct overlay instead of the PTE masks.
- EPT always assumes 4-level EPTs.
To reuse __virt_pg_map(), extend the PTE masks to work with EPT's RWX and
X-only capabilities, and provide a tdp_mmu_init() API so that EPT can pass
in the EPT PTE masks along with the root page level (which is currently
hardcoded to '4').
Don't reuse KVM's insane overloading of the USER bit for EPT_R as there's
no reason to multiplex bits in the selftests, e.g. selftests aren't trying
to shadow guest PTEs and thus don't care about funnelling protections into
a common permissions check.
Another benefit of reusing the code is having separate handling for
upper-level PTEs vs 4K PTEs, which avoids some quirks like setting the
large bit on a 4K PTE in the EPTs.
For all intents and purposes, no functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20251230230150.4150236-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a stage-2 MMU instance so that architectures that support nested
virtualization (more specifically, nested stage-2 page tables) can create
and track stage-2 page tables for running L2 guests. Plumb the structure
into common code to avoid cyclical dependencies, and to provide some line
of sight to having common APIs for creating stage-2 mappings.
As a bonus, putting the member in common code justifies using stage2_mmu
instead of tdp_mmu for x86.
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251230230150.4150236-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
The root GPA is now retrieved from the nested MMU, stop passing VMX
metadata. This is in preparation for making these functions work for
NPTs as well.
Opportunistically drop tdp_pg_map() since it's unused.
No functional change intended.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251230230150.4150236-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
prepare_eptp() currently allocates new EPTs for each vCPU. memstress has
its own hack to share the EPTs between vCPUs. Currently, there is no
reason to have separate EPTs for each vCPU, and the complexity is
significant. The only reason it doesn't matter now is because memstress
is the only user with multiple vCPUs.
Add vm_enable_ept() to allocate EPT page tables for an entire VM, and use
it everywhere to replace prepare_eptp(). Drop 'eptp' and 'eptp_hva' from
'struct vmx_pages' as they serve no purpose (e.g. the EPTP can be built
from the PGD), but keep 'eptp_gpa' so that the MMU structure doesn't need
to be passed in along with vmx_pages. Dynamically allocate the TDP MMU
structure to avoid a cyclical dependency between kvm_util_arch.h and
kvm_util.h.
Remove the workaround in memstress to copy the EPT root between vCPUs
since that's now the default behavior.
Name the MMU tdp_mmu instead of e.g. nested_mmu or nested.mmu to avoid
recreating the same mess that KVM has with respect to "nested" MMUs, e.g.
does nested refer to the stage-2 page tables created by L1, or the stage-1
page tables created by L2?
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20251230230150.4150236-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move the PTE bitmasks into kvm_mmu to parameterize them for virt mapping
functions. Introduce helpers to read/write different PTE bits given a
kvm_mmu.
Drop the 'global' bit definition as it's currently unused, but leave the
'user' bit as it will be used in coming changes. Opportunisitcally
rename 'large' to 'huge' as it's more consistent with the kernel naming.
Leave PHYSICAL_PAGE_MASK alone, it's fixed in all page table formats and
a lot of other macros depend on it. It's tempting to move all the other
macros to be per-struct instead, but it would be too much noise for
little benefit.
Keep c_bit and s_bit in vm->arch as they used before the MMU is
initialized, through __vmcreate() -> vm_userspace_mem_region_add() ->
vm_mem_add() -> vm_arch_has_protected_memory().
No functional change intended.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
[sean: rename accessors to is_<adjective>_pte()]
Link: https://patch.msgid.link/20251230230150.4150236-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add an arch structure+field in "struct kvm_mmu" so that architectures can
track arch-specific information for a given MMU.
No functional change intended.
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251230230150.4150236-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
In preparation for generalizing the x86 virt mapping APIs to work with
TDP (stage-2) page tables, plumb "struct kvm_mmu" into all of the helper
functions instead of operating on vm->mmu directly.
Opportunistically swap the order of the check in virt_get_pte() to first
assert that the parent is the PGD, and then check that the PTE is present,
as it makes more sense to check if the parent PTE is the PGD/root (i.e.
not a PTE) before checking that the PTE is PRESENT.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
[sean: rebase on common kvm_mmu structure, rewrite changelog]
Link: https://patch.msgid.link/20251230230150.4150236-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a "struct kvm_mmu" to track a given MMU instance, e.g. a VM's stage-1
MMU versus a VM's stage-2 MMU, so that x86 can share MMU functionality for
both stage-1 and stage-2 MMUs, without creating the potential for subtle
bugs, e.g. due to consuming on vm->pgtable_levels when operating a stage-2
MMU.
Encapsulate the existing de facto MMU in "struct kvm_vm", e.g instead of
burying the MMU details in "struct kvm_vm_arch", to avoid more #ifdefs in
____vm_create(), and in the hopes that other architectures can utilize the
formalized MMU structure if/when they too support stage-2 page tables.
No functional change intended.
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251230230150.4150236-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Stop setting Accessed/Dirty bits when creating EPT entries for L2 so that
the stage-1 and stage-2 (a.k.a. TDP) page table APIs can use common code
without bleeding the EPT hack into the common APIs.
While commit 094444204570 ("selftests: kvm: add test for dirty logging
inside nested guests") is _very_ light on details, the most likely
explanation is that vmx_dirty_log_test was attempting to avoid taking an
EPT Violation on the first _write_ from L2.
static void l2_guest_code(u64 *a, u64 *b)
{
READ_ONCE(*a);
WRITE_ONCE(*a, 1); <===
GUEST_SYNC(true);
...
}
When handling read faults in the shadow MMU, KVM opportunistically creates
a writable SPTE if the mapping can be writable *and* the gPTE is dirty (or
doesn't support the Dirty bit), i.e. if KVM doesn't need to intercept
writes in order to emulate Dirty-bit updates. By setting A/D bits in the
test's EPT entries, the above READ+WRITE will fault only on the read, and
in theory expose the bug fixed by KVM commit 1f4e5fc83a42 ("KVM: x86: fix
nested guest live migration with PML"). If the Dirty bit is NOT set, the
test will get a false pass due; though again, in theory.
However, the test is flawed (and always was, at least in the versions
posted publicly), as KVM (correctly) marks the corresponding L1 GFN as
dirty (in the dirty bitmap) when creating the writable SPTE. I.e. without
a check on the dirty bitmap after the READ_ONCE(), the check after the
first WRITE_ONCE() will get a false pass due to the dirty bitmap/log having
been updated by the read fault, not by PML.
Furthermore, the subsequent behavior in the test's l2_guest_code()
effectively hides the flawed test behavior, as the straight writes to a
new L2 GPA fault also trigger the KVM bug, and so the test will still
detect the failure due to lack of isolation between the two testcases
(Read=>Write vs. Write=>Write).
WRITE_ONCE(*b, 1);
GUEST_SYNC(true);
WRITE_ONCE(*b, 1);
GUEST_SYNC(true);
GUEST_SYNC(false);
Punt on fixing vmx_dirty_log_test for the moment as it will be easier to
properly fix the test once the TDP code uses the common MMU APIs, at which
point it will be trivially easy for the test to retrieve the EPT PTE and
set the Dirty bit as needed.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
[sean: rewrite changelog to explain the situation]
Link: https://patch.msgid.link/20251230230150.4150236-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Replace the struct overlay with explicit bitmasks, which is clearer and
less error-prone. See commit f18b4aebe107 ("kvm: selftests: do not use
bitfields larger than 32-bits for PTEs") for an example of why bitfields
are not preferable.
Remove the unused PAGE_SHIFT_4K definition while at it.
No functional change intended.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251230230150.4150236-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Rename the functions from nested_* to tdp_* to make their purpose
clearer.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251230230150.4150236-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|