linux/virt, branch v6.14

KVM: remove kvm_arch_post_init_vm

2025-02-04T16:27:45Z

The only statement in a kvm_arch_post_init_vm implementation can be moved into the x86 kvm_arch_init_vm. Do so and remove all traces from architecture-independent code.

Signed-off-by: Paolo Bonzini

KVM: Do not restrict the size of KVM-internal memory regions

2025-01-31T11:03:52Z

Exempt KVM-internal memslots from the KVM_MEM_MAX_NR_PAGES restriction, as the limit on the number of pages exists purely to play nice with dirty bitmap operations, which use 32-bit values to index the bitmaps, and dirty logging isn't supported for KVM-internal memslots.

Link: https://lore.kernel.org/all/20240802205003.353672-6-seanjc@google.com
Signed-off-by: Sean Christopherson
Reviewed-by: Christoph Schlameuss
Reviewed-by: David Hildenbrand
Link: https://lore.kernel.org/r/20250123144627.312456-2-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda
Message-ID: <20250123144627.312456-2-imbrenda@linux.ibm.com>
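The rule described above can be sketched as follows. This is a minimal illustration, not the kernel's code: the helper name memslot_size_ok(), the is_internal parameter, and the MAX_NR_PAGES value are all assumptions chosen only to show a cap that fits a 32-bit bitmap index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative cap only; the kernel defines its own KVM_MEM_MAX_NR_PAGES. */
#define MAX_NR_PAGES ((UINT64_C(1) << 31) - 1)

/* The cap exists so that a page index fits in a signed 32-bit value. */
static_assert(MAX_NR_PAGES <= INT32_MAX, "page index must fit in 32 bits");

/*
 * Hypothetical helper: may a memslot with npages pages be created?
 * Internal slots never get a dirty bitmap, so the cap is skipped for them.
 */
static bool memslot_size_ok(uint64_t npages, bool is_internal)
{
	if (is_internal)
		return true;

	return npages <= MAX_NR_PAGES;
}
```

A userspace slot one page over the cap is rejected, while an internal slot of the same size is accepted.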

Merge branch 'kvm-mirror-page-tables' into HEAD

2025-01-20T12:15:58Z

As part of enabling TDX virtual machines, support separation of private/shared EPT into separate roots.

Confidential computing solutions almost invariably have concepts of private and shared memory, but they may differ a lot in the details. In SEV, for example, the bit is handled more like a permission bit as far as the page tables are concerned: the private/shared bit is not included in the physical address.

For TDX, instead, the bit is more like a physical address bit, with the host mapping private memory in one half of the address space and shared in another. Furthermore, the two halves are mapped by different EPT roots and only the shared half is managed by KVM; the private half (also called Secure EPT in Intel documentation) gets managed by the privileged TDX Module via SEAMCALLs.

As a result, the operations that actually change the private half of the EPT are limited and relatively slow compared to reading a PTE. For this reason the design for KVM is to keep a mirror of the private EPT in host memory. This allows KVM to quickly walk the EPT and only perform the slower private EPT operations when it needs to actually modify mid-level private PTEs.

There are thus three sets of EPT page tables: external, mirror and direct. In the case of TDX (the only user of this framework) the first two cover private memory, whereas the third manages shared memory:

  external EPT - Hidden within the TDX module, modified via TDX module calls.

  mirror EPT - Bookkeeping tree used as an optimization by KVM, not used by the processor.

  direct EPT - Normal EPT that maps unencrypted shared memory. Managed like the EPT of a normal VM.

Modifying external EPT
----------------------

Modifications to the mirrored page tables need to also perform the same operations to the private page tables, which will be handled via kvm_x86_ops. Although this prep series does not interact with the TDX module at all to actually configure the private EPT, it does lay the groundwork for doing this.

In some ways updating the private EPT is as simple as plumbing PTE modifications through to also call into the TDX module; however, the locking is more complicated because inserting a single PTE can no longer be done atomically with a single CMPXCHG. For this reason, the existing FROZEN_SPTE mechanism is used whenever a call to the TDX module updates the private EPT. FROZEN_SPTE acts basically as a spinlock on a PTE. Besides protecting operation of KVM, it limits the set of cases in which the TDX module will encounter contention on its own PTE locks.

Zapping external EPT
--------------------

While the framework tries to be relatively generic, and to be understandable without knowing TDX in much detail, some requirements of TDX sometimes leak; for example the private page tables also cannot be zapped while the range has anything mapped, so the mirrored/private page tables need to be protected from KVM operations that zap any non-leaf PTEs, for example kvm_mmu_reset_context() or kvm_mmu_zap_all_fast().

For normal VMs, guest memory is zapped for several reasons: user memory getting paged out by the host, memslots getting deleted, passthrough of devices with non-coherent DMA. Confidential computing adds to these the conversion of memory between shared and private. These operations must not zap any private memory that is in use by the guest.

This is possible because the only zapping that is out of the control of KVM/userspace is paging out userspace memory, which cannot apply to guest_memfd operations. Thus a TDX VM will only zap private memory from memslot deletion and from conversion between private and shared memory which is triggered by the guest.

To avoid zapping too much memory, enums are introduced so that operations can choose to target only private or shared memory, and thus only direct or mirror EPT. For example:

  Memslot deletion           - Private and shared
  MMU notifier based zapping - Shared only
  Conversion to shared       - Private only
  Conversion to private      - Shared only

Other cases of zapping will not be supported for KVM, for example APICv update or non-coherent DMA status update; for the latter, TDX will simply require that the CPU supports self-snoop and honor guest PAT unconditionally for shared memory.
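The freeze-then-update idea described under "Modifying external EPT" can be sketched in userspace C11 as below. This is only an illustration under stated assumptions: FROZEN_PTE, set_mirror_pte(), external_ok() and external_fail() are hypothetical names, the sentinel value is arbitrary, and the callback merely stands in for the slow TDX module call. It is not how KVM encodes or handles FROZEN_SPTE.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sentinel; the kernel's FROZEN_SPTE encoding differs. */
#define FROZEN_PTE UINT64_C(0x5a0)

/* Stand-in for the slow external update (a TDX module call in the text). */
typedef bool (*external_update_fn)(uint64_t new_pte);

static bool external_ok(uint64_t new_pte)   { (void)new_pte; return true; }
static bool external_fail(uint64_t new_pte) { (void)new_pte; return false; }

/*
 * Install new_pte in the mirror entry and propagate it externally.
 * Returns false if the entry was changed or frozen by someone else,
 * or if the external update failed (the old value is then restored).
 */
static bool set_mirror_pte(_Atomic uint64_t *ptep, uint64_t old_pte,
			   uint64_t new_pte, external_update_fn update)
{
	uint64_t expected = old_pte;

	/* A frozen entry belongs to another updater; the caller must retry. */
	if (old_pte == FROZEN_PTE)
		return false;

	/* Step 1: freeze the entry with a single compare-and-exchange. */
	if (!atomic_compare_exchange_strong(ptep, &expected, FROZEN_PTE))
		return false;

	/* Step 2: slow external update, performed while the entry is frozen. */
	if (!update(new_pte)) {
		atomic_store(ptep, old_pte);
		return false;
	}

	/* Step 3: unfreeze by publishing the final value. */
	atomic_store(ptep, new_pte);
	return true;
}
```

The frozen value plays the role of a per-entry lock: any concurrent updater fails its compare-and-exchange and retries, so the external update never races with another change to the same entry.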

Merge tag 'kvm-x86-vcpu_array-6.14' of https://github.com/kvm-x86/linux into HEAD

2025-01-20T11:36:40Z

KVM vcpu_array fixes and cleanups for 6.14:

 - Explicitly verify the target vCPU is online in kvm_get_vcpu() to fix a bug where KVM would return a pointer to a vCPU prior to it being fully online, and give kvm_for_each_vcpu() similar treatment to fix a similar flaw.

 - Wait for a vCPU to come online prior to executing a vCPU ioctl to fix a bug where userspace could coerce KVM into handling the ioctl on a vCPU that isn't yet onlined.

 - Gracefully handle xa_insert() failures even though such failures should be impossible in practice.
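The first fix can be pictured with a small userspace sketch. The types and names here (struct vm, struct vcpu, get_vcpu(), MAX_VCPUS) are hypothetical stand-ins, and a plain array replaces the kernel's xarray; the only point is that a lookup refuses indices at or beyond the online count even when the slot is already populated.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_VCPUS 4

struct vcpu {
	int id;
};

struct vm {
	struct vcpu *vcpus[MAX_VCPUS];	/* may hold entries not yet online */
	atomic_int online_vcpus;	/* bumped last, once a vCPU is usable */
};

/* Hypothetical lookup: only hand out vCPUs that are fully online. */
static struct vcpu *get_vcpu(struct vm *vm, int i)
{
	int online = atomic_load_explicit(&vm->online_vcpus,
					  memory_order_acquire);

	if (i < 0 || i >= online)
		return NULL;

	return vm->vcpus[i];
}
```

A vCPU that has been inserted into the array but not yet counted as online is invisible to callers until the counter is incremented.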

KVM: Disallow all flags for KVM-internal memslots

2025-01-15T01:36:16Z

Disallow all flags for KVM-internal memslots as all existing flags require some amount of userspace interaction to have any meaning. In addition to guarding against KVM goofs, explicitly disallowing dirty logging of KVM-internal memslots will (hopefully) allow exempting KVM-internal memslots from the KVM_MEM_MAX_NR_PAGES limit, which appears to exist purely because the dirty bitmap operations use a 32-bit index.

Cc: Xiaoyao Li
Cc: Claudio Imbrenda
Cc: Christian Borntraeger
Reviewed-by: Xiaoyao Li
Reviewed-by: Claudio Imbrenda
Acked-by: Christoph Schlameuss
Link: https://lore.kernel.org/r/20250111002022.1230573-6-seanjc@google.com
Signed-off-by: Sean Christopherson
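As a rough sketch of the check described above (the function name and flag macros are hypothetical, and the bit values are illustrative rather than taken from the UAPI headers), rejecting every flag reduces to a single test against zero:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Illustrative flag bits; the real KVM_MEM_* values live in the UAPI headers. */
#define MEM_LOG_DIRTY_PAGES	(UINT32_C(1) << 0)
#define MEM_READONLY		(UINT32_C(1) << 1)

/* Hypothetical validation step for an internal memslot request. */
static int check_internal_memslot_flags(uint32_t flags)
{
	/* No flag is meaningful without userspace involvement. */
	return flags ? -EINVAL : 0;
}
```

Testing against zero rather than against a list of known flags means a flag added later is rejected by default.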

KVM: x86: Drop double-underscores from __kvm_set_memory_region()

2025-01-15T01:36:16Z

Now that there's no outer wrapper for __kvm_set_memory_region() and it's static, drop its double-underscore prefix.

No functional change intended.

Cc: Tao Su
Reviewed-by: Xiaoyao Li
Reviewed-by: Claudio Imbrenda
Acked-by: Christoph Schlameuss
Link: https://lore.kernel.org/r/20250111002022.1230573-5-seanjc@google.com
Signed-off-by: Sean Christopherson

KVM: Add a dedicated API for setting KVM-internal memslots

2025-01-15T01:36:15Z

Add a dedicated API for setting internal memslots, and have it explicitly disallow setting userspace memslots. Setting a userspace memslot without a direct command from userspace would result in all manner of issues.

No functional change intended.

Cc: Tao Su
Cc: Claudio Imbrenda
Cc: Christian Borntraeger
Reviewed-by: Xiaoyao Li
Reviewed-by: Claudio Imbrenda
Acked-by: Christoph Schlameuss
Link: https://lore.kernel.org/r/20250111002022.1230573-4-seanjc@google.com
Signed-off-by: Sean Christopherson
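One way to picture an internal-only entry point is sketched below. It assumes, purely for illustration, that userspace owns the low slot ids and internal slots sit above them; NR_USER_SLOTS, struct memslot_request and set_internal_memslot() are hypothetical names, not the kernel's.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Illustrative split of the slot-id space; not the kernel's actual value. */
#define NR_USER_SLOTS 32u

struct memslot_request {
	uint32_t slot;
	uint64_t npages;
};

/* Hypothetical internal-only entry point: refuses ids owned by userspace. */
static int set_internal_memslot(const struct memslot_request *req)
{
	if (req->slot < NR_USER_SLOTS)
		return -EINVAL;

	/* ... the slot would be installed or deleted here ... */
	return 0;
}
```

Keeping the check inside the dedicated entry point means no in-kernel caller can touch a userspace slot by accident.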

KVM: Assert slots_lock is held when setting memory regions

2025-01-15T01:36:15Z

Add proper lockdep assertions in __kvm_set_memory_region() and __x86_set_memory_region() instead of relying on comments.

Opportunistically delete __kvm_set_memory_region()'s entire function comment as the API doesn't allocate memory or select a gfn, and the "mostly for framebuffers" comment hasn't been true for a very long time.

Cc: Tao Su
Reviewed-by: Xiaoyao Li
Reviewed-by: Claudio Imbrenda
Acked-by: Christoph Schlameuss
Link: https://lore.kernel.org/r/20250111002022.1230573-3-seanjc@google.com
Signed-off-by: Sean Christopherson

KVM: Open code kvm_set_memory_region() into its sole caller (ioctl() API)

2025-01-15T01:36:15Z

Open code kvm_set_memory_region() into its sole caller in preparation for adding a dedicated API for setting internal memslots.

Opportunistically use the fancy new guard(mutex) to avoid a local 'r' variable.

Cc: Tao Su
Reviewed-by: Xiaoyao Li
Reviewed-by: Claudio Imbrenda
Acked-by: Christoph Schlameuss
Link: https://lore.kernel.org/r/20250111002022.1230573-2-seanjc@google.com
Signed-off-by: Sean Christopherson
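The benefit of a scope-based guard can be shown with a userspace analogue built on the GCC/Clang cleanup attribute. Everything here is an assumption made for illustration: LOCK_GUARD(), lock_acquire(), lock_release() and add_slot() are hypothetical, and a toy atomic_flag spinlock stands in for the kernel mutex. The point is only that each return path drops the lock without a local result variable or an unlock label.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Toy spinlock standing in for a kernel mutex. */
static atomic_flag slots_lock = ATOMIC_FLAG_INIT;
static int nr_slots;

static atomic_flag *lock_acquire(atomic_flag *lock)
{
	while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
		;	/* spin until the flag was previously clear */
	return lock;
}

/* Cleanup handlers receive a pointer to the guarded variable. */
static void lock_release(atomic_flag **lock)
{
	atomic_flag_clear_explicit(*lock, memory_order_release);
}

/* Hypothetical analogue of guard(mutex)(&lock): unlock at scope exit. */
#define LOCK_GUARD(lock) \
	atomic_flag *guard_ __attribute__((cleanup(lock_release), unused)) = \
		lock_acquire(lock)

static int add_slot(int max_slots)
{
	LOCK_GUARD(&slots_lock);

	if (nr_slots >= max_slots)
		return -ENOSPC;	/* lock dropped automatically */

	nr_slots++;
	return 0;		/* and here as well */
}
```

Without the guard, the same function would need a result variable and an explicit unlock before each return.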

KVM: Add member to struct kvm_gfn_range to indicate private/shared

2024-12-23T13:28:55Z

Add a new member to struct kvm_gfn_range to indicate which mapping (private-vs-shared) to operate on: enum kvm_gfn_range_filter attr_filter. Update the core zapping operations to set it appropriately.

TDX utilizes two GPA aliases for the same memslots, one for private memory and one for shared memory. For private memory, KVM cannot always perform the same operations it does on memory for default VMs, such as zapping pages and having them be faulted back in, as this requires guest coordination. However, some operations such as guest driven conversion of memory between private and shared should zap private memory.

Internally to the MMU, private and shared mappings are tracked on separate roots. Mapping and zapping operations will operate on the respective GFN alias for each root (private or shared). So zapping operations will by default zap both aliases. Add fields in struct kvm_gfn_range to allow callers to specify which aliases to target, so that they only touch the aliases appropriate for their specific operation.

There was feedback that target aliases should be specified such that the default value (0) is to operate on both aliases. Several options were considered. Several variations of having separate bools, defined such that the default behavior was to process both aliases, either allowed nonsensical configurations or were confusing for the caller. A simple enum was also explored and was close, but was hard to process in the caller. Instead, use an enum with the default value (0) reserved as a disallowed value. Catch ranges that didn't have the target aliases specified by looking for that specific value.

Set target alias with enum appropriately for these MMU operations:
 - For KVM's mmu notifier callbacks, zap shared pages only because private pages won't have a userspace mapping
 - For setting memory attributes, kvm_arch_pre_set_memory_attributes() chooses the aliases based on the attribute.
 - For guest_memfd invalidations, zap private only.

Link: https://lore.kernel.org/kvm/ZivIF9vjKcuGie3s@google.com/
Signed-off-by: Isaku Yamahata
Co-developed-by: Rick Edgecombe
Signed-off-by: Rick Edgecombe
Message-ID: <20240718211230.1492011-3-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini
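The "zero is disallowed" enum design can be sketched as below. The names (enum gfn_range_filter, FILTER_SHARED, FILTER_PRIVATE, struct gfn_range, range_targets()) are simplified, hypothetical stand-ins for the kernel's definitions; the sketch only shows how a zero-initialized range is caught while bit flags let a caller name one alias or both.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit flags so that a caller can name one alias or both; 0 is invalid. */
enum gfn_range_filter {
	FILTER_SHARED  = 1 << 0,
	FILTER_PRIVATE = 1 << 1,
};

struct gfn_range {
	unsigned long start, end;
	enum gfn_range_filter attr_filter;
};

/*
 * Hypothetical consumer: report which aliases the range targets.
 * Returns false when the caller left attr_filter at its zero default.
 */
static bool range_targets(const struct gfn_range *r, bool *shared, bool *priv)
{
	if (r->attr_filter == 0)
		return false;

	*shared = (r->attr_filter & FILTER_SHARED) != 0;
	*priv = (r->attr_filter & FILTER_PRIVATE) != 0;
	return true;
}
```

Because a zero-initialized struct is the common mistake, reserving 0 turns a forgotten field into a detectable error instead of silently zapping both aliases or neither.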