|
Move PMU_CAP_{FW_WRITES,LBR_FMT} into msr-index.h and rename them with a
PERF_CAP prefix to stay consistent with the other perf capabilities macros.
No functional change intended.
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-24-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Rename the two helpers vmx_vmentry/vmexit_ctrl() to
vmx_get_initial_vmentry/vmexit_ctrl() to better reflect what they actually
return.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-23-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Take a snapshot of the unadulterated PMU capabilities provided by perf so
that KVM can compare guest vPMU capabilities against hardware capabilities
when determining whether or not to intercept PMU MSRs (and RDPMC).
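As a rough sketch (the variable name and placement here are assumptions,
not taken from the patch):

  /* Sketch: snapshot perf's raw capabilities before KVM clamps its own
   * working copy, so later code can compare guest vs. hardware caps. */
  static struct x86_pmu_capability kvm_host_pmu;

  void kvm_init_pmu_capability(const struct kvm_pmu_ops *pmu_ops)
  {
          perf_get_x86_pmu_capability(&kvm_host_pmu);  /* unadulterated */
          /* ... derive/clamp kvm_pmu_cap from the snapshot as before ... */
  }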
Reviewed-by: Sandipan Das <sandipan.das@amd.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Gate access to PMC MSRs based on pmu->version, not on kvm->arch.enable_pmu,
to more accurately reflect KVM's behavior. This is a glorified nop, as
pmu->version and pmu->nr_arch_gp_counters can only be non-zero if
amd_pmu_refresh() is reached, kvm_pmu_refresh() invokes amd_pmu_refresh()
if and only if kvm->arch.enable_pmu is true, and amd_pmu_refresh() forces
pmu->version to be 1 or 2.
I.e. the following holds true:
!pmu->nr_arch_gp_counters || kvm->arch.enable_pmu == (pmu->version > 0)
and so the only way for amd_pmu_get_pmc() to return a non-NULL value is if
both kvm->arch.enable_pmu and pmu->version evaluate to true.
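In code form, the change amounts to something like the following sketch
(the helper shown is illustrative, not the literal diff):

  static struct kvm_pmc *amd_pmu_get_pmc(struct kvm_pmu *pmu, int pmc_idx)
  {
          /* Was: a check on pmu_to_vcpu(pmu)->kvm->arch.enable_pmu. */
          if (!pmu->version)      /* non-zero iff amd_pmu_refresh() ran */
                  return NULL;

          if (pmc_idx >= pmu->nr_arch_gp_counters)
                  return NULL;

          return &pmu->gp_counters[pmc_idx];
  }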
No real functional change intended.
Reviewed-by: Sandipan Das <sandipan.das@amd.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Setup the golden VMCS config during vmx_init(), before the call to
kvm_x86_vendor_init(), instead of waiting until the callback to do
hardware setup. setup_vmcs_config() only touches VMX state, i.e. doesn't
poke anything in kvm.ko, and has no runtime dependencies beyond
hv_init_evmcs().
Setting the VMCS config early on will allow referencing VMCS and VMX
capabilities at any point during setup, e.g. to check for PERF_GLOBAL_CTRL
save/load support during mediated PMU initialization.
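The resulting init order looks roughly like this (error handling trimmed;
a sketch, not the literal diff):

  static int __init vmx_init(void)
  {
          int r;

          r = hv_init_evmcs();            /* the only runtime dependency */
          if (r)
                  return r;

          /* Golden VMCS config, before kvm_x86_vendor_init(). */
          r = setup_vmcs_config(&vmcs_config, &vmx_capability);
          if (r)
                  return r;

          return kvm_x86_vendor_init(&vmx_init_ops);
  }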
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Explicitly document that the behavior of KVM_SET_PIT2 strictly conforms
to the Intel 8254 PIT hardware specification, specifically that a write of
'0' adheres to the spec's definition that a programmed count of '0' is
converted to the maximum possible value (2^16). E.g. an unaware userspace
might attempt to validate that KVM_GET_PIT2 returns the exact state set
via KVM_SET_PIT2, and be surprised when the returned count is 65536, not 0.
Add a reference to the Intel 8254 PIT datasheet that will hopefully stay
fresh for some time (the internet isn't exactly brimming with copies of
the 8254 datasheet).
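For example, an unaware userspace doing a read-back check would observe
the following (a sketch, assuming an in-kernel PIT exists for the VM):

  struct kvm_pit_state2 state = {};

  ioctl(vm_fd, KVM_GET_PIT2, &state);
  state.channels[0].count = 0;            /* programmed count of '0' */
  ioctl(vm_fd, KVM_SET_PIT2, &state);

  ioctl(vm_fd, KVM_GET_PIT2, &state);
  /* state.channels[0].count is now 65536 (2^16), not 0 */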
Link: https://lore.kernel.org/all/CANypQFbEySjKOFLqtFFf2vrEe=NBr7XJfbkjQhqXuZGg7Rpoxw@mail.gmail.com
Signed-off-by: Jiaming Zhang <r772577952@gmail.com>
Link: https://lore.kernel.org/r/20250905174736.260694-1-r772577952@gmail.com
[sean: add context Link, drop local APIC change, massage changelog accordingly]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Use guard(mutex) instead of mutex_lock/mutex_unlock pair to simplify the
error handling when setting up the TSC page for a Hyper-V guest.
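The pattern, roughly (a sketch; guard() comes from <linux/cleanup.h>):

  guard(mutex)(&hv->hv_lock);     /* auto-unlocked on every return path */

  if (hv->hv_tsc_page_status == HV_TSC_PAGE_BROKEN)
          return;                 /* no explicit mutex_unlock() needed */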
No functional change intended.
Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com>
Link: https://lore.kernel.org/r/20250901131604.646415-1-liaoyuanhong@vivo.com
[sean: tweak changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Use guard(mutex) instead of mutex_lock/mutex_unlock pair to simplify the
error handling when allocating the APIC access page.
No functional change intended.
Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com>
Link: https://lore.kernel.org/r/20250901131822.647802-1-liaoyuanhong@vivo.com
[sean: add blank line to isolate guard(), tweak changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Fix typos. "_COUTNERS" -> "_COUNTERS".
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Yi Lai <yi1.lai@intel.com>
Link: https://lore.kernel.org/r/20250718001905.196989-2-dapeng1.mi@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Reject KVM_CREATE_IRQCHIP if the VM type has protected EOIs, i.e. if KVM
can't intercept EOI and thus can't faithfully emulate level-triggered
interrupts that are routed through the I/O APIC. For TDX VMs, the
TDX-Module owns the VMX EOI-bitmap and configures all IRQ vectors to have
the CPU accelerate EOIs, i.e. doesn't allow KVM to intercept any EOIs.
KVM already requires a split irqchip[1], but does so during vCPU creation,
which is both too late to allow userspace to fall back to a split irqchip
and a less-than-stellar experience for userspace since an -EINVAL on
KVM_CREATE_VCPU is far harder to debug/triage than failure exactly on
KVM_CREATE_IRQCHIP. And of course, allowing an action that ultimately
fails is arguably a bug regardless of the impact on userspace.
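I.e., roughly (a sketch of the ioctl-handler check; exact placement may
differ):

  /* kvm_arch_vm_ioctl(), sketch */
  case KVM_CREATE_IRQCHIP:
          r = -EINVAL;
          if (kvm->arch.has_protected_eoi)        /* e.g. TDX VMs */
                  break;
          /* ... existing in-kernel irqchip creation ... */
          break;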
Link: https://lore.kernel.org/lkml/20250222014757.897978-11-binbin.wu@linux.intel.com [1]
Link: https://lore.kernel.org/lkml/aK3vZ5HuKKeFuuM4@google.com
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sagi Shahar <sagis@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250827011726.2451115-1-sagis@google.com
[sean: massage shortlog+changelog, relocate setting has_protected_eoi]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move the vector_hashing module param into lapic.c now that all usage is
contained within the local APIC emulation code.
Opportunistically drop the accessor and append "_enabled" to the variable
to help capture that it's a boolean module param.
No functional change intended.
Link: https://lore.kernel.org/r/20250821214209.3463350-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Make various helpers for resolving lowest priority IRQs local to lapic.c
now that kvm_irq_delivery_to_apic() lives in lapic.c as well.
No functional change intended.
Link: https://lore.kernel.org/r/20250821214209.3463350-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move kvm_irq_delivery_to_apic() to lapic.c as it is specific to local APIC
emulation. This will allow burying more local APIC code in lapic.c, e.g.
the various "lowest priority" helpers.
No functional change intended.
Link: https://lore.kernel.org/r/20250821214209.3463350-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Tweak the code a bit to facilitate resetting more xstate components in
the future, e.g., CET's xstate-managed MSRs.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://lore.kernel.org/r/20250812025606.74625-6-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Don't manually clear/zero MPX state on RESET, as the guest FPU state is
zero allocated and KVM only does RESET during vCPU creation, i.e. the
relevant state is guaranteed to be all zeroes.
Opportunistically move the relevant code into a helper in anticipation of
adding support for CET shadow stacks, which also has state that is zeroed
on INIT.
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://lore.kernel.org/r/20250812025606.74625-5-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Wrap __kvm_{get,set}_msr() into two new helpers for KVM usage and use the
helpers to replace existing usage of the raw functions.
kvm_msr_{read,write}() are KVM-internal helpers, i.e. used when KVM needs
to get/set an MSR value for emulating CPU behavior, i.e., host_initiated ==
%true in the helpers.
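Conceptually, the wrappers reduce to (a sketch; exact signatures assumed):

  int kvm_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data)
  {
          return __kvm_get_msr(vcpu, index, data, true);  /* host_initiated */
  }

  int kvm_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data)
  {
          return __kvm_set_msr(vcpu, index, data, true);  /* host_initiated */
  }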
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://lore.kernel.org/r/20250812025606.74625-4-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Use the double-underscore helpers for emulating MSR reads and writes in
the no-underscore versions to better capture the relationship between the
two sets of APIs (the double-underscore versions don't honor userspace MSR
filters).
No functional change intended.
Signed-off-by: Chao Gao <chao.gao@intel.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://lore.kernel.org/r/20250812025606.74625-3-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Rename
kvm_{g,s}et_msr_with_filter()
kvm_{g,s}et_msr()
to
kvm_emulate_msr_{read,write}
__kvm_emulate_msr_{read,write}
to make it more obvious that KVM uses these helpers to emulate guest
behaviors, i.e., host_initiated == false in these helpers.
Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://lore.kernel.org/r/20250812025606.74625-2-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Advertise support for the immediate form of MSR instructions to userspace
if the instructions are supported by the underlying CPU, and KVM is using
VMX, i.e. is running on an Intel-compatible CPU.
For SVM, explicitly clear X86_FEATURE_MSR_IMM to ensure KVM doesn't over-
report support if AMD-compatible CPUs ever implement the immediate forms,
as SVM will likely require explicit enablement in KVM.
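In terms of KVM's CPU-cap helpers, that is roughly (a sketch):

  /* vmx.c: advertise iff the underlying CPU enumerates the feature */
  kvm_cpu_cap_check_and_set(X86_FEATURE_MSR_IMM);

  /* svm.c: never advertise, pending explicit SVM enablement in KVM */
  kvm_cpu_cap_clear(X86_FEATURE_MSR_IMM);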
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
[sean: massage changelog]
Link: https://lore.kernel.org/r/20250805202224.1475590-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add support for handling "WRMSRNS with an immediate" VM-Exits in KVM's
fastpath. On Intel, all writes to the x2APIC ICR and to the TSC Deadline
MSR are non-serializing, i.e. it's highly likely guest kernels will switch
to using WRMSRNS when possible. And in general, any MSR written via
WRMSRNS is probably worth handling in the fastpath, as the entire point of
WRMSRNS is to shave cycles in hot paths.
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
[sean: rewrite changelog, split rename to separate patch]
Link: https://lore.kernel.org/r/20250805202224.1475590-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add support for the immediate forms of RDMSR and WRMSRNS (currently
Intel-only). The immediate variants are only valid in 64-bit mode, and
use a single general purpose register for the data (the register is also
encoded in the instruction, i.e. not implicit like regular RDMSR/WRMSR).
The immediate variants are primarily motivated by performance, not code
size: by having the MSR index in an immediate, it is available *much*
earlier in the CPU pipeline, which allows hardware much more leeway about
how a particular MSR is handled.
Intel VMX support for the immediate forms of MSR accesses communicates
exit information to the host as follows:
1) The immediate form of RDMSR uses VM-Exit Reason 84.
2) The immediate form of WRMSRNS uses VM-Exit Reason 85.
3) For both VM-Exit reasons 84 and 85, the Exit Qualification field is
set to the MSR index that triggered the VM-Exit.
4) Bits 3 ~ 6 of the VM-Exit Instruction Information field are set to
the register encoding used by the immediate form of the instruction,
i.e. the destination register for RDMSR, and the source for WRMSRNS.
5) The VM-Exit Instruction Length field records the size of the
immediate form of the MSR instruction.
To deal with userspace RDMSR exits, stash the destination register in a
new kvm_vcpu_arch field, similar to cui_linear_rip, pio, etc.
Alternatively, the register could be saved in kvm_run.msr or re-retrieved
from the VMCS, but the former would require sanitizing the value to ensure
userspace doesn't clobber the value to an out-of-bounds index, and the
latter would require a new one-off kvm_x86_ops hook.
Don't bother adding support for the instructions in KVM's emulator, as the
only way for RDMSR/WRMSR to be encountered is if KVM is emulating large
swaths of code due to invalid guest state, and a vCPU cannot have invalid
guest state while in 64-bit mode.
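Putting points 1-5 together, decoding an immediate-form exit reduces to
roughly the following (a sketch; stashing the destination register in the
new kvm_vcpu_arch field is elided):

  u32 msr = vmx_get_exit_qual(vcpu);              /* MSR index, per (3) */
  u32 info = vmcs_read32(VMX_INSTRUCTION_INFO);
  int reg = (info >> 3) & 0xf;                    /* bits 3:6, per (4) */

  /* RDMSR-imm: 'reg' is the destination; WRMSRNS-imm: the source. */
  u64 data = kvm_register_read(vcpu, reg);        /* WRMSRNS-imm case */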
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
[sean: minor tweaks, massage and expand changelog]
Link: https://lore.kernel.org/r/20250805202224.1475590-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Rename the WRMSR fastpath API to drop "irqoff", as that information is
redundant (the fastpath always runs with IRQs disabled), and to prepare
for adding a fastpath for the immediate variant of WRMSRNS.
No functional change intended.
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
[sean: split to separate patch, write changelog]
Link: https://lore.kernel.org/r/20250805202224.1475590-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Rename "ecx" variables in {RD,WR}MSR and RDPMC helpers to "msr" and "pmc"
respectively, in anticipation of adding support for the immediate variants
of RDMSR and WRMSRNS, and to better document what the variables hold
(versus where the data originated).
No functional change intended.
Link: https://lore.kernel.org/r/20250805202224.1475590-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
The immediate form of MSR access instructions are primarily motivated
by performance, not code size: by having the MSR number in an immediate,
it is available *much* earlier in the pipeline, which allows the
hardware much more leeway about how a particular MSR is handled.
Use a scattered CPU feature bit for MSR immediate form instructions.
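A scattered bit is declared in arch/x86/kernel/cpu/scattered.c; the entry
would look something like this (the exact CPUID leaf/register/bit shown is
an assumption, not taken from the patch):

  static const struct cpuid_bit cpuid_bits[] = {
          /* ... */
          { X86_FEATURE_MSR_IMM, CPUID_ECX, 5, 0x00000007, 1 },
          /* ... */
  };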
Suggested-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Link: https://lore.kernel.org/r/20250805202224.1475590-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a fastpath handler for INVD so that the common fastpath logic can be
trivially tested on both Intel and AMD. Under KVM, INVD is always:
(a) intercepted, (b) available to the guest, and (c) emulated as a nop,
with no side effects. Combined with INVD not having any inputs or outputs,
i.e. no register constraints, INVD is the perfect instruction for
exercising KVM's fastpath as it can be inserted into practically any
guest-side code stream.
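The handler itself is nearly trivial, along these lines (a sketch):

  static fastpath_t handle_fastpath_invd(struct kvm_vcpu *vcpu)
  {
          /* INVD is emulated as a nop; only the skip itself can fail. */
          if (!kvm_skip_emulated_instruction(vcpu))
                  return EXIT_FASTPATH_NONE;

          return EXIT_FASTPATH_REENTER_GUEST;
  }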
Link: https://lore.kernel.org/r/20250805190526.1453366-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Acquire SRCU in the VM-Exit fastpath if and only if KVM needs to check the
PMU event filter, to further trim the amount of code that is executed with
SRCU protection in the fastpath. Counter-intuitively, holding SRCU can do
more harm than good due to masking potential bugs, and introducing a new
SRCU-protected asset to code reachable via kvm_skip_emulated_instruction()
would be quite notable, i.e. definitely worth auditing.
E.g. the primary user of kvm->srcu is KVM's memslots, accessing memslots
all but guarantees guest memory may be accessed, accessing guest memory
can fault, and page faults might sleep, which isn't allowed while IRQs are
disabled. Not acquiring SRCU means the (hypothetical) illegal sleep would
be flagged when running with PROVE_RCU=y, even if DEBUG_ATOMIC_SLEEP=n.
Note, performance is NOT a motivating factor, as SRCU lock/unlock only
adds ~15 cycles of latency to fastpath VM-Exits. I.e. overhead isn't a
concern _if_ SRCU protection needs to be extended beyond PMU events, e.g.
to honor userspace MSR filters.
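Conceptually, the fastpath turns into something like the following sketch
(both helper names here are hypothetical):

  if (kvm_pmu_needs_event_filter(vcpu)) {         /* hypothetical */
          int idx = srcu_read_lock(&vcpu->kvm->srcu);

          kvm_pmu_instruction_retired(vcpu);      /* consults the filter */
          srcu_read_unlock(&vcpu->kvm->srcu, idx);
  }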
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250805190526.1453366-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Rename check_pmu_event_filter() to make its polarity more obvious, and to
connect the dots to is_gp_event_allowed() and is_fixed_event_allowed().
No functional change intended.
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250805190526.1453366-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop the check on a PMC being locally enabled when triggering emulated
events, as the bitmap of passed-in PMCs only contains locally enabled PMCs.
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250805190526.1453366-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When triggering PMC events in response to emulation, drop the redundant
checks on a PMC being globally and locally enabled, as the passed in bitmap
contains only PMCs that are locally enabled (and counting the right event),
and the local copy of the bitmap has already been masked with global_ctrl.
No true functional change intended.
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250805190526.1453366-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Open code pmc_event_is_allowed() in its callers, as kvm_pmu_trigger_event()
only needs to check the event filter (both global and local enables are
consulted outside of the loop).
No functional change intended.
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250805190526.1453366-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Rename pmc_speculative_in_use() to pmc_is_locally_enabled() to better
capture what it actually tracks, and to show its relationship to
pmc_is_globally_enabled(). While neither AMD nor Intel refer to event
selectors or the fixed counter control MSR as "local", it's the obvious
name to pair with "global".
As for "speculative", there's absolutely nothing speculative about the
checks. E.g. for PMUs without PERF_GLOBAL_CTRL, from the guest's
perspective, the counters are "in use" without any qualifications.
No functional change intended.
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250805190526.1453366-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Calculate and track PMCs that are counting instructions/branches retired
when the PMC's event selector (or fixed counter control) is modified
instead of evaluating the event selector on-demand. Immediately recalc a
PMC's configuration on writes to avoid false negatives/positives when
KVM skips an emulated WRMSR, which is guaranteed to occur before the
main run loop processes KVM_REQ_PMU.
Out of an abundance of caution, and because it's relatively cheap, recalc
reprogrammed PMCs in kvm_pmu_handle_event() as well. Recalculating in
response to KVM_REQ_PMU _should_ be unnecessary, but for now be paranoid
to avoid introducing easily-avoidable bugs in edge cases. The code can be
removed in the future if necessary, e.g. in the unlikely event that the
overhead of recalculating to-be-emulated PMCs is noticeable.
Note! Deliberately don't check the PMU event filters, as doing so could
result in KVM consuming stale information.
Tracking which PMCs are counting branches/instructions will allow grabbing
SRCU in the fastpath VM-Exit handlers if and only if a PMC event might be
triggered (to consult the event filters), and will also allow the upcoming
mediated PMU to do the right thing with respect to counting instructions
(the mediated PMU won't be able to update PMCs in the VM-Exit fastpath).
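A sketch of the recalculation on event-selector writes follows; note that
pmc_is_locally_enabled() comes from an earlier patch in the series, while
the other names here are hypothetical:

  static void pmc_recalc_emulated_events(struct kvm_pmc *pmc)
  {
          struct kvm_pmu *pmu = pmc_to_pmu(pmc);

          /* Deliberately ignore the PMU event filters; they may be stale. */
          if (pmc_is_locally_enabled(pmc) &&
              eventsel_counts_instructions(pmc))          /* hypothetical */
                  set_bit(pmc->idx, pmu->pmc_counting_instructions);
          else
                  clear_bit(pmc->idx, pmu->pmc_counting_instructions);
  }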
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250805190526.1453366-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add wrappers for triggering instruction retired and branch retired PMU
events in anticipation of reworking the internal mechanisms to track
which PMCs need to be evaluated, e.g. to avoid having to walk and check
every PMC.
Opportunistically bury "struct kvm_pmu_emulated_event_selectors" in pmu.c.
No functional change intended.
Link: https://lore.kernel.org/r/20250805190526.1453366-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move kvm_init_pmu_capability() to pmu.c so that future changes can access
variables that have no business being visible outside of pmu.c.
kvm_init_pmu_capability() is called once per module load; there is zero
reason it needs to be inlined.
No functional change intended.
Cc: Dapeng Mi <dapeng1.mi@linux.intel.com>
Cc: Sandipan Das <sandipan.das@amd.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250805190526.1453366-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Fold the per-MSR WRMSR fastpath helpers into the main handler now that the
IPI path in particular is relatively tiny. In addition to eliminating a
decent amount of boilerplate, this removes the ugly -errno/1/0 => bool
conversion (which is "necessitated" by kvm_x2apic_icr_write_fast()).
Opportunistically drop the comment about IPIs, as the purpose of the
fastpath is hopefully self-evident, and _if_ it needs more documentation,
the documentation (and rules!) should be placed in a more central location.
No functional change intended.
Link: https://lore.kernel.org/r/20250805190526.1453366-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Always grab EDX:EAX in the WRMSR fastpath to deduplicate and simplify the
case statements, and to prepare for handling immediate variants of WRMSRNS
in the fastpath (the data register is explicitly provided in that case).
There's no harm in reading the registers, as their values are always
available, i.e. don't require VMREADs (or similarly slow operations).
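I.e. (a sketch):

  /* EDX:EAX is always available; reading GPRs doesn't require VMREADs. */
  u64 data = kvm_read_edx_eax(vcpu);

  switch (msr) {
  case APIC_BASE_MSR + (APIC_ICR >> 4):   /* x2APIC ICR */
  case MSR_IA32_TSC_DEADLINE:
          /* ... each case now consumes 'data' directly ... */
          break;
  }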
No real functional change intended.
Cc: Xin Li <xin@zytor.com>
Link: https://lore.kernel.org/r/20250805190526.1453366-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Acquire SRCU in the WRMSR fastpath if and only if an instruction needs to
be skipped, i.e. only if the fastpath succeeds. The reasoning in commit
3f2739bd1e0b ("KVM: x86: Acquire SRCU read lock when handling fastpath MSR
writes") about "avoid having to play whack-a-mole" seems sound, but in
hindsight unconditionally acquiring SRCU does more harm than good.
While acquiring/releasing SRCU isn't slow per se, the things that are
_protected_ by kvm->srcu are generally safe to access only in the "slow"
VM-Exit path. E.g. accessing memslots in generic helpers is never safe,
because accessing guest memory with IRQs disabled is always unsafe (except
when kvm_vcpu_read_guest_atomic() is used, but that API should never be
used in emulation helpers).
In other words, playing whack-a-mole is actually desirable in this case,
because every access to an asset protected by kvm->srcu warrants further
scrutiny.
Link: https://lore.kernel.org/r/20250805190526.1453366-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop the fastpath VM-Exit requirement that KVM can use the hypervisor
timer to emulate the APIC timer in TSC deadline mode. I.e. unconditionally
handle MSR_IA32_TSC_DEADLINE WRMSRs in the fastpath. Restricting the
fastpath to *maybe* using the VMX preemption timer is ineffective and
unnecessary.
If the requested deadline can't be programmed into the VMX preemption
timer, KVM will fall back to hrtimers, i.e. the restriction is ineffective
as far as preventing any kind of worst case scenario.
But guarding against a worst case scenario is completely unnecessary as
the "slow" path, start_sw_tscdeadline() => hrtimer_start(), explicitly
disables IRQs. In fact, the worst case scenario is when KVM thinks it
can use the VMX preemption timer, as KVM will eat the overhead of calling
into vmx_set_hv_timer() and falling back to hrtimers.
Opportunistically limit kvm_can_use_hv_timer() to lapic.c as the fastpath
code was the only external user.
Stating the obvious, this allows handling MSR_IA32_TSC_DEADLINE writes in
the fastpath on AMD CPUs.
Link: https://lore.kernel.org/r/20250805190526.1453366-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop the restrictions on fastpath IPIs only working for fixed IRQs with a
physical destination now that the fastpath is explicitly limited to "fast"
delivery. Limiting delivery to a single physical APIC ID guarantees only
one vCPU will receive the event, but that isn't necessarily "fast", e.g. if
the targeted vCPU is the last of 4096 vCPUs. And logical destination mode
or shorthand (to self) can also be fast, e.g. if only a few vCPUs are
being targeted. Lastly, there's nothing inherently slow about delivering
an NMI, INIT, SIPI, SMI, etc., i.e. there's no reason to artificially
limit fastpath delivery to fixed vector IRQs.
Link: https://lore.kernel.org/r/20250805190526.1453366-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Explicitly restrict fastpath ICR writes to IPIs that are "fast", i.e. can
be delivered without having to walk all vCPUs, and that target at most 16
vCPUs. Artificially restricting ICR writes to physical mode guarantees
at most one vCPU will receive an IPI (because x2APIC IDs are read-only),
but that delivery might not be "fast". E.g. even if the vCPU exists, KVM
might have to iterate over 4096 vCPUs to find the right one.
Limiting delivery to fast IPIs aligns the WRMSR fastpath with
kvm_arch_set_irq_inatomic() (which also runs with IRQs disabled), and will
allow dropping the semi-arbitrary restrictions on delivery mode and type.
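Roughly (a sketch; using the existing kvm_irq_delivery_to_apic_fast() as
the "is this fast?" oracle is an assumption):

  struct kvm_lapic_irq irq;
  int r;

  /* ... build 'irq' from the ICR value ... */
  if (!kvm_irq_delivery_to_apic_fast(vcpu->kvm, apic, &irq, &r, NULL))
          return 1;       /* not "fast", punt to the slow path */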
Link: https://lore.kernel.org/r/20250805190526.1453366-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Extract the code for converting an ICR message into a kvm_lapic_irq
structure into a local helper so that a fast-only IPI path can share the
conversion logic.
No functional change intended.
Link: https://lore.kernel.org/r/20250805190526.1453366-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Use find_nth_bit() and make the function almost a one-liner.
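The commit message doesn't name the function; assuming it selects the Nth
set bit in a destination bitmap (as lowest-priority arbitration does), the
conversion looks like:

  /* Before: an open-coded loop walking the bitmap to the Nth set bit. */
  return find_nth_bit(bitmap, bitmap_size, vector % dest_vcpus);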
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Bypass the Centaur-only filter for the CPUID signature leaf so that
processing continues when the CPU vendor is Zhaoxin.
Signed-off-by: Ewan Hai <ewanhai-oc@zhaoxin.com>
Link: https://lore.kernel.org/r/20250818083034.93935-1-ewanhai-oc@zhaoxin.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
The Free Software Foundation does not reside in "59 Temple Place"
anymore, so we should not mention that address in the source code here.
But instead of updating the address to their current location, let's
rather drop the license boilerplate text here and use a proper SPDX
license identifier instead. The text talks about the "GNU *Lesser*
General Public License" and "any later version", so LGPL-2.1+ is the
right choice here.
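I.e., the boilerplate collapses to a single header line, e.g. for a C
header:

  /* SPDX-License-Identifier: LGPL-2.1+ */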
Signed-off-by: Thomas Huth <thuth@redhat.com>
Link: https://lore.kernel.org/r/20250728152843.310260-1-thuth@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Skip the WRMSR and HLT fastpaths in SVM's VM-Exit handler if the next RIP
isn't valid, e.g. because KVM is running with nrips=false. SVM must
decode and emulate to skip the instruction if the CPU doesn't provide the
next RIP, and getting the instruction bytes to decode requires reading
guest memory. Reading guest memory through the emulator can fault, i.e.
can sleep, which is disallowed since the fastpath handlers run with IRQs
disabled.
BUG: sleeping function called from invalid context at ./include/linux/uaccess.h:106
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 32611, name: qemu
preempt_count: 1, expected: 0
INFO: lockdep is turned off.
irq event stamp: 30580
hardirqs last enabled at (30579): [<ffffffffc08b2527>] vcpu_run+0x1787/0x1db0 [kvm]
hardirqs last disabled at (30580): [<ffffffffb4f62e32>] __schedule+0x1e2/0xed0
softirqs last enabled at (30570): [<ffffffffb4247a64>] fpu_swap_kvm_fpstate+0x44/0x210
softirqs last disabled at (30568): [<ffffffffb4247a64>] fpu_swap_kvm_fpstate+0x44/0x210
CPU: 298 UID: 0 PID: 32611 Comm: qemu Tainted: G U 6.16.0-smp--e6c618b51cfe-sleep #782 NONE
Tainted: [U]=USER
Hardware name: Google Astoria-Turin/astoria, BIOS 0.20241223.2-0 01/17/2025
Call Trace:
<TASK>
dump_stack_lvl+0x7d/0xb0
__might_resched+0x271/0x290
__might_fault+0x28/0x80
kvm_vcpu_read_guest_page+0x8d/0xc0 [kvm]
kvm_fetch_guest_virt+0x92/0xc0 [kvm]
__do_insn_fetch_bytes+0xf3/0x1e0 [kvm]
x86_decode_insn+0xd1/0x1010 [kvm]
x86_emulate_instruction+0x105/0x810 [kvm]
__svm_skip_emulated_instruction+0xc4/0x140 [kvm_amd]
handle_fastpath_invd+0xc4/0x1a0 [kvm]
vcpu_run+0x11a1/0x1db0 [kvm]
kvm_arch_vcpu_ioctl_run+0x5cc/0x730 [kvm]
kvm_vcpu_ioctl+0x578/0x6a0 [kvm]
__se_sys_ioctl+0x6d/0xb0
do_syscall_64+0x8a/0x2c0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7f479d57a94b
</TASK>
Note, this is essentially a reapply of commit 5c30e8101e8d ("KVM: SVM:
Skip WRMSR fastpath on VM-Exit if next RIP isn't valid"), but with
different justification (KVM now grabs SRCU when skipping the instruction
for other reasons).
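The guard itself is simple (a sketch):

  /* svm_exit_handlers_fastpath(), sketch: bail if next RIP is unknown. */
  if (!nrips || !svm->vmcb->control.next_rip)
          return EXIT_FASTPATH_NONE;   /* skipping needs the emulator */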
Fixes: b439eb8ab578 ("Revert "KVM: SVM: Skip WRMSR fastpath on VM-Exit if next RIP isn't valid"")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250805190526.1453366-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Emulate PERF_CNTR_GLOBAL_STATUS_SET when PerfMonV2 is enumerated to the
guest, as the MSR is supposed to exist in all AMD v2 PMUs.
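Emulating a *_STATUS_SET MSR is essentially an OR into GLOBAL_STATUS,
roughly as follows (a sketch; the macro name and the reserved-bit handling
are assumptions):

  case MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_SET:
          /* Writing '1' sets the corresponding GLOBAL_STATUS bit;
           * reserved bits must still be rejected/masked. */
          pmu->global_status |= data;
          break;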
Fixes: 4a2771895ca6 ("KVM: x86/svm/pmu: Add AMD PerfMonV2 support")
Cc: stable@vger.kernel.org
Cc: Sandipan Das <sandipan.das@amd.com>
Link: https://lore.kernel.org/r/20250711172746.1579423-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When completing emulation of an instruction that generated a userspace exit
for I/O, don't recheck L1 intercepts as KVM has already finished that
phase of instruction execution, i.e. has already committed to allowing L2
to perform I/O. If L1 (or host userspace) modifies the I/O permission
bitmaps during the exit to userspace, KVM will treat the access as being
intercepted despite already having emulated the I/O access.
Pivot on EMULTYPE_NO_DECODE to detect that KVM is completing emulation.
Of the three users of EMULTYPE_NO_DECODE, only complete_emulated_io() (the
intended "recipient") can reach the code in question. gp_interception()'s
use is mutually exclusive with is_guest_mode(), and
complete_emulated_insn_gp() unconditionally pairs EMULTYPE_NO_DECODE with
EMULTYPE_SKIP.
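In code, the pivot is essentially (a sketch; the helper names are
hypothetical stand-ins for the emulator's intercept-check path):

  /* Only check L1's I/O permission bitmaps on the initial pass, i.e.
   * skip the check when KVM is completing a userspace I/O exit. */
  if (is_guest_mode(vcpu) && !(emulation_type & EMULTYPE_NO_DECODE) &&
      l1_io_intercepted(vcpu, port, len))       /* hypothetical */
          return forward_to_l1(vcpu);           /* hypothetical */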
The bad behavior was detected by a syzkaller program that toggles port I/O
interception during the userspace I/O exit, ultimately resulting in a WARN
on vcpu->arch.pio.count being non-zero due to KVM not completing emulation
of the I/O instruction.
WARNING: CPU: 23 PID: 1083 at arch/x86/kvm/x86.c:8039 emulator_pio_in_out+0x154/0x170 [kvm]
Modules linked in: kvm_intel kvm irqbypass
CPU: 23 UID: 1000 PID: 1083 Comm: repro Not tainted 6.16.0-rc5-c1610d2d66b1-next-vm #74 NONE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:emulator_pio_in_out+0x154/0x170 [kvm]
PKRU: 55555554
Call Trace:
<TASK>
kvm_fast_pio+0xd6/0x1d0 [kvm]
vmx_handle_exit+0x149/0x610 [kvm_intel]
kvm_arch_vcpu_ioctl_run+0xda8/0x1ac0 [kvm]
kvm_vcpu_ioctl+0x244/0x8c0 [kvm]
__x64_sys_ioctl+0x8a/0xd0
do_syscall_64+0x5d/0xc60
entry_SYSCALL_64_after_hwframe+0x4b/0x53
</TASK>
Reported-by: syzbot+cc2032ba16cc2018ca25@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68790db4.a00a0220.3af5df.0020.GAE@google.com
Fixes: 8a76d7f25f8f ("KVM: x86: Add x86 callback for intercept check")
Cc: stable@vger.kernel.org
Cc: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250715190638.1899116-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
All CPUID call sites were updated at commit:
968e30006807 ("x86/cpuid: Set <asm/cpuid/api.h> as the main CPUID header")
to include <asm/cpuid/api.h> instead of <asm/cpuid.h>.
The <asm/cpuid.h> header was still retained as a wrapper, just in case
some new code in -next started using it. Now that everything is merged
to Linus' tree, remove the header.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250815070227.19981-2-darwi@linutronix.de
|
|
In order to support future versions of the SVSM_CORE_PVALIDATE call, all
reserved fields within a PVALIDATE entry must be set to zero: an SVSM is
expected to ensure that reserved fields are zero, so that future protocol
versions can make use of the reserved areas.
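I.e., roughly (a sketch; field names follow the SVSM PVALIDATE structures,
some assumed):

  /* Zero the whole entry so all reserved fields are guaranteed clear,
   * then fill in only the fields this protocol version defines. */
  memset(pe, 0, sizeof(*pe));
  pe->page_size = RMP_PG_SIZE_4K;
  pe->action    = action;
  pe->pfn       = pfn;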
Fixes: fcd042e86422 ("x86/sev: Perform PVALIDATE using the SVSM when not at VMPL0")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Joerg Roedel <joerg.roedel@amd.com>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/7cde412f8b057ea13a646fb166b1ca023f6a5031.1755098819.git.thomas.lendacky@amd.com
|