linux/drivers/idle, branch v6.7

intel_idle: Add ibrs_off module parameter to force-disable IBRS

2023-10-07T09:33:28Z

Commit bf5835bcdb96 ("intel_idle: Disable IBRS during long idle") disables IBRS when the cstate is 6 or lower. However, there are some use cases where a customer may want to use max_cstate=1 to lower latency. Such use cases will suffer from the performance degradation caused by the enabling of IBRS in the sibling idle thread. Add a "ibrs_off" module parameter to force disable IBRS and the CPUIDLE_FLAG_IRQ_ENABLE flag if set. In the case of a Skylake server with max_cstate=1, this new ibrs_off option will likely increase the IRQ response latency as IRQ will now be disabled. When running SPECjbb2015 with cstates set to C1 on a Skylake system. First test when the kernel is booted with: "intel_idle.ibrs_off": max-jOPS = 117828, critical-jOPS = 66047 Then retest when the kernel is booted without the "intel_idle.ibrs_off" added: max-jOPS = 116408, critical-jOPS = 58958 That means booting with "intel_idle.ibrs_off" improves performance by: max-jOPS: +1.2%, which could be considered noise range. critical-jOPS: +12%, which is definitely a solid improvement. The admin-guide/pm/intel_idle.rst file is updated to add a description about the new "ibrs_off" module parameter. Signed-off-by: Waiman Long Signed-off-by: Ingo Molnar Acked-by: Rafael J. Wysocki Cc: Linus Torvalds Link: https://lore.kernel.org/r/20230727184600.26768-5-longman@redhat.com

intel_idle: Use __update_spec_ctrl() in intel_idle_ibrs()

2023-10-07T09:33:28Z

When intel_idle_ibrs() is called, it modifies the SPEC_CTRL MSR to 0 in order disable IBRS. However, the new MSR value isn't reflected in x86_spec_ctrl_current which is at odd with the other code that keep track of its state in that percpu variable. Use the new __update_spec_ctrl() to have the x86_spec_ctrl_current percpu value properly updated. Since spec-ctrl.h includes both msr.h and nospec-branch.h, we can remove those from the include file list. Signed-off-by: Waiman Long Signed-off-by: Ingo Molnar Acked-by: Rafael J. Wysocki Cc: Linus Torvalds Link: https://lore.kernel.org/r/20230727184600.26768-4-longman@redhat.com

Merge tag 'perf-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2023-08-28T23:35:01Z

Pull perf event updates from Ingo Molnar: - AMD IBS improvements - Intel PMU driver updates - Extend core perf facilities & the ARM PMU driver to better handle ARM big.LITTLE events - Micro-optimize software events and the ring-buffer code - Misc cleanups & fixes * tag 'perf-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/uncore: Remove unnecessary ?: operator around pcibios_err_to_errno() call perf/x86/intel: Add Crestmont PMU x86/cpu: Update Hybrids x86/cpu: Fix Crestmont uarch x86/cpu: Fix Gracemont uarch perf: Remove unused extern declaration arch_perf_get_page_size() perf: Remove unused PERF_PMU_CAP_HETEROGENEOUS_CPUS capability arm_pmu: Remove unused PERF_PMU_CAP_HETEROGENEOUS_CPUS capability perf/x86: Remove unused PERF_PMU_CAP_HETEROGENEOUS_CPUS capability arm_pmu: Add PERF_PMU_CAP_EXTENDED_HW_TYPE capability perf/x86/ibs: Set mem_lvl_num, mem_remote and mem_hops for data_src perf/mem: Add PERF_MEM_LVLNUM_NA to PERF_MEM_NA perf/mem: Introduce PERF_MEM_LVLNUM_UNC perf/ring_buffer: Use local_try_cmpxchg in __perf_output_begin locking/arch: Avoid variable shadowing in local_try_cmpxchg() perf/core: Use local64_try_cmpxchg in perf_swevent_set_period perf/x86: Use local64_try_cmpxchg perf/amd: Prevent grouping of IBS events

x86/cpu: Fix Gracemont uarch

2023-08-09T19:51:06Z

Alderlake N is an E-core only product using Gracemont micro-architecture. It fits the pre-existing naming scheme perfectly fine, adhere to it. Signed-off-by: Peter Zijlstra (Intel) Acked-by: Rafael J. Wysocki Acked-by: Hans de Goede Link: https://lore.kernel.org/r/20230807150405.686834933@infradead.org

Revert "intel_idle: Add support for using intel_idle in a VM guest using just hlt"

2023-07-19T18:10:03Z

This reverts commit 2f3d08f074b0 ("intel_idle: Add support for using intel_idle in a VM guest using just hlt"), because it causes functional issues to appear and it is not really useful without a related commit that got reverted previously. Link: https://lore.kernel.org/linux-pm/5c7de6d5-7706-c4a5-7c41-146db1269aff@intel.com Reported-by: Xiaoyao Li Requested-by: Peter Zijlstra Signed-off-by: Rafael J. Wysocki

Revert "intel_idle: Add a "Long HLT" C1 state for the VM guest mode"

2023-07-19T18:05:43Z

This reverts commit 0fac214bb75e ("intel_idle: Add a "Long HLT" C1 state for the VM guest mode"), because there is a coding mistake in it and its validity is questioned. Link: https://lore.kernel.org/all/20230711132553.GN3062772@hirez.programming.kicks-ass.net Requested-by: Peter Zijlstra Signed-off-by: Rafael J. Wysocki

Revert "intel_idle: Add __init annotation to matchup_vm_state_with_baremetal()"

2023-07-19T17:56:08Z

This reverts commit b2918089d5cb ("intel_idle: Add __init annotation to matchup_vm_state_with_baremetal()"), because the commit fixed by it will be reverted. Signed-off-by: Rafael J. Wysocki

intel_idle: Add __init annotation to matchup_vm_state_with_baremetal()

2023-06-28T17:09:55Z

The caller of (recently added) matchup_vm_state_with_baremetal() is an __init function and it uses some __initdata data structures, so add the __init annotation to it for consistency. This addresses the following build warnings: WARNING: modpost: vmlinux: section mismatch in reference: matchup_vm_state_with_baremetal+0x51 (section: .text) -> intel_idle_max_cstate_reached (section: .init.text) WARNING: modpost: vmlinux: section mismatch in reference: matchup_vm_state_with_baremetal+0x62 (section: .text) -> cpuidle_state_table (section: .init.data) WARNING: modpost: vmlinux: section mismatch in reference: matchup_vm_state_with_baremetal+0x79 (section: .text) -> icpu (section: .init.data) Fixes: 0fac214bb75e ("intel_idle: Add a "Long HLT" C1 state for the VM guest mode") Reported-by: Randy Dunlap Signed-off-by: Rafael J. Wysocki Tested-by: Randy Dunlap # build-tested Reviewed-by: Randy Dunlap

intel_idle: Add a "Long HLT" C1 state for the VM guest mode

2023-06-21T17:46:58Z

intel_idle will, for the bare metal case, usually have one or more deep power states that have the CPUIDLE_FLAG_TLB_FLUSHED flag set. When a state with this flag is selected by the cpuidle framework, it will also flush the TLBs as part of entering this state. The benefit of doing this is that the kernel does not need to wake the cpu out of this deep power state just to flush the TLBs... for which the latency can be very high due to the exit latency of deep power states. In a VM guest currently, this benefit of avoiding the wakeup does not exist, while the problem (long exit latency) is even more severe. Linux will need to wake up a vCPU (causing the host to either come out of a deep C state, or the VMM to have to deschedule something else to schedule the vCPU) which can take a very long time.. adding a lot of latency to tlb flush operations (including munmap and others). To solve this, add a "Long HLT" C state to the state table for the VM guest case that has the CPUIDLE_FLAG_TLB_FLUSHED flag set. The result of that is that for long idle periods (where the VMM is likely to do things that cause large latency) the cpuidle framework will flush the TLBs (and avoid the wakeups), while for short/quick idle durations, the existing behavior is retained. Now, there is still only "hlt" available in the guest, but for long idle, the host can go to a deeper state (say C6). There is a reasonable debate one can have to what to set for the exit_latency and break even point for this "Long HLT" state. The good news is that intel_idle has these values available for the underlying CPU (even when mwait is not exposed). The solution thus is to just use the latency and break even of the deepest state from the bare metal CPU. This is under the assumption that this is a pretty reasonable estimate of what the VMM would do to cause latency. Signed-off-by: Arjan van de Ven Signed-off-by: Rafael J. Wysocki

intel_idle: Add support for using intel_idle in a VM guest using just hlt

2023-06-16T17:12:36Z

In a typical VM guest, the mwait instruction is not available, leaving only the 'hlt' instruction (which causes a VMEXIT to the host). So for this common case, intel_idle will detect the lack of mwait, and fail to initialize (after which another idle method would step in which will just use hlt always). Other (non-common) cases exist; the table below shows the before/after for these: +------------+--------------------------+-------------------------+ | Hypervisor | Idle method before patch | Idle method after patch | | exposes | | | +============+==========================+=========================+ | nothing | default_idle fallback | intel_idle VM table | | (common) | (straight "hlt") | | +------------+--------------------------+-------------------------+ | mwait | intel_idle mwait table | intel_idle mwait table | +------------+--------------------------+-------------------------+ | ACPI | ACPI C1 state ("hlt") | intel_idle VM table | +------------+--------------------------+-------------------------+ This is only applicable to CPUs known by intel_idle. For the bare metal case, unknown CPU models will use the ACPI tables (when available) to get estimates for latency and break even point for longer idle states. In guests, the common case is that ACPI tables are not available, but even when they are available, they can't and don't provide the latency information for the longer (mwait based) states. For this scenario (unknown CPU model), the default_idle mode (no ACPI) or ACPI C1 (ACPI avaible) will be used. By providing capability to do this with the intel_idle driver, we can do better than the fallback or ACPI table methods. While this current change only gets us to the existing behavior, later patches in this series will add new capabilities such as optimized TLB flushing. In order to do this, a simplified version of the initialization function for VM guests is created, and this will be called if the CPU is recognized, but mwait is not supported, and we're in a VM guest. One thing to note is that the max latency (and break even) of this C1 state is higher than the typical bare metal C1 state. Because hlt causes a vmexit, and the cost of vmexit + hypervisor overhead + vmenter is typically in the order of upto 5 microseconds... even if the hypervisor does not actually goes into a hardware power saving state. Signed-off-by: Arjan van de Ven [ rjw: Dropped redundant checks from should_verify_mwait() ] Signed-off-by: Rafael J. Wysocki