| Age | Commit message (Collapse) | Author | Lines |
|
Now that the x86 topology code has a sensible nodes-per-package
measure, that does not depend on the online status of CPUs, use this
to divinate the SNC mode.
Note that when Cluster on Die (CoD) is configured on older systems this
will also show multiple NUMA nodes per package. Intel Resource Director
Technology is incomaptible with CoD. Print a warning and do not use the
fixup MSR_RMID_SNC_CONFIG.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Link: https://patch.msgid.link/aaCxbbgjL6OZ6VMd@agluck-desk3
Link: https://patch.msgid.link/20260303110100.367976706@infradead.org
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 resource control updates from Borislav Petkov:
- Extend the resctrl machinery to support telemetry monitoring on
Intel (Tony Luck)
The practical usage of this is being able to tell how much energy or
how much work can be attributed to a group of tasks tracked under a
single idenitifier. Prepend this work with proper refactoring of
resctrl domains handling code.
* tag 'x86_cache_for_v7.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
x86,fs/resctrl: Update documentation for telemetry events
x86/resctrl: Enable RDT_RESOURCE_PERF_PKG
fs/resctrl: Move RMID initialization to first mount
x86,fs/resctrl: Compute number of RMIDs as minimum across resources
fs/resctrl: Move allocation/free of closid_num_dirty_rmid[]
x86/resctrl: Handle number of RMIDs supported by RDT_RESOURCE_PERF_PKG
x86/resctrl: Add energy/perf choices to rdt boot option
x86,fs/resctrl: Handle domain creation/deletion for RDT_RESOURCE_PERF_PKG
fs/resctrl: Refactor rmdir_mondata_subdir_allrdtgrp()
fs/resctrl: Refactor mkdir_mondata_subdir()
x86/resctrl: Read telemetry events
x86/resctrl: Find and enable usable telemetry events
x86,fs/resctrl: Add architectural event pointer
x86,fs/resctrl: Fill in details of events for performance and energy GUIDs
x86/resctrl: Discover hardware telemetry events
fs/resctrl: Emphasize that L3 monitoring resource is required for summing domains
x86,fs/resctrl: Add and initialize a resource for package scope monitoring
x86,fs/resctrl: Add an architectural hook called for first mount
x86,fs/resctrl: Support binary fixed point event counters
x86,fs/resctrl: Handle events that can be read from any CPU
...
|
|
The memory bandwidth calculation relies on reading the hardware counter
and measuring the delta between samples. To ensure accurate measurement,
the software reads the counter frequently enough to prevent it from
rolling over twice between reads.
The default Memory Bandwidth Monitoring (MBM) counter width is 24 bits.
Hygon CPUs provide a 32-bit width counter, but they do not support the
MBM capability CPUID leaf (0xF.[ECX=1]:EAX) to report the width offset
(from 24 bits).
Consequently, the kernel falls back to the 24-bit default counter width,
which causes incorrect overflow handling on Hygon CPUs.
Fix this by explicitly setting the counter width offset to 8 bits (resulting
in a 32-bit total counter width) for Hygon CPUs.
Fixes: d8df126349da ("x86/cpu/hygon: Add missing resctrl_cpu_detect() in bsp_init helper")
Signed-off-by: Xiaochen Shen <shenxiaochen@open-hieco.net>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20251209062650.1536952-3-shenxiaochen@open-hieco.net
|
|
Hygon CPUs supporting Platform QoS features currently undergo partial resctrl
initialization through resctrl_cpu_detect() in the Hygon BSP init helper and
AMD/Hygon common initialization code. However, several critical data
structures remain uninitialized for Hygon CPUs in the following paths:
- get_mem_config()-> __rdt_get_mem_config_amd():
rdt_resource::membw,alloc_capable
hw_res::num_closid
- rdt_init_res_defs()->rdt_init_res_defs_amd():
rdt_resource::cache
hw_res::msr_base,msr_update
Add the missing AMD/Hygon common initialization to ensure proper Platform QoS
functionality on Hygon CPUs.
Fixes: d8df126349da ("x86/cpu/hygon: Add missing resctrl_cpu_detect() in bsp_init helper")
Signed-off-by: Xiaochen Shen <shenxiaochen@open-hieco.net>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20251209062650.1536952-2-shenxiaochen@open-hieco.net
|
|
Since telemetry events are enumerated on resctrl mount the RDT_RESOURCE_PERF_PKG
resource is not considered "monitoring capable" during early resctrl initialization.
This means that the domain list for RDT_RESOURCE_PERF_PKG is not built when the CPU
hotplug notifiers are registered and run for the first time right after resctrl
initialization.
Mark the RDT_RESOURCE_PERF_PKG as "monitoring capable" upon successful telemetry
event enumeration to ensure future CPU hotplug events include this resource and
initialize its domain list for CPUs that are already online.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
|
|
resctrl assumes that only the L3 resource supports monitor events, so it
simply takes the rdt_resource::num_rmid from RDT_RESOURCE_L3 as the system's
number of RMIDs.
The addition of telemetry events in a different resource breaks that
assumption.
Compute the number of available RMIDs as the minimum value across all
mon_capable resources (analogous to how the number of CLOSIDs is computed
across alloc_capable resources).
Note that mount time enumeration of the telemetry resource means that
this number can be reduced. If this happens, then some memory will
be wasted as the allocations for rdt_l3_mon_domain::mbm_states[] and
rdt_l3_mon_domain::rmid_busy_llc created during resctrl initialization will
be larger than needed.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
|
|
There are now three meanings for "number of RMIDs":
1) The number for legacy features enumerated by CPUID leaf 0xF. This is the
maximum number of distinct values that can be loaded into MSR_IA32_PQR_ASSOC.
Note that systems with Sub-NUMA Cluster mode enabled will force scaling down
the CPUID enumerated value by the number of SNC nodes per L3-cache.
2) The number of registers in MMIO space for each event. This is enumerated in
the XML files and is the value initialized into event_group::num_rmid.
3) The number of "hardware counters" (this isn't a strictly accurate
description of how things work, but serves as a useful analogy that does
describe the limitations) feeding to those MMIO registers. This is enumerated
in telemetry_region::num_rmids returned by intel_pmt_get_regions_by_feature().
Event groups with insufficient "hardware counters" to track all RMIDs are
difficult for users to use, since the system may reassign "hardware counters"
at any time. This means that users cannot reliably collect two consecutive
event counts to compute the rate at which events are occurring.
Disable such event groups by default. The user may override this with
a command line "rdt=" option. In this case limit an under-resourced event
group's number of possible monitor resource groups to the lowest number of
"hardware counters".
Scan all enabled event groups and assign the RDT_RESOURCE_PERF_PKG resource
"num_rmid" value to the smallest of these values as this value will be used
later to compare against the number of RMIDs supported by other resources to
determine how many monitoring resource groups are supported.
N.B. Change type of resctrl_mon::num_rmid to u32 to match its usage and the
type of event_group::num_rmid so that min(r->num_rmid, e->num_rmid) won't
complain about mixing signed and unsigned types.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
|
|
Legacy resctrl features are enumerated by X86_FEATURE_* flags. These may be
overridden by quirks to disable features in the case of errata. Users can use
kernel command line options to either disable a feature, or to force enable
a feature that was disabled by a quirk.
A different approach is needed for hardware features that do not have an
X86_FEATURE_* flag.
Update parsing of the "rdt=" boot parameter to call the telemetry driver
directly to handle new "perf" and "energy" options that controls activation of
telemetry monitoring of the named type. By itself a "perf" or "energy" option
controls the forced enabling or disabling (with ! prefix) of all event groups
of the named type. A ":guid" suffix allows for fine grained control per event
group.
[ bp: s/intel_aet_option/intel_handle_aet_option/g ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
|
|
The L3 resource has several requirements for domains. There are per-domain
structures that hold the 64-bit values of counters, and elements to keep
track of the overflow and limbo threads.
None of these are needed for the PERF_PKG resource. The hardware counters
are wide enough that they do not wrap around for decades.
Define a new rdt_perf_pkg_mon_domain structure which just consists of the
standard rdt_domain_hdr to keep track of domain id and CPU mask.
Update resctrl_online_mon_domain() for RDT_RESOURCE_PERF_PKG. The only action
needed for this resource is to create and populate domain directories if a
domain is added while resctrl is mounted.
Similarly resctrl_offline_mon_domain() only needs to remove domain directories.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
|
|
Introduce intel_aet_read_event() to read telemetry events for resource
RDT_RESOURCE_PERF_PKG. There may be multiple aggregators tracking each
package, so scan all of them and add up all counters. Aggregators may return
an invalid data indication if they have received no records for a given RMID.
The user will see "Unavailable" if none of the aggregators on a package
provide valid counts.
Resctrl now uses readq() so depends on X86_64. Update Kconfig.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
|
|
Every event group has a private copy of the data of all telemetry event
aggregators (aka "telemetry regions") tracking its feature type. Included
may be regions that have the same feature type but tracking different GUID
from the event group's.
Traverse the event group's telemetry region data and mark all regions that
are not usable by the event group as unusable by clearing those regions'
MMIO addresses. A region is considered unusable if:
1) GUID does not match the GUID of the event group.
2) Package ID is invalid.
3) The enumerated size of the MMIO region does not match the expected
value from the XML description file.
Hereafter any telemetry region with an MMIO address is considered valid for
the event group it is associated with.
Enable all the event group's events as long as there is at least one usable
region from where data for its events can be read. Enabling of an event can
fail if the same event has already been enabled as part of another event
group. It should never happen that the same event is described by different
GUID supported by the same system so just WARN (via resctrl_enable_mon_event())
and skip the event.
Note that it is architecturally possible that some telemetry events are only
supported by a subset of the packages in the system. It is not expected that
systems will ever do this. If they do the user will see event files in resctrl
that always return "Unavailable".
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
|
|
The resctrl file system layer passes the domain, RMID, and event id to the
architecture to fetch an event counter.
Fetching a telemetry event counter requires additional information that is
private to the architecture, for example, the offset into MMIO space from
where the counter should be read.
Add mon_evt::arch_priv that architecture can use for any private data related
to the event. The resctrl filesystem initializes mon_evt::arch_priv when the
architecture enables the event and passes it back to architecture when needing
to fetch an event counter.
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
|
|
The telemetry event aggregators of the Intel Clearwater Forest CPU support two
RMID-based feature types: "energy" with GUID 0x26696143¹, and "perf" with
GUID 0x26557651².
The event counter offsets in an aggregator's MMIO space are arranged in groups
for each RMID.
E.g., the "energy" counters for GUID 0x26696143 are arranged like this:
MMIO offset:0x0000 Counter for RMID 0 PMT_EVENT_ENERGY
MMIO offset:0x0008 Counter for RMID 0 PMT_EVENT_ACTIVITY
MMIO offset:0x0010 Counter for RMID 1 PMT_EVENT_ENERGY
MMIO offset:0x0018 Counter for RMID 1 PMT_EVENT_ACTIVITY
...
MMIO offset:0x23F0 Counter for RMID 575 PMT_EVENT_ENERGY
MMIO offset:0x23F8 Counter for RMID 575 PMT_EVENT_ACTIVITY
After all counters there are three status registers that provide indications
of how many times an aggregator was unable to process event counts, the time
stamp for the most recent loss of data, and the time stamp of the most recent
successful update.
MMIO offset:0x2400 AGG_DATA_LOSS_COUNT
MMIO offset:0x2408 AGG_DATA_LOSS_TIMESTAMP
MMIO offset:0x2410 LAST_UPDATE_TIMESTAMP
Define event_group structures for both of these aggregator types and define
the events tracked by the aggregators in the file system code.
PMT_EVENT_ENERGY and PMT_EVENT_ACTIVITY are produced in fixed point format.
File system code must output as floating point values.
¹https://github.com/intel/Intel-PMT/blob/main/xml/CWF/OOBMSM/RMID-ENERGY/cwf_aggregator.xml
²https://github.com/intel/Intel-PMT/blob/main/xml/CWF/OOBMSM/RMID-PERF/cwf_aggregator.xml
[ bp: Massage commit message. ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
|
|
Each CPU collects data for telemetry events that it sends to the nearest
telemetry event aggregator either when the value of MSR_IA32_PQR_ASSOC.RMID
changes, or when a two millisecond timer expires.
There is a feature type ("energy" or "perf"), GUID, and MMIO region associated
with each aggregator. This combination links to an XML description of the
set of telemetry events tracked by the aggregator. XML files are published
by Intel in a GitHub repository¹.
The telemetry event aggregators maintain per-RMID per-event counts of the
total seen for all the CPUs. There may be multiple telemetry event aggregators
per package.
There are separate sets of aggregators for each feature type. Aggregators
in a set may have different GUIDs. All aggregators with the same feature
type and GUID are symmetric keeping counts for the same set of events for
the CPUs that provide data to them.
The XML file for each aggregator provides the following information:
0) Feature type of the events ("perf" or "energy")
1) Which telemetry events are tracked by the aggregator.
2) The order in which the event counters appear for each RMID.
3) The value type of each event counter (integer or fixed-point).
4) The number of RMIDs supported.
5) Which additional aggregator status registers are included.
6) The total size of the MMIO region for an aggregator.
Introduce struct event_group that condenses the relevant information from
an XML file. Hereafter an "event group" refers to a group of events of a
particular feature type (event_group::pfname set to "energy" or "perf") with
a particular GUID.
Use event_group::pfname to determine the feature id needed to obtain the
aggregator details. It will later be used in console messages and with the
rdt= boot parameter.
The INTEL_PMT_TELEMETRY driver enumerates support for telemetry events.
This driver provides intel_pmt_get_regions_by_feature() to list all available
telemetry event aggregators of a given feature type. The list includes the
"guid", the base address in MMIO space for the region where the event counters
are exposed, and the package id where the all the CPUs that report to this
aggregator are located.
Call INTEL_PMT_TELEMETRY's intel_pmt_get_regions_by_feature() for each event
group to obtain a private copy of that event group's aggregator data. Duplicate
the aggregator data between event groups that have the same feature type
but different GUID. Further processing on this private copy will be unique
to the event group.
¹https://github.com/intel/Intel-PMT
[ bp: Zap text explaining the code, s/guid/GUID/g ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
|
|
Add a new PERF_PKG resource and introduce package level scope for monitoring
telemetry events so that CPU hotplug notifiers can build domains at the
package granularity.
Use the physical package ID available via topology_physical_package_id()
to identify the monitoring domains with package level scope. This enables
user space to use:
/sys/devices/system/cpu/cpuX/topology/physical_package_id
to identify the monitoring domain a CPU is associated with.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
|
|
Enumeration of Intel telemetry events is an asynchronous process involving
several mutually dependent drivers added as auxiliary devices during the
device_initcall() phase of Linux boot. The process finishes after the probe
functions of these drivers completes. But this happens after
resctrl_arch_late_init() is executed.
Tracing the enumeration process shows that it does complete a full seven
seconds before the earliest possible mount of the resctrl file system (when
included in /etc/fstab for automatic mount by systemd).
Add a hook for use by telemetry event enumeration and initialization and
run it once at the beginning of resctrl mount without any locks held.
The architecture is responsible for any required locking.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20260105191711.GBaVwON5nZn-uO6Sqg@fat_crate.local
|
|
resctrl assumes that all monitor events can be displayed as unsigned decimal
integers.
Hardware architecture counters may provide some telemetry events with greater
precision where the event is not a simple count, but is a measurement of some
sort (e.g. Joules for energy consumed).
Add a new argument to resctrl_enable_mon_event() for architecture code to
inform the file system that the value for a counter is a fixed-point value
with a specific number of binary places.
Only allow architecture to use floating point format on events that the file
system has marked with mon_evt::is_floating_point which reflects the contract
with user space on how the event values are displayed.
Display fixed point values with values rounded to ceil(binary_bits * log10(2))
decimal places. Special case for zero binary bits to print "{value}.0".
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
|
|
resctrl assumes that monitor events can only be read from a CPU in the
cpumask_t set of each domain. This is true for x86 events accessed with an
MSR interface, but may not be true for other access methods such as MMIO.
Introduce and use flag mon_evt::any_cpu, settable by architecture, that
indicates there are no restrictions on which CPU can read that event. This
flag is not supported by the L3 event reading that requires to be run on a CPU
that belongs to the L3 domain of the event being read.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
|
|
With the arrival of monitor events tied to new domains associated with a
different resource it would be clearer if the L3 resource specific functions
are more accurately named.
Rename three groups of functions:
Functions that allocate/free architecture per-RMID MBM state information:
arch_domain_mbm_alloc() -> l3_mon_domain_mbm_alloc()
mon_domain_free() -> l3_mon_domain_free()
Functions that allocate/free filesystem per-RMID MBM state information:
domain_setup_mon_state() -> domain_setup_l3_mon_state()
domain_destroy_mon_state() -> domain_destroy_l3_mon_state()
Initialization/exit:
rdt_get_mon_l3_config() -> rdt_get_l3_mon_config()
resctrl_mon_resource_init() -> resctrl_l3_mon_resource_init()
resctrl_mon_resource_exit() -> resctrl_l3_mon_resource_exit()
Ensure kernel-doc descriptions of these functions' return values are present
and correctly formatted.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
|
|
The upcoming telemetry event monitoring is not tied to the L3 resource and
will have a new domain structure.
Rename the L3 resource specific domain data structures to include "l3_"
in their names to avoid confusion between the different resource specific
domain structures:
rdt_mon_domain -> rdt_l3_mon_domain
rdt_hw_mon_domain -> rdt_hw_l3_mon_domain
No functional change.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
|
|
Convert the whole call sequence from mon_event_read() to resctrl_arch_rmid_read() to
pass resource independent struct rdt_domain_hdr instead of an L3 specific domain
structure to prepare for monitoring events in other resources.
This additional layer of indirection obscures which aspects of event counting depend
on a valid domain. Event initialization, support for assignable counters, and normal
event counting implicitly depend on a valid domain while summing of domains does not.
Split summing domains from the core event counting handling to make their respective
dependencies obvious.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
|
|
Up until now, all monitoring events were associated with the L3 resource and it
made sense to use the L3 specific "struct rdt_mon_domain *" argument to functions
operating on domains.
Telemetry events will be tied to a new resource with its instances represented
by a new domain structure that, just like struct rdt_mon_domain, starts with
the generic struct rdt_domain_hdr.
Prepare to support domains belonging to different resources by changing the
calling convention of functions operating on domains. Pass the generic header
and use that to find the domain specific structure where needed.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
|
|
For symmetry with domain_remove_cpu_mon() refactor domain_remove_cpu_ctrl()
to take an early return when removing a CPU does not empty the domain.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
|
|
New telemetry events will be associated with a new package scoped resource
with a new domain structure.
Refactor domain_remove_cpu_mon() so all the L3 domain processing is separate
from the general domain action of clearing the CPU bit in the mask.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
|
|
Carve out the resource monitoring domain init code into a separate helper in
order to be able to initialize new types of monitoring domains besides the
usual L3 ones.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
|
|
Every resctrl resource has a list of domain structures. struct rdt_ctrl_domain
and struct rdt_mon_domain both begin with struct rdt_domain_hdr with
rdt_domain_hdr::type used in validity checks before accessing the domain of
a particular type.
Add the resource id to struct rdt_domain_hdr in preparation for a new monitoring
domain structure that will be associated with a new monitoring resource. Improve
existing domain validity checks with a new helper domain_header_is_valid()
that checks both domain type and resource id. domain_header_is_valid() should
be used before every call to container_of() that accesses a domain structure.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 resource control updates from Borislav Petkov:
- Add support for AMD's Smart Data Cache Injection feature which allows
for direct insertion of data from I/O devices into the L3 cache, thus
bypassing DRAM and saving its bandwidth; the resctrl side of the
feature allows the size of the L3 used for data injection to be
controlled
- Add Intel Clearwater Forest to the list of CPUs which support
Sub-NUMA clustering
- Other fixes and cleanups
* tag 'x86_cache_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
fs/resctrl: Update bit_usage to reflect io_alloc
fs/resctrl: Introduce interface to modify io_alloc capacity bitmasks
fs/resctrl: Modify struct rdt_parse_data to pass mode and CLOSID
fs/resctrl: Introduce interface to display io_alloc CBMs
fs/resctrl: Add user interface to enable/disable io_alloc feature
fs/resctrl: Introduce interface to display "io_alloc" support
x86,fs/resctrl: Implement "io_alloc" enable/disable handlers
x86,fs/resctrl: Detect io_alloc feature
x86/resctrl: Add SDCIAE feature in the command line options
x86/cpufeatures: Add support for L3 Smart Data Cache Injection Allocation Enforcement
fs/resctrl: Consider sparse masks when initializing new group's allocation
x86/resctrl: Support Sub-NUMA Cluster (SNC) mode on Clearwater Forest
|
|
"io_alloc" is the generic name of the new resctrl feature that enables system
software to configure the portion of cache allocated for I/O traffic. On AMD
systems, "io_alloc" resctrl feature is backed by AMD's L3 Smart Data Cache
Injection Allocation Enforcement (SDCIAE).
Introduce the architecture-specific functions that resctrl fs should call to
enable, disable, or check status of the "io_alloc" feature. Change SDCIAE state
by setting (to enable) or clearing (to disable) bit 1 of
MSR_IA32_L3_QOS_EXT_CFG on all logical processors within the cache domain.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://patch.msgid.link/9e9070100c320eab5368e088a3642443dee95ed7.1762995456.git.babu.moger@amd.com
|
|
AMD's SDCIAE (SDCI Allocation Enforcement) PQE feature enables system software
to control the portions of L3 cache used for direct insertion of data from I/O
devices into the L3 cache.
Introduce a generic resctrl cache resource property "io_alloc_capable" as the
first part of the new "io_alloc" resctrl feature that will support AMD's
SDCIAE. Any architecture can set a cache resource as "io_alloc_capable" if
a portion of the cache can be allocated for I/O traffic.
Set the "io_alloc_capable" property for the L3 cache resource on x86 (AMD)
systems that support SDCIAE.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://patch.msgid.link/df85a9a6081674fd3ef6b4170920485512ce2ded.1762995456.git.babu.moger@amd.com
|
|
Add a kernel command-line parameter to enable or disable the exposure of
the L3 Smart Data Cache Injection Allocation Enforcement (SDCIAE) hardware
feature to resctrl.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://patch.msgid.link/c623edf7cb369ba9da966de47d9f1b666778a40e.1762995456.git.babu.moger@amd.com
|
|
mbm_event mode
The following NULL pointer dereference is encountered on mount of resctrl fs
after booting a system that supports assignable counters with the
"rdt=!mbmtotal,!mbmlocal" kernel parameters:
BUG: kernel NULL pointer dereference, address: 0000000000000008
RIP: 0010:mbm_cntr_get
Call Trace:
rdtgroup_assign_cntr_event
rdtgroup_assign_cntrs
rdt_get_tree
Specifying the kernel parameter "rdt=!mbmtotal,!mbmlocal" effectively disables
the legacy X86_FEATURE_CQM_MBM_TOTAL and X86_FEATURE_CQM_MBM_LOCAL features
and the MBM events they represent. This results in the per-domain MBM event
related data structures to not be allocated during early initialization.
resctrl fs initialization follows by implicitly enabling both MBM total and
local events on a system that supports assignable counters (mbm_event mode),
but this enabling occurs after the per-domain data structures have been
created.
After booting, resctrl fs assumes that an enabled event can access all its
state. This results in NULL pointer dereference when resctrl attempts to
access the un-allocated structures of an enabled event.
Remove the late MBM event enabling from resctrl fs.
This leaves a problem where the X86_FEATURE_CQM_MBM_TOTAL and
X86_FEATURE_CQM_MBM_LOCAL features may be disabled while assignable counter
(mbm_event) mode is enabled without any events to support. Switching between
the "default" and "mbm_event" mode without any events is not practical.
Create a dependency between the X86_FEATURE_{CQM_MBM_TOTAL,CQM_MBM_LOCAL} and
X86_FEATURE_ABMC (assignable counter) hardware features. An x86 system that
supports assignable counters now requires support of X86_FEATURE_CQM_MBM_TOTAL
or X86_FEATURE_CQM_MBM_LOCAL.
This ensures all needed MBM related data structures are created before use and
that it is only possible to switch between "default" and "mbm_event" mode when
the same events are available in both modes. This dependency does not exist in
the hardware but this usage of these feature settings work for known systems.
[ bp: Massage commit message. ]
Fixes: 13390861b426e ("x86,fs/resctrl: Detect Assignable Bandwidth Monitoring feature details")
Co-developed-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://patch.msgid.link/a62e6ac063d0693475615edd213d5be5e55443e6.1760560934.git.babu.moger@amd.com
|
|
unavailable RMID
Users can create as many monitoring groups as the number of RMIDs supported
by the hardware. However, on AMD systems, only a limited number of RMIDs
are guaranteed to be actively tracked by the hardware. RMIDs that exceed
this limit are placed in an "Unavailable" state.
When a bandwidth counter is read for such an RMID, the hardware sets
MSR_IA32_QM_CTR.Unavailable (bit 62). When such an RMID starts being tracked
again the hardware counter is reset to zero. MSR_IA32_QM_CTR.Unavailable
remains set on first read after tracking re-starts and is clear on all
subsequent reads as long as the RMID is tracked.
resctrl miscounts the bandwidth events after an RMID transitions from the
"Unavailable" state back to being tracked. This happens because when the
hardware starts counting again after resetting the counter to zero, resctrl
in turn compares the new count against the counter value stored from the
previous time the RMID was tracked.
This results in resctrl computing an event value that is either undercounting
(when new counter is more than stored counter) or a mistaken overflow (when
new counter is less than stored counter).
Reset the stored value (arch_mbm_state::prev_msr) of MSR_IA32_QM_CTR to
zero whenever the RMID is in the "Unavailable" state to ensure accurate
counting after the RMID resets to zero when it starts to be tracked again.
Example scenario that results in mistaken overflow
==================================================
1. The resctrl filesystem is mounted, and a task is assigned to a
monitoring group.
$mount -t resctrl resctrl /sys/fs/resctrl
$mkdir /sys/fs/resctrl/mon_groups/test1/
$echo 1234 > /sys/fs/resctrl/mon_groups/test1/tasks
$cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
21323 <- Total bytes on domain 0
"Unavailable" <- Total bytes on domain 1
Task is running on domain 0. Counter on domain 1 is "Unavailable".
2. The task runs on domain 0 for a while and then moves to domain 1. The
counter starts incrementing on domain 1.
$cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
7345357 <- Total bytes on domain 0
4545 <- Total bytes on domain 1
3. At some point, the RMID in domain 0 transitions to the "Unavailable"
state because the task is no longer executing in that domain.
$cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
"Unavailable" <- Total bytes on domain 0
434341 <- Total bytes on domain 1
4. Since the task continues to migrate between domains, it may eventually
return to domain 0.
$cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
17592178699059 <- Overflow on domain 0
3232332 <- Total bytes on domain 1
In this case, the RMID on domain 0 transitions from "Unavailable" state to
active state. The hardware sets MSR_IA32_QM_CTR.Unavailable (bit 62) when
the counter is read and begins tracking the RMID counting from 0.
Subsequent reads succeed but return a value smaller than the previously
saved MSR value (7345357). Consequently, the resctrl's overflow logic is
triggered, it compares the previous value (7345357) with the new, smaller
value and incorrectly interprets this as a counter overflow, adding a large
delta.
In reality, this is a false positive: the counter did not overflow but was
simply reset when the RMID transitioned from "Unavailable" back to active
state.
Here is the text from APM [1] available from [2].
"In PQOS Version 2.0 or higher, the MBM hardware will set the U bit on the
first QM_CTR read when it begins tracking an RMID that it was not
previously tracking. The U bit will be zero for all subsequent reads from
that RMID while it is still tracked by the hardware. Therefore, a QM_CTR
read with the U bit set when that RMID is in use by a processor can be
considered 0 when calculating the difference with a subsequent read."
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3 Monitoring L3 Memory
Bandwidth (MBM).
[ bp: Split commit message into smaller paragraph chunks for better
consumption. ]
Fixes: 4d05bf71f157d ("x86/resctrl: Introduce AMD QOS feature")
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Reinette Chatre <reinette.chatre@intel.com>
Cc: stable@vger.kernel.org # needs adjustments for <= v6.17
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 # [2]
|
|
Clearwater Forest supports SNC mode. Add it to the snc_cpu_ids[] table.
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Acked-by: Tony Luck <tony.luck@intel.com>
|
|
Configure mbm_event mode on AMD platforms. On AMD platforms, it is
recommended to use the mbm_event mode, if supported, to prevent the
hardware from resetting counters between reads. This can result in
misleading values or display "Unavailable" if no counter is assigned
to the event.
Enable mbm_event mode, known as ABMC (Assignable Bandwidth Monitoring
Counters) on AMD, by default if the system supports it.
Update ABMC across all logical processors within the resctrl domain to
ensure proper functionality.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
|
|
System software reads resctrl event data for a particular resource by writing
the RMID and Event Identifier (EvtID) to the QM_EVTSEL register and then
reading the event data from the QM_CTR register.
In ABMC mode, the event data of a specific counter ID is read by setting the
following fields: QM_EVTSEL.ExtendedEvtID = 1, QM_EVTSEL.EvtID = L3CacheABMC
(=1) and setting QM_EVTSEL.RMID to the desired counter ID. Reading the QM_CTR
then returns the contents of the specified counter ID. RMID_VAL_ERROR bit is
set if the counter configuration is invalid, or if an invalid counter ID is
set in the QM_EVTSEL.RMID field. RMID_VAL_UNAVAIL bit is set if the counter
data is unavailable.
Introduce resctrl_arch_reset_cntr() and resctrl_arch_cntr_read() to reset and
read event data for a specific counter.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
|
|
resctrl_arch_rmid_read() adjusts the value obtained from MSR_IA32_QM_CTR to
account for the overflow for MBM events and apply counter scaling for all the
events. This logic is common to both reading an RMID and reading a hardware
counter directly.
Refactor the hardware value adjustment logic into get_corrected_val() to
prepare for support of reading a hardware counter.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
|
|
with ABMC
The ABMC feature allows users to assign a hardware counter to an RMID,
event pair and monitor bandwidth usage as long as it is assigned. The
hardware continues to track the assigned counter until it is explicitly
unassigned by the user.
Implement an x86 architecture-specific handler to configure a counter. This
architecture specific handler is called by resctrl fs when a counter is
assigned or unassigned as well as when an already assigned counter's
configuration should be updated. Configure counters by writing to the
L3_QOS_ABMC_CFG MSR, specifying the counter ID, bandwidth source (RMID),
and event configuration.
The ABMC feature details are documented in APM [1] available from [2].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 # [2]
|
|
The ABMC feature allows users to assign a hardware counter to an RMID,
event pair and monitor bandwidth usage as long as it is assigned. The
hardware continues to track the assigned counter until it is explicitly
unassigned by the user.
The ABMC feature implements an MSR L3_QOS_ABMC_CFG (C000_03FDh).
ABMC counter assignment is done by setting the counter id, bandwidth
source (RMID) and bandwidth configuration.
Attempts to read or write the MSR when ABMC is not enabled will result
in a #GP(0) exception.
Introduce the data structures and definitions for MSR L3_QOS_ABMC_CFG
(0xC000_03FDh):
=========================================================================
Bits Mnemonic Description Access Reset
Type Value
=========================================================================
63 CfgEn Configuration Enable R/W 0
62 CtrEn Enable/disable counting R/W 0
61:53 – Reserved MBZ 0
52:48 CtrID Counter Identifier R/W 0
47 IsCOS BwSrc field is a CLOSID R/W 0
(not an RMID)
46:44 – Reserved MBZ 0
43:32 BwSrc Bandwidth Source R/W 0
(RMID or CLOSID)
31:0 BwType Bandwidth configuration R/W 0
tracked by the CtrID
==========================================================================
The ABMC feature details are documented in APM [1] available from [2].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).
[ bp: Touchups. ]
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 # [2]
|
|
Add the functionality to enable/disable the AMD ABMC feature.
The AMD ABMC feature is enabled by setting enabled bit(0) in the
L3_QOS_EXT_CFG MSR. When the state of ABMC is changed, the MSR needs to be
updated on all the logical processors in the QOS Domain.
Hardware counters will reset when ABMC state is changed.
[ bp: Massage commit message. ]
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 # [2]
|
|
ABMC feature details are reported via CPUID Fn8000_0020_EBX_x5.
Bits Description
15:0 MAX_ABMC Maximum Supported Assignable Bandwidth
Monitoring Counter ID + 1
The ABMC feature details are documented in APM [1] available from [2].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).
Detect the feature and number of assignable counters supported. For backward
compatibility, upon detecting the assignable counter feature, enable the
mbm_total_bytes and mbm_local_bytes events that users are familiar with as
part of original L3 MBM support.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 # [2]
|
|
The cache allocation and memory bandwidth allocation feature properties are
consolidated into struct resctrl_cache and struct resctrl_membw respectively.
In preparation for more monitoring properties that will clobber the existing
resource struct more, re-organize the monitoring specific properties to also
be in a separate structure.
Also convert "bandwidth sources" terminology to "memory transactions" to have
consistency within resctrl for related monitoring features.
[ bp: Massage commit message. ]
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
|
|
Add a kernel command-line parameter to enable or disable the exposure of
the ABMC (Assignable Bandwidth Monitoring Counters) hardware feature to
resctrl.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
|
|
There's a rule in computer programming that objects appear zero, once, or many
times. So code accordingly.
There are two MBM events and resctrl is coded with a lot of
if (local)
do one thing
if (total)
do a different thing
Change the rdt_mon_domain and rdt_hw_mon_domain structures to hold arrays of
pointers to per event data instead of explicit fields for total and local
bandwidth.
Simplify by coding for many events using loops on which are enabled.
Move resctrl_is_mbm_event() to <linux/resctrl.h> so it can be used more
widely. Also provide a for_each_mbm_event_id() helper macro.
Cleanup variable names in functions touched to consistently use "eventid" for
those with type enum resctrl_event_id.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
|
|
rdt_mon_features is used as a bitmask of enabled monitor events. A monitor
event's status is now maintained in mon_evt::enabled with all monitor events'
mon_evt structures found in the filesystem's mon_event_all[] array.
Remove the remaining uses of rdt_mon_features.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
|
|
The resctrl file system now has complete knowledge of the status of every
event. So there is no need for per-event function calls to check.
Replace each of the resctrl_arch_is_{event}enabled() calls with
resctrl_is_mon_event_enabled(QOS_{EVENT}).
No functional change.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
|
|
There are currently only three monitor events, all associated with the
RDT_RESOURCE_L3 resource. Growing support for additional events will be easier
with some restructuring to have a single point in file system code where all
attributes of all events are defined.
Place all event descriptions into an array mon_event_all[]. Doing this has the
beneficial side effect of removing the need for rdt_resource::evt_list.
Add resctrl_event_id::QOS_FIRST_EVENT for a lower bound on range checks for
event ids and as the starting index to scan mon_event_all[].
Drop the code that builds evt_list and change the two places where the list is
scanned to scan mon_event_all[] instead using a new helper macro
for_each_mon_event().
Architecture code now informs file system code which events are available with
resctrl_enable_mon_event().
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
|
|
subsystem
In the resctrl subsystem's Sub-NUMA Cluster (SNC) mode, the rdt_mon_domain
structure representing a NUMA node relies on the cacheinfo interface
(rdt_mon_domain::ci) to store L3 cache information (e.g., shared_cpu_map)
for monitoring. The L3 cache information of a SNC NUMA node determines
which domains are summed for the "top level" L3-scoped events.
rdt_mon_domain::ci is initialized using the first online CPU of a NUMA
node. When this CPU goes offline, its shared_cpu_map is cleared to contain
only the offline CPU itself. Subsequently, attempting to read counters
via smp_call_on_cpu(offline_cpu) fails (and error ignored), returning
zero values for "top-level events" without any error indication.
Replace the cacheinfo references in struct rdt_mon_domain and struct
rmid_read with the cacheinfo ID (a unique identifier for the L3 cache).
rdt_domain_hdr::cpu_mask contains the online CPUs associated with that
domain. When reading "top-level events", select a CPU from
rdt_domain_hdr::cpu_mask and utilize its L3 shared_cpu_map to determine
valid CPUs for reading RMID counter via the MSR interface.
Considering all CPUs associated with the L3 cache improves the chances
of picking a housekeeping CPU on which the counter reading work can be
queued, avoiding an unnecessary IPI.
Fixes: 328ea68874642 ("x86/resctrl: Prepare for new Sub-NUMA Cluster (SNC) monitor files")
Signed-off-by: Qinyun Tan <qinyuntan@linux.alibaba.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250530182053.37502-2-qinyuntan@linux.alibaba.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 resource control updates from Borislav Petkov:
"Carve out the resctrl filesystem-related code into fs/resctrl/ so that
multiple architectures can share the fs API for manipulating their
respective hw resource control implementation.
This is the second step in the work towards sharing the resctrl
filesystem interface, the next one being plugging ARM's MPAM into the
aforementioned fs API"
* tag 'x86_cache_for_v6.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
MAINTAINERS: Add reviewers for fs/resctrl
x86,fs/resctrl: Move the resctrl filesystem code to live in /fs/resctrl
x86/resctrl: Always initialise rid field in rdt_resources_all[]
x86/resctrl: Relax some asm #includes
x86/resctrl: Prefer alloc(sizeof(*foo)) idiom in rdt_init_fs_context()
x86/resctrl: Squelch whitespace anomalies in resctrl core code
x86/resctrl: Move pseudo lock prototypes to include/linux/resctrl.h
x86/resctrl: Fix types in resctrl_arch_mon_ctx_{alloc,free}() stubs
x86/resctrl: Move enum resctrl_event_id to resctrl.h
x86/resctrl: Move the filesystem bits to headers visible to fs/resctrl
fs/resctrl: Add boiler plate for external resctrl code
x86/resctrl: Add 'resctrl' to the title of the resctrl documentation
x86/resctrl: Split trace.h
x86/resctrl: Expand the width of domid by replacing mon_data_bits
x86/resctrl: Add end-marker to the resctrl_event_id enum
x86/resctrl: Move is_mba_sc() out of core.c
x86/resctrl: Drop __init/__exit on assorted symbols
x86/resctrl: Resctrl_exit() teardown resctrl but leave the mount point
x86/resctrl: Check all domains are offline in resctrl_exit()
x86/resctrl: Rename resctrl_sched_in() to begin with "resctrl_arch_"
...
|
|
Resctrl is a filesystem interface to hardware that provides cache
allocation policy and bandwidth control for groups of tasks or CPUs.
To support more than one architecture, resctrl needs to live in /fs/.
Move the code that is concerned with the filesystem interface to
/fs/resctrl.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250515165855.31452-25-james.morse@arm.com
|
|
x86 has an array, rdt_resources_all[], of all possible resources.
The for-each-resource walkers depend on the rid field of all
resources being initialised.
If the array ever grows due to another architecture adding a resource
type that is not defined on x86, the for-each-resources walkers will
loop forever.
Initialise all the rid values in resctrl_arch_late_init() before
any for-each-resource walker can be called.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250515165855.31452-24-james.morse@arm.com
|