| Age | Commit message (Collapse) | Author | Lines |
|
'intel/vt-d', 'amd/amd-vi' and 'core' into next
|
|
DMA aliasing causes interrupt remapping table entries (IRTEs) to be shared
between multiple device IDs. See commit 3c124435e8dd
("iommu/amd: Support multiple PCI DMA aliases in IRQ Remapping") for more
information on this. However, the AMD IOMMU driver currently invalidates
IRTE cache entries on a per-device basis whenever an IRTE is updated, not
for each alias.
This approach leaves stale IRTE cache entries when an IRTE is cached under
one DMA alias but later updated and invalidated through a different alias.
In such cases, the original device ID is never invalidated, since it is
programmed via aliasing.
This incoherency bug has been observed when IRTEs are cached for one
Non-Transparent Bridge (NTB) DMA alias, later updated via another.
Fix this by invalidating the interrupt remapping table cache for all DMA
aliases when updating an IRTE.
Co-developed-by: Lars B. Kristiansen <larsk@dolphinics.com>
Signed-off-by: Lars B. Kristiansen <larsk@dolphinics.com>
Co-developed-by: Jonas Markussen <jonas@dolphinics.com>
Signed-off-by: Jonas Markussen <jonas@dolphinics.com>
Co-developed-by: Tore H. Larsen <torel@simula.no>
Signed-off-by: Tore H. Larsen <torel@simula.no>
Signed-off-by: Magnus Kalland <magnus@dolphinics.com>
Link: https://lore.kernel.org/linux-iommu/9204da81-f821-4034-b8ad-501e43383b56@amd.com/
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Currently clone_alias() assumes first argument (pdev) is always the
original device pointer. This function is called by
pci_for_each_dma_alias() which based on topology decides to send
original or alias device details in first argument.
This meant that the source devid used to look up and copy the DTE
may be incorrect, leading to wrong or stale DTE entries being
propagated to alias device.
Fix this by passing the original pdev as the opaque data argument to
both the direct clone_alias() call and pci_for_each_dma_alias(). Inside
clone_alias(), retrieve the original device from data and compute devid
from it.
Fixes: 3332364e4ebc ("iommu/amd: Support multiple PCI DMA aliases in device table")
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
iommupt always supports the semantics required for DMA-FQ, when drivers
are converted to use it they automatically get support.
Detect iommpt directly instead of using IOMMU_CAP_DEFERRED_FLUSH and
remove IOMMU_CAP_DEFERRED_FLUSH from converted drivers.
This will also enable DMA-FQ on RISC-V.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
In the current AMD IOMMU debugfs, when multiple processes simultaneously
access the IOMMU mmio/cap registers using the IOMMU debugfs, illegal
access issues can occur in the following execution flow:
1. CPU1: Sets a valid access address using iommu_mmio/capability_write,
and verifies the access address's validity in iommu_mmio/capability_show
2. CPU2: Sets an invalid address using iommu_mmio/capability_write
3. CPU1: accesses the IOMMU mmio/cap registers based on the invalid
address, resulting in an illegal access.
This patch modifies the execution process to first verify the address's
validity and then access it based on the same address, ensuring
correctness and robustness.
Signed-off-by: Guanghui Feng <guanghuifeng@linux.alibaba.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
In the current AMD IOMMU debugFS, when multiple processes use the IOMMU
debugFS process simultaneously, illegal access issues can occur in the
following execution flow:
1. CPU1: Sets a valid sbdf via devid_write, then checks the sbdf's
validity in execution flows such as devid_show, iommu_devtbl_show,
and iommu_irqtbl_show.
2. CPU2: Sets an invalid sbdf via devid_write, at which point the sbdf
value is -1.
3. CPU1: accesses the IOMMU device table, IRQ table, based on the
invalid SBDF value of -1, resulting in illegal access.
This is especially problematic in monitoring scripts, where multiple
scripts may access debugFS simultaneously, and some scripts may
unexpectedly set invalid values, which triggers illegal access in
debugfs.
This patch modifies the execution flow of devid_show,
iommu_devtbl_show, and iommu_irqtbl_show to ensure that these
processes determine the validity and access based on the
same device-id, thus guaranteeing correctness and robustness.
Signed-off-by: Guanghui Feng <guanghuifeng@linux.alibaba.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
PCIe ATS may be disabled by platform firmware, root complex limitations,
or kernel policy even when a device advertises the ATS capability in its
PCI configuration space.
Add a new IOMMU_CAP_PCI_ATS_SUPPORTED capability to allow IOMMU drivers
to report the effective ATS decision for a device.
When this capability is true for a device, ATS may be enabled for that
device, but it does not imply that ATS is currently enabled.
A subsequent patch will extend iommufd to expose the effective ATS
status to userspace.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Previously, commit 8388f7df936b ("iommu/amd: Do not support
IOMMU_DOMAIN_IDENTITY after SNP is enabled") prevented users from
changing the IOMMU domain to identity if SNP was enabled.
This resulted in an error when writing to sysfs:
# echo "identity" > /sys/kernel/iommu_groups/50/type
-bash: echo: write error: Cannot allocate memory
However, commit 4402f2627d30 ("iommu/amd: Implement global identity
domain") changed the flow of the code, skipping the SNP guard and
allowing users to change the IOMMU domain to identity after a machine
has booted.
Once the user does that, they will probably try to bind and the
device/driver will start to do DMA which will trigger errors:
iommu ivhd3: AMD-Vi: Event logged [ILLEGAL_DEV_TABLE_ENTRY device=0000:43:00.0 pasid=0x00000 address=0x3737b01000 flags=0x0020]
iommu ivhd3: AMD-Vi: Control Reg : 0xc22000142148d
AMD-Vi: DTE[0]: 6000000000000003
AMD-Vi: DTE[1]: 0000000000000001
AMD-Vi: DTE[2]: 2000003088b3e013
AMD-Vi: DTE[3]: 0000000000000000
bnxt_en 0000:43:00.0 (unnamed net_device) (uninitialized): Error (timeout: 500015) msg {0x0 0x0} len:0
iommu ivhd3: AMD-Vi: Event logged [ILLEGAL_DEV_TABLE_ENTRY device=0000:43:00.0 pasid=0x00000 address=0x3737b01000 flags=0x0020]
iommu ivhd3: AMD-Vi: Control Reg : 0xc22000142148d
AMD-Vi: DTE[0]: 6000000000000003
AMD-Vi: DTE[1]: 0000000000000001
AMD-Vi: DTE[2]: 2000003088b3e013
AMD-Vi: DTE[3]: 0000000000000000
bnxt_en 0000:43:00.0: probe with driver bnxt_en failed with error -16
To prevent this from happening, create an attach wrapper for
identity_domain_ops which returns EINVAL if amd_iommu_snp_en is true.
With this commit applied:
# echo "identity" > /sys/kernel/iommu_groups/62/type
-bash: echo: write error: Invalid argument
Fixes: 4402f2627d30 ("iommu/amd: Implement global identity domain")
Signed-off-by: Joe Damato <joe@dama.to>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Currently, PPR Log and GA logs for AMD IOMMU are allocated using
iommu_alloc_pages_sz(), which does not account for NUMA affinity. This can
lead to remote memory access latencies if the memory is allocated on a
different node than the IOMMU hardware.
Switch to iommu_alloc_pages_node_sz() to ensure that these data structures
are allocated on the same NUMA node as the IOMMU device. If the node
information is unavailable, it defaults to NUMA_NO_NODE.
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Ankit Soni <Ankit.Soni@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Conversion performed via this Coccinelle script:
// SPDX-License-Identifier: GPL-2.0-only
// Options: --include-headers-for-types --all-includes --include-headers --keep-comments
virtual patch
@gfp depends on patch && !(file in "tools") && !(file in "samples")@
identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex,
kzalloc_obj,kzalloc_objs,kzalloc_flex,
kvmalloc_obj,kvmalloc_objs,kvmalloc_flex,
kvzalloc_obj,kvzalloc_objs,kvzalloc_flex};
@@
ALLOC(...
- , GFP_KERNEL
)
$ make coccicheck MODE=patch COCCI=gfp.cocci
Build and boot tested x86_64 with Fedora 42's GCC and Clang:
Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01
Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu updates from Joerg Roedel:
"Core changes:
- Rust bindings for IO-pgtable code
- IOMMU page allocation debugging support
- Disable ATS during PCI resets
Intel VT-d changes:
- Skip dev-iotlb flush for inaccessible PCIe device
- Flush cache for PASID table before using it
- Use right invalidation method for SVA and NESTED domains
- Ensure atomicity in context and PASID entry updates
AMD-Vi changes:
- Support for nested translations
- Other minor improvements
ARM-SMMU-v2 changes:
- Configure SoC-specific prefetcher settings for Qualcomm's "MDSS"
ARM-SMMU-v3 changes:
- Improve CMDQ locking fairness for pathetically small queue sizes
- Remove tracking of the IAS as this is only relevant for AArch32 and
was causing C_BAD_STE errors
- Add device-tree support for NVIDIA's CMDQV extension
- Allow some hitless transitions for the 'MEV' and 'EATS' STE fields
- Don't disable ATS for nested S1-bypass nested domains
- Additions to the kunit selftests"
* tag 'iommu-updates-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (54 commits)
iommupt: Always add IOVA range to iotlb_gather in gather_range_pages()
iommu/amd: serialize sequence allocation under concurrent TLB invalidations
iommu/amd: Fix type of type parameter to amd_iommufd_hw_info()
iommu/arm-smmu-v3: Do not set disable_ats unless vSTE is Translate
iommu/arm-smmu-v3-test: Add nested s1bypass/s1dssbypass coverage
iommu/arm-smmu-v3: Mark EATS_TRANS safe when computing the update sequence
iommu/arm-smmu-v3: Mark STE MEV safe when computing the update sequence
iommu/arm-smmu-v3: Add update_safe bits to fix STE update sequence
iommu/arm-smmu-v3: Add device-tree support for CMDQV driver
iommu/tegra241-cmdqv: Decouple driver from ACPI
iommu/arm-smmu-qcom: Restore ACTLR settings for MDSS on sa8775p
iommu/vt-d: Fix race condition during PASID entry replacement
iommu/vt-d: Clear Present bit before tearing down context entry
iommu/vt-d: Clear Present bit before tearing down PASID entry
iommu/vt-d: Flush piotlb for SVM and Nested domain
iommu/vt-d: Flush cache for PASID table before using it
iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode
iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without scalable mode
rust: iommu: fix `srctree` link warning
rust: iommu: fix Rust formatting
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq cleanups from Thomas Gleixner:
"A series of treewide cleanups to ensure interrupt request consistency.
- Add the missing IRQF_COND_ONESHOT flag to devm_request_irq()
This is inconsistent vs request_irq() and causes the same issues
which where addressed with the introduction of this flag
- Cleanup IRQF_ONESHOT and IRQF_NO_THREAD usage
Quite some drivers have inconsistent interrupt request flags
related to interrupt threading namely IRQF_ONESHOT and
IRQF_NO_THREAD. This leads to warnings and/or malfunction when
forced interrupt threading is enabled.
- Remove stub primary (hard interrupt) handlers
A bunch of drivers implement a stub primary (hard interrupt)
handler which just returns IRQ_WAKE_THREAD. The same functionality
is provided by the core code when the primary handler argument of
request_thread_irq() is set to NULL"
* tag 'irq-cleanups-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
media: pci: mg4b: Use IRQF_NO_THREAD
mfd: wm8350-core: Use IRQF_ONESHOT
thermal/qcom/lmh: Replace IRQF_ONESHOT with IRQF_NO_THREAD
rtc: amlogic-a4: Remove IRQF_ONESHOT
usb: typec: fusb302: Remove IRQF_ONESHOT
EDAC/altera: Remove IRQF_ONESHOT
char: tpm: cr50: Remove IRQF_ONESHOT
ARM: versatile: Remove IRQF_ONESHOT
scsi: efct: Use IRQF_ONESHOT and default primary handler
Bluetooth: btintel_pcie: Use IRQF_ONESHOT and default primary handler
bus: fsl-mc: Use default primary handler
mailbox: bcm-ferxrm-mailbox: Use default primary handler
iommu/amd: Use core's primary handler and set IRQF_ONESHOT
platform/x86: int0002: Remove IRQF_ONESHOT from request_irq()
genirq: Set IRQF_COND_ONESHOT in devm_request_irq().
|
|
'core' into next
|
|
With concurrent TLB invalidations, completion wait randomly gets timed out
because cmd_sem_val was incremented outside the IOMMU spinlock, allowing
CMD_COMPL_WAIT commands to be queued out of sequence and breaking the
ordering assumption in wait_on_sem().
Move the cmd_sem_val increment under iommu->lock so completion sequence
allocation is serialized with command queuing.
And remove the unnecessary return.
Fixes: d2a0cac10597 ("iommu/amd: move wait_on_sem() out of spinlock")
Tested-by: Srikanth Aithal <sraithal@amd.com>
Reported-by: Srikanth Aithal <sraithal@amd.com>
Signed-off-by: Ankit Soni <Ankit.Soni@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
request_threaded_irq() is invoked with a primary and a secondary handler
and no flags are passed. The primary handler is the same as
irq_default_primary_handler() so there is no need to have an identical
copy.
The lack of the IRQF_ONESHOT can be dangerous because the interrupt
source is not masked while the threaded handler is active. This means,
especially on LEVEL typed interrupt lines, the interrupt can fire again
before the threaded handler had a chance to run.
Use the default primary interrupt handler by specifying NULL and set
IRQF_ONESHOT so the interrupt source is masked until the secondary
handler is done.
Fixes: 72fe00f01f9a3 ("x86/amd-iommu: Use threaded interupt handler")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260128095540.863589-4-bigeasy@linutronix.de
|
|
When building with -Wincompatible-function-pointer-types-strict, a
warning designed to catch kernel control flow integrity (kCFI) issues at
build time, there is an instance around amd_iommufd_hw_info():
drivers/iommu/amd/iommu.c:3141:13: error: incompatible function pointer types initializing 'void *(*)(struct device *, u32 *, enum iommu_hw_info_type *)' (aka 'void *(*)(struct device *, unsigned int *, enum iommu_hw_info_type *)') with an expression of type 'void *(struct device *, u32 *, u32 *)' (aka 'void *(struct device *, unsigned int *, unsigned int *)') [-Werror,-Wincompatible-function-pointer-types-strict]
3141 | .hw_info = amd_iommufd_hw_info,
| ^~~~~~~~~~~~~~~~~~~
While 'u32 *' and 'enum iommu_hw_info_type *' are ABI compatible, hence
no regular warning from -Wincompatible-function-pointer-types, the
mismatch will trigger a kCFI violation when amd_iommufd_hw_info() is
called indirectly.
Update the type parameter of amd_iommufd_hw_info() to be
'enum iommu_hw_info_type *' to match the prototype in
'struct iommu_ops', clearing up the warning and kCFI violation.
Fixes: 7d8b06ecc45b ("iommu/amd: Add support for hw_info for iommu capability query")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
This fixes warning reported by 0-DAY CI Kernel Test Service.
Fixes: 757d2b1fdf5b ("iommu/amd: Introduce gDomID-to-hDomID Mapping and handle parent domain invalidation")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202601190634.bl7Mjx5Q-lkp@intel.com/
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Currently, the error path of amd_iommu_probe_device() unconditionally
references dev_data, which may not be initialized if an early failure
occurs (like iommu_init_device() fails).
Move the out_err label to ensure the function exits immediately on
failure without accessing potentially uninitialized dev_data.
Fixes: 19e5cc156cb ("iommu/amd: Enable support for up to 2K interrupts per function")
Cc: Rakuram Eswaran <rakuram.e96@gmail.com>
Cc: Jörg Rödel <joro@8bytes.org>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202512191724.meqJENXe-lkp@intel.com/
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Introduce set_dte_nested() to program guest translation settings in
the host DTE when attaches the nested domain to a device.
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Introduce the amd_iommu_set_dte_v1() helper function to configure
IOMMU host (v1) page table into DTE. This will be used later
when attaching nested doamin.
Also, remove obsolete warning when SNP is enabled and domain id
is zero since this check is no longer applicable.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
amd_iommu_make_clear_dte()
To help avoid duplicate logic when programing DTE for nested translation.
Note that this commit changes behavior of when the IOMMU driver is
switching domain during attach and the blocking domain, where DTE bit
fields for interrupt pass-through (i.e. Lint0, Lint1, NMI, INIT, ExtInt)
and System management message could be affected. These DTE bits are
specified in the IVRS table for specific devices, and should be persistent.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
invalidation
Each nested domain is assigned guest domain ID (gDomID), which guest OS
programs into guest Device Table Entry (gDTE). For each gDomID, the driver
assigns a corresponding host domain ID (hDomID), which will be programmed
into the host Device Table Entry (hDTE).
The hDomID is allocated during amd_iommu_alloc_domain_nested(),
and free during nested_domain_free(). The gDomID-to-hDomID mapping info
(struct guest_domain_mapping_info) is stored in a per-viommu xarray
(struct amd_iommu_viommu.gdomid_array), which is indexed by gDomID.
Note also that parent domain can be shared among struct iommufd_viommu.
Therefore, when hypervisor invalidates the nest parent domain, the AMD
IOMMU command INVALIDATE_IOMMU_PAGES must be issued for each hDomID in
the gdomid_array. This is handled by the iommu_flush_pages_v1_hdom_ids(),
where it iterates through struct protection_domain.viommu_list.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
The nested domain is allocated with IOMMU_DOMAIN_NESTED type to store
stage-1 translation (i.e. GVA->GPA). This includes the GCR3 root pointer
table along with guest page tables. The struct iommu_hwpt_amd_guest
contains this information, and is passed from user-space as a parameter
of the struct iommu_ops.domain_alloc_nested().
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Which stores reference to nested parent domain assigned during the call to
struct iommu_ops.viommu_init(). Information in the nest parent is needed
when setting up the nested translation.
Note that the viommu initialization will be introduced in subsequent
commit.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
To support nested translation, the nest parent domain is allocated with
IOMMU_HWPT_ALLOC_NEST_PARENT flag, and stores information of the v1 page
table for stage 2 (i.e. GPA->SPA).
Also, only support nest parent domain on AMD system, which can support
the Guest CR3 Table (GCR3TRPMode) feature. This feature is required in
order to program DTE[GCR3 Table Root Pointer] with the GPA.
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
The GCR3TRPMode feature allows the DTE[GCR3TRP] field to be configured
with GPA (instead of SPA). This simplifies the implementation, and is
a pre-requisite for nested translation support.
Therefore, always enable this feature if available.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Which includes DTE update, clone_aliases, DTE flush and completion-wait
commands to avoid code duplication when reuse to setup DTE for nested
translation.
Also, make amd_iommu_update_dte() non-static to reuse in
in a new nested.c file for nested translation.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
This will be reused in a new nested.c file for nested translation.
Also, remove unused function parameter ptr.
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Also change the define to use GENMASK_ULL instead.
There is no functional change.
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
AMD IOMMU Extended Feature (EFR) and Extended Feature 2 (EFR2) registers
specify features supported by each IOMMU hardware instance.
The IOMMU driver checks each feature-specific bits before enabling
each feature at run time.
For IOMMUFD, the hypervisor passes the raw value of amd_iommu_efr and
amd_iommu_efr2 to VMM via iommufd IOMMU_DEVICE_GET_HW_INFO ioctl.
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
alloc_irq_table() contains a conditional check for a NULL iommu pointer
when computing the NUMA node, but the function dereferences iommu
in multiple places afterwards.
All callers ensure that a valid iommu pointer is passed in, and a NULL
iommu is not expected by the current callers. Remove the incorrect
NULL check to make the assumptions consistent and address the Smatch
warning.
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202512191724.meqJENXe-lkp@intel.com/
Signed-off-by: Rakuram Eswaran <rakuram.e96@gmail.com>
Reviewed-by: Ankit Soni <Ankit.Soni@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
With iommu.strict=1, the existing completion wait path can cause soft
lockups under stressed environment, as wait_on_sem() busy-waits under the
spinlock with interrupts disabled.
Move the completion wait in iommu_completion_wait() out of the spinlock.
wait_on_sem() only polls the hardware-updated cmd_sem and does not require
iommu->lock, so holding the lock during the busy wait unnecessarily
increases contention and extends the time with interrupts disabled.
Signed-off-by: Ankit Soni <Ankit.Soni@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
So that both iommu.c and init.c can utilize them. Also define a new
function 'pdom_id_destroy()' to destroy 'pdom_ids' instead of directly
calling ida functions.
Signed-off-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Currently AMD IOMMU driver does not reserve domain ids programmed in the
DTE while reusing the device table inside kdump kernel. This can cause
reallocation of these domain ids for newer domains that are created by
the kdump kernel, which can lead to potential IO_PAGE_FAULTs
Hence reserve these ids inside pdom_ids.
Fixes: 38e5f33ee359 ("iommu/amd: Reuse device table for kdump")
Signed-off-by: Sairaj Kodilkar <sarunkod@amd.com>
Reported-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/devsec/tsm
Pull PCIe Link Encryption and Device Authentication from Dan Williams:
"New PCI infrastructure and one architecture implementation for PCIe
link encryption establishment via platform firmware services.
This work is the result of multiple vendors coming to consensus on
some core infrastructure (thanks Alexey, Yilun, and Aneesh!), and
three vendor implementations, although only one is included in this
pull. The PCI core changes have an ack from Bjorn, the crypto/ccp/
changes have an ack from Tom, and the iommu/amd/ changes have an ack
from Joerg.
PCIe link encryption is made possible by the soup of acronyms
mentioned in the shortlog below. Link Integrity and Data Encryption
(IDE) is a protocol for installing keys in the transmitter and
receiver at each end of a link. That protocol is transported over Data
Object Exchange (DOE) mailboxes using PCI configuration requests.
The aspect that makes this a "platform firmware service" is that the
key provisioning and protocol is coordinated through a Trusted
Execution Envrionment (TEE) Security Manager (TSM). That is either
firmware running in a coprocessor (AMD SEV-TIO), or quasi-hypervisor
software (Intel TDX Connect / ARM CCA) running in a protected CPU
mode.
Now, the only reason to ask a TSM to run this protocol and install the
keys rather than have a Linux driver do the same is so that later, a
confidential VM can ask the TSM directly "can you certify this
device?".
That precludes host Linux from provisioning its own keys, because host
Linux is outside the trust domain for the VM. It also turns out that
all architectures, save for one, do not publish a mechanism for an OS
to establish keys in the root port. So "TSM-established link
encryption" is the only cross-architecture path for this capability
for the foreseeable future.
This unblocks the other arch implementations to follow in v6.20/v7.0,
once they clear some other dependencies, and it unblocks the next
phase of work to implement the end-to-end flow of confidential device
assignment. The PCIe specification calls this end-to-end flow Trusted
Execution Environment (TEE) Device Interface Security Protocol
(TDISP).
In the meantime, Linux gets a link encryption facility which has
practical benefits along the same lines as memory encryption. It
authenticates devices via certificates and may protect against
interposer attacks trying to capture clear-text PCIe traffic.
Summary:
- Introduce the PCI/TSM core for the coordination of device
authentication, link encryption and establishment (IDE), and later
management of the device security operational states (TDISP).
Notify the new TSM core layer of PCI device arrival and departure
- Add a low level TSM driver for the link encryption establishment
capabilities of the AMD SEV-TIO architecture
- Add a library of helpers TSM drivers to use for IDE establishment
and the DOE transport
- Add skeleton support for 'bind' and 'guest_request' operations in
support of TDISP"
* tag 'tsm-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/devsec/tsm: (23 commits)
crypto/ccp: Fix CONFIG_PCI=n build
virt: Fix Kconfig warning when selecting TSM without VIRT_DRIVERS
crypto/ccp: Implement SEV-TIO PCIe IDE (phase1)
iommu/amd: Report SEV-TIO support
psp-sev: Assign numbers to all status codes and add new
ccp: Make snp_reclaim_pages and __sev_do_cmd_locked public
PCI/TSM: Add 'dsm' and 'bound' attributes for dependent functions
PCI/TSM: Add pci_tsm_guest_req() for managing TDIs
PCI/TSM: Add pci_tsm_bind() helper for instantiating TDIs
PCI/IDE: Initialize an ID for all IDE streams
PCI/IDE: Add Address Association Register setup for downstream MMIO
resource: Introduce resource_assigned() for discerning active resources
PCI/TSM: Drop stub for pci_tsm_doe_transfer()
drivers/virt: Drop VIRT_DRIVERS build dependency
PCI/TSM: Report active IDE streams
PCI/IDE: Report available IDE streams
PCI/IDE: Add IDE establishment helpers
PCI: Establish document for PCI host bridge sysfs attributes
PCI: Add PCIe Device 3 Extended Capability enumeration
PCI/TSM: Establish Secure Sessions and Link Encryption
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull SoC driver updates from Arnd Bergmann:
"This is the first half of the driver changes:
- A treewide interface change to the "syscore" operations for power
management, as a preparation for future Tegra specific changes
- Reset controller updates with added drivers for LAN969x, eic770 and
RZ/G3S SoCs
- Protection of system controller registers on Renesas and Google
SoCs, to prevent trivially triggering a system crash from e.g.
debugfs access
- soc_device identification updates on Nvidia, Exynos and Mediatek
- debugfs support in the ST STM32 firewall driver
- Minor updates for SoC drivers on AMD/Xilinx, Renesas, Allwinner, TI
- Cleanups for memory controller support on Nvidia and Renesas"
* tag 'soc-drivers-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (114 commits)
memory: tegra186-emc: Fix missing put_bpmp
Documentation: reset: Remove reset_controller_add_lookup()
reset: fix BIT macro reference
reset: rzg2l-usbphy-ctrl: Fix a NULL vs IS_ERR() bug in probe
reset: th1520: Support reset controllers in more subsystems
reset: th1520: Prepare for supporting multiple controllers
dt-bindings: reset: thead,th1520-reset: Add controllers for more subsys
dt-bindings: reset: thead,th1520-reset: Remove non-VO-subsystem resets
reset: remove legacy reset lookup code
clk: davinci: psc: drop unused reset lookup
reset: rzg2l-usbphy-ctrl: Add support for RZ/G3S SoC
reset: rzg2l-usbphy-ctrl: Add support for USB PWRRDY
dt-bindings: reset: renesas,rzg2l-usbphy-ctrl: Document RZ/G3S support
reset: eswin: Add eic7700 reset driver
dt-bindings: reset: eswin: Documentation for eic7700 SoC
reset: sparx5: add LAN969x support
dt-bindings: reset: microchip: Add LAN969x support
soc: rockchip: grf: Add select correct PWM implementation on RK3368
soc/tegra: pmc: Add USB wake events for Tegra234
amba: tegra-ahb: Fix device leak on SMMU enable
...
|
|
The SEV-TIO switch in the AMD BIOS is reported to the OS via
the IOMMU Extended Feature 2 register (EFR2), bit 1.
Add helper to parse the bit and report the feature presence.
Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
Link: https://patch.msgid.link/20251202024449.542361-4-aik@amd.com
Acked-by: Joerg Roedel <joerg.roedel@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
'nvidia/tegra', 'intel/vt-d', 'amd/amd-vi' and 'core' into next
|
|
If the IOVA is limited to less than 48 the page table will be constructed
with a 3 level configuration which is unsupported by hardware.
Like the second stage the caller needs to pass in both the top_level an
the vasz to specify a table that has more levels than required to hold the
IOVA range.
Fixes: 6cbc09b7719e ("iommu/vt-d: Restore previous domain::aperture_end calculation")
Reported-by: Calvin Owens <calvin@wbinvd.org>
Closes: https://lore.kernel.org/r/8f257d2651eb8a4358fcbd47b0145002e5f1d638.1764237717.git.calvin@wbinvd.org
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Calvin Owens <calvin@wbinvd.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
modify_irte_ga()
The return type of __modify_irte_ga() is int, but modify_irte_ga()
treats it as a bool. Casting the int to bool discards the error code.
To fix the issue, change the type of ret to int in modify_irte_ga().
Fixes: 57cdb720eaa5 ("iommu/amd: Do not flush IRTE when only updating isRun and destination fields")
Cc: stable@vger.kernel.org
Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Several drivers can benefit from registering per-instance data along
with the syscore operations. To achieve this, move the modifiable fields
out of the syscore_ops structure and into a separate struct syscore that
can be registered with the framework. Add a void * driver data field for
drivers to store contextual data that will be passed to the syscore ops.
Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Signed-off-by: Thierry Reding <treding@nvidia.com>
|
|
Fix a memory leak of struct amd_iommu_pci_segment in alloc_pci_segment()
when system memory (or contiguous memory) is insufficient.
Fixes: 04230c119930 ("iommu/amd: Introduce per PCI segment device table")
Fixes: eda797a27795 ("iommu/amd: Introduce per PCI segment rlookup table")
Fixes: 99fc4ac3d297 ("iommu/amd: Introduce per PCI segment alias_table")
Cc: stable@vger.kernel.org
Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Current IOMMU driver prints "Completion-wait Time-out" error message with
insufficient information to further debug the issue.
Enhancing the error message as following:
1. Log IOMMU PCI device ID in the error message.
2. With "amd_iommu_dump=1" kernel command line option, dump entire
command buffer entries including Head and Tail offset.
Dump the entire command buffer only on the first 'Completion-wait Time-out'
to avoid dmesg spam.
Signed-off-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Reviewed-by: Ankit Soni <Ankit.Soni@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
None of this is used anymore, delete it.
Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Replace the io_pgtable versions with pt_iommu versions. The v2 page table
uses the x86 implementation that will be eventually shared with VT-d.
This supports the same special features as the original code:
- increase_top for the v1 format to allow scaling from 3 to 6 levels
- non-present flushing
- Dirty tracking for v1 only
- __sme_set() to adjust the PTEs for CC
- Optimization for flushing with virtualization to minimize the range
- amd_iommu_pgsize_bitmap override of the native page sizes
- page tables allocate from the device's NUMA node
Rework the domain ops so that v1/v2 get their own ops. Make dedicated
allocation functions for v1 and v2. Hook up invalidation for a top change
to struct pt_iommu_flush_ops. Delete some of the iopgtable related code
that becomes unused in this patch. The next patch will delete the rest of
it.
This fixes a race bug in AMD's increase_address_space() implementation. It
stores the top level and top pointer in different memory, which prevents
other threads from reading a coherent version:
increase_address_space() alloc_pte()
level = pgtable->mode - 1;
pgtable->root = pte;
pgtable->mode += 1;
pte = &pgtable->root[PM_LEVEL_INDEX(level, address)];
The iommupt version is careful to put mode and root under a single
READ_ONCE and then is careful to only READ_ONCE a single time per
walk.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
In iommu_mmio_write(), it validates the user-provided offset with the
check: `iommu->dbg_mmio_offset > iommu->mmio_phys_end - 4`.
This assumes a 4-byte access. However, the corresponding
show handler, iommu_mmio_show(), uses readq() to perform an 8-byte
(64-bit) read.
If a user provides an offset equal to `mmio_phys_end - 4`, the check
passes, and will lead to a 4-byte out-of-bounds read.
Fix this by adjusting the boundary check to use sizeof(u64), which
corresponds to the size of the readq() operation.
Fixes: 7a4ee419e8c1 ("iommu/amd: Add debugfs support to dump IOMMU MMIO registers")
Signed-off-by: Songtang Liu <liusongtang@bytedance.com>
Reviewed-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Tested-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
The IOMMU core attaches each device to a default domain on probe(). Then,
every new "attach" operation has a fundamental meaning of two-fold:
- detach from its currently attached (old) domain
- attach to a given new domain
Modern IOMMU drivers following this pattern usually want to clean up the
things related to the old domain, so they call iommu_get_domain_for_dev()
to fetch the old domain.
Pass in the old domain pointer from the core to drivers, aligning with the
set_dev_pasid op that does so already.
Ensure all low-level attach fcuntions in the core can forward the correct
old domain pointer. Thus, rework those functions as well.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
The set_dev_pasid for a release domain never gets called anyhow. So, there
is no point in defining a separate release_domain from the blocked_domain.
Simply reuse the blocked_domain.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|