| Age | Commit message (Collapse) | Author | Files | Lines |
|
In xe_hw_engine_group_get_mode(), a write lock is acquired before
calling switch_mode(), which in turn invokes
xe_hw_engine_group_suspend_faulting_lr_jobs().
On failure inside xe_hw_engine_group_suspend_faulting_lr_jobs(),
the write lock is released there, and then again in
xe_hw_engine_group_get_mode(), leading to a double release.
Fix this by keeping both acquire and release operation in
xe_hw_engine_group_get_mode().
Fixes: 770bd1d34113 ("drm/xe/hw_engine_group: Ensure safe transition between execution modes")
Cc: Francois Dugast <francois.dugast@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Francois Dugast <francois.dugast@intel.com>
Link: https://lore.kernel.org/r/20250925023145.1203004-2-shuicheng.lin@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit 662d98b8b373007fa1b08ba93fee11f6fd3e387c)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
Currently this is hidden behind perfmon_capable() since this is
technically an info leak, given that this is a system wide metric.
However the granularity reported here is always PAGE_SIZE aligned, which
matches what the core kernel is already willing to expose to userspace
if querying how many free RAM pages there are on the system, and that
doesn't need any special privileges. In addition other drm drivers seem
happy to expose this.
The motivation here if with oneAPI where they want to use the system
wide 'used' reporting here, so not the per-client fdinfo stats. This has
also come up with some perf overlay applications wanting this
information.
Fixes: 1105ac15d2a1 ("drm/xe/uapi: restrict system wide accounting")
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Joshua Santosh <joshua.santosh.ranjan@intel.com>
Cc: José Roberto de Souza <jose.souza@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: <stable@vger.kernel.org> # v6.8+
Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250919122052.420979-2-matthew.auld@intel.com
(cherry picked from commit 4d0b035fd6dae8ee48e9c928b10f14877e595356)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
Initialize the uval variable to 0 in xe_late_bind_fw_num_fans() to fix
a potential use of uninitialized variable warning and ensure predictable
behavior.
The variable is passed by reference to xe_pcode_read() which should
populate it on success, but initializing it to 0 provides a safe
default value and follows kernel coding best practices.
v2:
- uval = 0 which serves as both a safe default and the fallback
value when the pcode read operation fails.
v3:
- Handle MMIO failure (Rodrigo)
- The function should probably return the error and make the uval as
pointer-argument, like the pcode_read.
- Change the caller of this function to propagate the error
upwards if mmio failed.
Fixes: 45832bf9c10f3 ("drm/xe/xe_late_bind_fw: Initialize late binding firmware")
Signed-off-by: Mallesh Koujalagi <mallesh.koujalagi@intel.com>
Link: https://lore.kernel.org/r/20251002005648.3185636-1-mallesh.koujalagi@intel.com
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
(cherry picked from commit 07abc16c14693df703763c45e9fc0abfefc927d5)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
When userptr is used on SVM-enabled VMs, a non-NULL
hmm_range::dev_private_owner value might mean that
hmm_range_fault() attempts to return device private pages.
Either that will fail, or the userptr code will not know
how to handle those.
Use NULL for hmm_range::dev_private_owner to migrate
such pages to system. In order to do that, move the
struct drm_gpusvm::device_private_page_owner field to
struct drm_gpusvm_ctx::device_private_page_owner so that
it doesn't remain immutable over the drm_gpusvm lifetime.
v2:
- Don't conditionally compile xe_svm_devm_owner().
- Kerneldoc xe_svm_devm_owner().
Fixes: 9e9787414882 ("drm/xe/userptr: replace xe_hmm with gpusvm")
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Acked-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://lore.kernel.org/r/20250930122752.96034-1-thomas.hellstrom@linux.intel.com
(cherry picked from commit ad298d9ec957414dbf3d51f3c8bca4b6d2416c0c)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
The variable offset is not being initialized, and it is only set inside
a for-loop if entry->name is the same as manifest_entry. In the case
where it is not initialized a non-zero check on offset is potentialy checking
a bogus uninitalized value. Fix this by initializing offset to zero.
Fixes: efa29317a553 ("drm/xe/xe_late_bind_fw: Extract and print version info")
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Badal Nilawar <badal.nilawar@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Link: https://lore.kernel.org/r/20250924102208.9216-1-colin.i.king@gmail.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
(cherry picked from commit 20f3b28e2e07747fd27301f0f5deb3cb569ee15c)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
Before calling ttm_bo_populate() in the CPU fault path of a bo,
we assert that the bo is not being migrated. However, for
local bos we share the reservation object with other local bos
that might be in the process of being migrated. Also some VM
operations may attach USAGE_KERNEL fences to the common
reservation object and trigger false positives from the assert.
So remove the assert and instead wait for bo idle. This may
unnecessarily wait for idle in some cases but since we're
doing this wait later in the fault path anyway we might as
well do it here as well.
This fixes warnings like:
Sep 25 14:56:23 desky kernel: ------------[ cut here ]------------
Sep 25 14:56:23 desky kernel: xe 0000:03:00.0: [drm] Assertion `dma_resv_test_signaled(tbo->base.resv, DMA_RESV_USAGE_KERNEL) || (tbo->ttm && ttm_tt_is_populated(tbo->ttm))` failed!
platform: BATTLEMAGE subplatform: 1
graphics: Xe2_HPG 20.01 step A0
media: Xe2_HPM 13.01 step A1
Sep 25 14:56:23 desky kernel: WARNING: CPU: 6 PID: 24767 at drivers/gpu/drm/xe/xe_bo.c:1748 xe_bo_fault_migrate+0x1bb/0x300 [xe]
Sep 25 14:56:23 desky kernel: Modules linked in: cpuid dm_crypt xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bridge stp llc xfrm_user xfr>
Sep 25 14:56:23 desky kernel: snd_soc_sdca snd_seq_midi prime_numbers coretemp snd_seq_midi_event drm_ttm_helper snd_hda_codec drm_buddy drm_exec snd_rawmidi snd_soc_core snd_hda_cor>
Sep 25 14:56:23 desky kernel: CPU: 6 UID: 1000 PID: 24767 Comm: steamwebhelper Tainted: G U W 6.17.0-rc7+ #32 PREEMPT(voluntary)
Sep 25 14:56:23 desky kernel: Tainted: [U]=USER, [W]=WARN
Sep 25 14:56:23 desky kernel: Hardware name: Micro-Star International Co., Ltd. MS-7D36/PRO Z690-P DDR4 (MS-7D36), BIOS A.A1 10/18/2022
Sep 25 14:56:23 desky kernel: RIP: 0010:xe_bo_fault_migrate+0x1bb/0x300 [xe]
Sep 25 14:56:23 desky kernel: Code: fa 64 29 f9 48 c7 c7 40 e0 d3 c1 51 48 c7 c1 c0 e3 d3 c1 52 4c 8b 45 c0 41 50 44 8b 4d c8 4d 89 e0 48 8b 55 a8 e8 25 27 95 ef <0f> 0b 48 83 c4 40 4>
Sep 25 14:56:23 desky kernel: RSP: 0000:ffffae1ca88c7b10 EFLAGS: 00010286
Sep 25 14:56:23 desky kernel: RAX: 0000000000000000 RBX: ffff8d7cfd7e6800 RCX: 0000000000000027
Sep 25 14:56:23 desky kernel: RDX: ffff8d845019cec8 RSI: 0000000000000001 RDI: ffff8d845019cec0
Sep 25 14:56:23 desky kernel: RBP: ffffae1ca88c7bc8 R08: 0000000000000000 R09: 0000000000000000
Sep 25 14:56:23 desky kernel: R10: 0000000000000000 R11: 0000000000000004 R12: ffffffffc1db1faa
Sep 25 14:56:23 desky kernel: R13: ffffffffc1db2ab4 R14: 0000000000000001 R15: ffffae1ca88c7bd8
Sep 25 14:56:23 desky kernel: FS: 00007fb1baf31940(0000) GS:ffff8d849c870000(0000) knlGS:0000000000000000
Sep 25 14:56:23 desky kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 25 14:56:23 desky kernel: CR2: 00007fb1b2860020 CR3: 00000001705a9004 CR4: 0000000000772ef0
Sep 25 14:56:23 desky kernel: PKRU: 55555558
Sep 25 14:56:23 desky kernel: Call Trace:
Sep 25 14:56:23 desky kernel: <TASK>
Sep 25 14:56:23 desky kernel: xe_bo_cpu_fault_fastpath+0x11e/0x220 [xe]
Sep 25 14:56:23 desky kernel: xe_bo_cpu_fault+0x84/0x410 [xe]
Sep 25 14:56:23 desky kernel: ? __x64_sys_mmap+0x33/0x50
Sep 25 14:56:23 desky kernel: ? x64_sys_call+0x1b2e/0x20d0
Sep 25 14:56:23 desky kernel: ? do_syscall_64+0x9d/0x1f0
Sep 25 14:56:23 desky kernel: ? __check_object_size+0x4a/0x2e0
Sep 25 14:56:23 desky kernel: __do_fault+0x36/0x190
Sep 25 14:56:23 desky kernel: do_fault+0xcf/0x570
Sep 25 14:56:23 desky kernel: __handle_mm_fault+0x92b/0xfe0
Sep 25 14:56:23 desky kernel: ? ktime_get_mono_fast_ns+0x39/0xd0
Sep 25 14:56:23 desky kernel: handle_mm_fault+0x164/0x2c0
Sep 25 14:56:23 desky kernel: do_user_addr_fault+0x2cb/0x840
Sep 25 14:56:23 desky kernel: exc_page_fault+0x75/0x180
Sep 25 14:56:23 desky kernel: asm_exc_page_fault+0x27/0x30
Sep 25 14:56:23 desky kernel: RIP: 0033:0x7fb1bc388bb7
Sep 25 14:56:23 desky kernel: Code: 48 ff c7 48 01 fe 48 8d 54 11 80 0f 1f 84 00 00 00 00 00 c5 fe 6f 0e c5 fe 6f 56 20 c5 fe 6f 5e 40 c5 fe 6f 66 60 48 83 ee 80 <c5> fd 7f 0f c5 fd 7>
Sep 25 14:56:23 desky kernel: RSP: 002b:00007ffd7814fad8 EFLAGS: 00010207
Sep 25 14:56:23 desky kernel: RAX: 00007fb1b2860000 RBX: 0000000000000690 RCX: 00007fb1b2860000
Sep 25 14:56:23 desky kernel: RDX: 00007fb1b2860610 RSI: 0000556eda79f4c0 RDI: 00007fb1b2860020
Sep 25 14:56:23 desky kernel: RBP: 00007ffd7814fb60 R08: 0000000000000000 R09: 000000012be0e000
Sep 25 14:56:23 desky kernel: R10: 00007fb1b2860000 R11: 0000000000000246 R12: 0000556edd39a240
Sep 25 14:56:23 desky kernel: R13: 00007fb1b2dcb010 R14: 0000556eda79f420 R15: 0000000000000000
Sep 25 14:56:23 desky kernel: </TASK>
Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5250
Fixes: c2ae94cf8cd8 ("drm/xe: Convert the CPU fault handler for exhaustive eviction")
Cc: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250929112649.6131-1-thomas.hellstrom@linux.intel.com
(cherry picked from commit 8f1756a7ea33b352a54e6f53d76c552b3a424187)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
In general, the VFs can't load firmwares so attempt to initialize
the firmware late-bind component leads to errors like:
[] xe 0000:03:00.1: [drm] *ERROR* Late bind component not bound
Fixes: 918bd789d62e ("drm/xe/xe_late_bind_fw: Introduce xe_late_bind_fw")
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/6190
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Badal Nilawar <badal.nilawar@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Reviewed-by: Badal Nilawar <badal.nilawar@intel.com>
Link: https://lore.kernel.org/r/20250928174811.198933-3-michal.wajdeczko@intel.com
(cherry picked from commit e35e288090f362be88d77b60d9846cea15df173e)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
This is a VF only function and its name should reflect that to
avoid any confusion. Move the VF check to the caller side.
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://lore.kernel.org/r/20250928174811.198933-2-michal.wajdeczko@intel.com
(cherry picked from commit b88bb1eefa88f0cefc00fe5e78b1186cd8f9db78)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
Spell out the syntax instead of only using examples. Particularly
important the <engine-class> part since that's different than
engines_allowed and may confuse users. The same batch buffer is used for
all engines of a certain class.
Cc: Raag Jadav <raag.jadav@intel.com>
Reviewed-by: Raag Jadav <raag.jadav@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Fixes: e2a9854d806e ("drm/xe/configfs: Allow to select by class only")
Link: https://lore.kernel.org/r/20250924152709.659483-4-lucas.demarchi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit 47ca7acff4011fa322853a3612f464b959e88210)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
If mask is NULL, only the engine class should be accepted, so the
pattern string should be completely parsed. This should fix passing e.g.
rcs0 to ctx_restore_post_bb when it's only expecting the engine class.
Reported-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Closes: https://lore.kernel.org/r/20250922155544.67712-1-jonathan.cavitt@intel.com
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/aNJKnrCQmL9xS9Gv@stanley.mountain
Fixes: e2a9854d806e ("drm/xe/configfs: Allow to select by class only")
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Reviewed-by: Raag Jadav <raag.jadav@intel.com>
Link: https://lore.kernel.org/r/20250924152709.659483-3-lucas.demarchi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit dd797967160b79cc0ca2d2eb05fc55436b66dce0)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
The following error was reported when building with clang 16.0.6:
In file included from drivers/gpu/drm/xe/xe_pci.c:1104:
>> drivers/gpu/drm/xe/tests/xe_pci.c:214:2: error: initializer \
element is not a compile-time constant
graphics_ip_xelp,
^~~~~~~~~~~~~~~~
drivers/gpu/drm/xe/tests/xe_pci.c:221:2: error: initializer \
element is not a compile-time constant
media_ip_xem,
^~~~~~~~~~~~
2 errors generated.
Fix that by explicit re-definition of pre-GMDID IPs, as there are
not so many of them.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202509192041.tQwdE4DS-lkp@intel.com/
Fixes: 5bb5258e357e ("drm/xe/tests: Add pre-GMDID IP descriptors to param generators")
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250922101207.192028-1-michal.wajdeczko@intel.com
(cherry picked from commit 2de80e2da74b402a9d838b8e729cd01cf94cdcbc)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
[Why]
When DC adds common modes it adds modes with a string to match what
they are. Non-DC doesn't. This can be inconsistent when turning on/off
DC support.
[How]
Add a name member to common_modes[] and copy it into the drm display
mode.
Cc: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Link: https://lore.kernel.org/r/20250924161624.1975819-6-mario.limonciello@amd.com
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
[Why]
DC and non-DC codepaths have different sets of common modes that are
added for eDP and LVDS cases. This can cause different behaviors for
turning on DC on hardware that can support both.
[How]
Drop extra modes from amdgpu_connector_add_common_modes() not present
in amdgpu_dm_connector_add_common_modes().
Cc: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250924161624.1975819-5-mario.limonciello@amd.com
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
To better describe what it does.
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3756
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
amdgpu_connector_add_common_modes()
[Why]
Adding or removing a mode from common_modes[] can be fragile if a user
forgot to update the for loop boundaries.
[How]
Use ARRAY_SIZE() to detect size of the array and use that instead.
Cc: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Link: https://lore.kernel.org/r/20250924161624.1975819-4-mario.limonciello@amd.com
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
The dce100_validate_global function was verbatim exactly the
same as dce60_validate_global and dce80_validate_global.
Share dce100_validate_global between DCE6-10 to save code size.
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
DCE6-8 have very similar capabilities to DCE10, they support the
same DP and HDMI versions and work similarly.
Share dce100_validate_bandwidth between DCE6-10 to reduce code
duplication in the DC driver.
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
This commit fixes a potential race condition in the userqueue fence
signaling mechanism by replacing dma_fence_is_signaled_locked() with
dma_fence_is_signaled().
The issue occurred because:
1. dma_fence_is_signaled_locked() should only be used when holding
the fence's individual lock, not just the fence list lock
2. Using the locked variant without the proper fence lock could lead
to double-signaling scenarios:
- Hardware completion signals the fence
- Software path also tries to signal the same fence
By using dma_fence_is_signaled() instead, we properly handle the
locking hierarchy and avoid the race condition while still maintaining
the necessary synchronization through the fence_list_lock.
v2: drop the comment (Christian)
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
current switch partition only check if kfd_processes_table is empty.
kfd_prcesses_table entry is deleted in kfd_process_notifier_release, but
kfd_process tear down is in kfd_process_wq_release.
consider two processes:
Process A (workqueue) -> kfd_process_wq_release -> Access kfd_node member
Process B switch partition -> amdgpu_xcp_pre_partition_switch -> amdgpu_amdkfd_device_fini_sw
-> kfd_node tear down.
Process A and B may trigger a race as shown in dmesg log.
This patch is to resolve the race by adding an atomic kfd_process counter
kfd_processes_count, it increment as create kfd process, decrement as
finish kfd_process_wq_release.
v2: Put kfd_processes_count per kfd_dev, move decrement to kfd_process_destroy_pdds
and bug fix. (Philip Yang)
[3966658.307702] divide error: 0000 [#1] SMP NOPTI
[3966658.350818] i10nm_edac
[3966658.356318] CPU: 124 PID: 38435 Comm: kworker/124:0 Kdump: loaded Tainted
[3966658.356890] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
[3966658.362839] nfit
[3966658.366457] RIP: 0010:kfd_get_num_sdma_engines+0x17/0x40 [amdgpu]
[3966658.366460] Code: 00 00 e9 ac 81 02 00 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 48 8b 4f 08 48 8b b7 00 01 00 00 8b 81 58 26 03 00 99 <f7> be b8 01 00 00 80 b9 70 2e 00 00 00 74 0b 83 f8 02 ba 02 00 00
[3966658.380967] x86_pkg_temp_thermal
[3966658.391529] RSP: 0018:ffffc900a0edfdd8 EFLAGS: 00010246
[3966658.391531] RAX: 0000000000000008 RBX: ffff8974e593b800 RCX: ffff888645900000
[3966658.391531] RDX: 0000000000000000 RSI: ffff888129154400 RDI: ffff888129151c00
[3966658.391532] RBP: ffff8883ad79d400 R08: 0000000000000000 R09: ffff8890d2750af4
[3966658.391532] R10: 0000000000000018 R11: 0000000000000018 R12: 0000000000000000
[3966658.391533] R13: ffff8883ad79d400 R14: ffffe87ff662ba00 R15: ffff8974e593b800
[3966658.391533] FS: 0000000000000000(0000) GS:ffff88fe7f600000(0000) knlGS:0000000000000000
[3966658.391534] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[3966658.391534] CR2: 0000000000d71000 CR3: 000000dd0e970004 CR4: 0000000002770ee0
[3966658.391535] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[3966658.391535] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[3966658.391536] PKRU: 55555554
[3966658.391536] Call Trace:
[3966658.391674] deallocate_sdma_queue+0x38/0xa0 [amdgpu]
[3966658.391762] process_termination_cpsch+0x1ed/0x480 [amdgpu]
[3966658.399754] intel_powerclamp
[3966658.402831] kfd_process_dequeue_from_all_devices+0x5b/0xc0 [amdgpu]
[3966658.402908] kfd_process_wq_release+0x1a/0x1a0 [amdgpu]
[3966658.410516] coretemp
[3966658.434016] process_one_work+0x1ad/0x380
[3966658.434021] worker_thread+0x49/0x310
[3966658.438963] kvm_intel
[3966658.446041] ? process_one_work+0x380/0x380
[3966658.446045] kthread+0x118/0x140
[3966658.446047] ? __kthread_bind_mask+0x60/0x60
[3966658.446050] ret_from_fork+0x1f/0x30
[3966658.446053] Modules linked in: kpatch_20765354(OEK)
[3966658.455310] kvm
[3966658.464534] mptcp_diag xsk_diag raw_diag unix_diag af_packet_diag netlink_diag udp_diag act_pedit act_mirred act_vlan cls_flower kpatch_21951273(OEK) kpatch_18424469(OEK) kpatch_19749756(OEK)
[3966658.473462] idxd_mdev
[3966658.482306] kpatch_17971294(OEK) sch_ingress xt_conntrack amdgpu(OE) amdxcp(OE) amddrm_buddy(OE) amd_sched(OE) amdttm(OE) amdkcl(OE) intel_ifs iptable_mangle tcm_loop target_core_pscsi tcp_diag target_core_file inet_diag target_core_iblock target_core_user target_core_mod coldpgs kpatch_18383292(OEK) ip6table_nat ip6table_filter ip6_tables ip_set_hash_ipportip ip_set_hash_ipportnet ip_set_hash_ipport ip_set_bitmap_port xt_comment iptable_nat nf_nat iptable_filter ip_tables ip_set ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 sn_core_odd(OE) i40e overlay binfmt_misc tun bonding(OE) aisqos(OE) aisqos_hotfixes(OE) rfkill uio_pci_generic uio cuse fuse nf_tables nfnetlink intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm idxd_mdev
[3966658.491237] vfio_pci
[3966658.501196] vfio_pci vfio_virqfd mdev vfio_iommu_type1 vfio iax_crypto intel_pmt_telemetry iTCO_wdt intel_pmt_class iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rapl intel_cstate snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep snd_seq
[3966658.508537] vfio_virqfd
[3966658.517569] snd_seq_device ipmi_ssif isst_if_mbox_pci isst_if_mmio pcspkr snd_pcm idxd intel_uncore ses isst_if_common intel_vsec idxd_bus enclosure snd_timer mei_me snd i2c_i801 i2c_smbus mei i2c_ismt soundcore joydev acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad vfat fat
[3966658.526851] mdev
[3966658.536096] nfsd auth_rpcgss nfs_acl lockd grace slb_vtoa(OE) sunrpc dm_mod hookers mlx5_ib(OE) ast i2c_algo_bit drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_ttm_helper ttm mlx5_core(OE) mlxfw(OE)
[3966658.540381] vfio_iommu_type1
[3966658.544341] nvme mpt3sas tls drm nvme_core pci_hyperv_intf raid_class psample libcrc32c crc32c_intel mlxdevm(OE) i2c_core
[3966658.551254] vfio
[3966658.558742] scsi_transport_sas wmi pinctrl_emmitsburg sd_mod t10_pi sg ahci libahci libata rdma_ucm(OE) ib_uverbs(OE) rdma_cm(OE) iw_cm(OE) ib_cm(OE) ib_umad(OE) ib_core(OE) ib_ucm(OE) mlx_compat(OE)
[3966658.563004] iax_crypto
[3966658.570988] [last unloaded: diagnose]
[3966658.571027] ---[ end trace cc9dbb180f9ae537 ]---
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: Philip.Yang<Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
There is race in amdgpu_amdkfd_device_fini_sw and interrupt.
if amdgpu_amdkfd_device_fini_sw run in b/w kfd_cleanup_nodes and
kfree(kfd), and KGD interrupt generated.
kernel panic log:
BUG: kernel NULL pointer dereference, address: 0000000000000098
amdgpu 0000:c8:00.0: amdgpu: Requesting 4 partitions through PSP
PGD d78c68067 P4D d78c68067
kfd kfd: amdgpu: Allocated 3969056 bytes on gart
PUD 1465b8067 PMD @
Oops: @002 [#1] SMP NOPTI
kfd kfd: amdgpu: Total number of KFD nodes to be created: 4
CPU: 115 PID: @ Comm: swapper/115 Kdump: loaded Tainted: G S W OE K
RIP: 0010:_raw_spin_lock_irqsave+0x12/0x40
Code: 89 e@ 41 5c c3 cc cc cc cc 66 66 2e Of 1f 84 00 00 00 00 00 OF 1f 40 00 Of 1f 44% 00 00 41 54 9c 41 5c fa 31 cO ba 01 00 00 00 <fO> OF b1 17 75 Ba 4c 89 e@ 41 Sc
89 c6 e8 07 38 5d
RSP: 0018: ffffc90@1a6b0e28 EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000018
0000000000000001 RSI: ffff8883bb623e00 RDI: 0000000000000098
ffff8883bb000000 RO8: ffff888100055020 ROO: ffff888100055020
0000000000000000 R11: 0000000000000000 R12: 0900000000000002
ffff888F2b97da0@ R14: @000000000000098 R15: ffff8883babdfo00
CS: 010 DS: 0000 ES: 0000 CRO: 0000000080050033
CR2: 0000000000000098 CR3: 0000000e7cae2006 CR4: 0000000002770ce0
0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
0000000000000000 DR6: 00000000fffeO7FO DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<IRQ>
kgd2kfd_interrupt+@x6b/0x1f@ [amdgpu]
? amdgpu_fence_process+0xa4/0x150 [amdgpu]
kfd kfd: amdgpu: Node: 0, interrupt_bitmap: 3 YcpxFl Rant tErace
amdgpu_irq_dispatch+0x165/0x210 [amdgpu]
amdgpu_ih_process+0x80/0x100 [amdgpu]
amdgpu: Virtual CRAT table created for GPU
amdgpu_irq_handler+0x1f/@x60 [amdgpu]
__handle_irq_event_percpu+0x3d/0x170
amdgpu: Topology: Add dGPU node [0x74a2:0x1002]
handle_irq_event+0x5a/@xcO
handle_edge_irq+0x93/0x240
kfd kfd: amdgpu: KFD node 1 partition @ size 49148M
asm_call_irq_on_stack+0xf/@x20
</IRQ>
common_interrupt+0xb3/0x130
asm_common_interrupt+0x1le/0x40
5.10.134-010.a1i5000.a18.x86_64 #1
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: Philip Yang<Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Reject modes with a pixel clock higher than the maximum display
clock. Use 400 MHz as a fallback value when the maximum display
clock is not known. Pixel clocks that are higher than the display
clock just won't work and are not supported.
With the addition of the YUV422 fallback, DC can now accidentally
select a mode requiring higher pixel clock than actually supported
when the DP version supports the required bandwidth but the clock
is otherwise too high for the display engine. DCE 6-10 don't
support these modes but they don't have a bandwidth calculation
to reject them properly.
Fixes: db291ed1732e ("drm/amd/display: Add fallback path for YCBCR422")
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
[Why]
amdgpu_connector_add_common_modes() has a check for the width and height
of common modes being too small, but the array of common_modes[] has fixed
values. The check is dead code.
[How]
Drop unnecessary check.
Cc: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250924161624.1975819-3-mario.limonciello@amd.com
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
[Why]
The main reason common modes are added is for compatibility with
clone mode when a laptop is connected to a projector or external
monitor. Since commit 978fa2f6d0b12 ("drm/amd/display: Use scaling
for non-native resolutions on eDP") when non-native modes are picked
for eDP the GPU scalar will be used. This is because it is inconsistent
whether eDP panels have the capability to actually drive non-native
resolutions. With panels connected to other connectors this limitation
generally doesn't exist as we the EDID will advertise support for a
number of resolutions and monitors will use built in scaling hardware.
Comparing DC and non-DC code paths the non-DC code path only adds
common modes for LVDS and eDP whereas the DC codepath does it for
all connector types.
In the past there was an experiment done to disable common mode adding
for eDP and LVDS from commit 6d396e7ac1ce3 ("drm/amd/display: Disable
common modes for LVDS") and commit 7948afb46af92 ("drm/amd/display:
Disable common modes for eDP") but this was reverted in
commit a8b79b09185de ("drm/amd: Re-enable common modes for eDP and
LVDS") because it caused problems with Xorg.
[How]
Only add common modes for eDP and LVDS for DC, matching the behavior
of non-DC.
Suggested-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250924161624.1975819-2-mario.limonciello@amd.com
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Variable "i" has been redeclared as integer later in the function
which is wrong and not serving any purpose.
Fixes: 899fbde14646 ("drm/amdgpu: replace get_user_pages with HMM mirror helpers")
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
It should return an error code if userq VA validation fails.
Fixes: 9e46b8bb0539 ("drm/amdgpu: validate userq buffer virtual address and size")
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
This reverts commit e44a0fe630c58b0a87d8281f5c1077a3479e5fce.
Initially we used VMID reservation to enforce isolation between
processes. That has now been replaced by proper fence handling.
Both OpenGL, RADV and ROCm developers requested a way to reserve a VMID
for SPM, so restore that approach by reverting back to only allowing a
single process to use the reserved VMID.
Only compile tested for now.
v2: use -ENOENT instead of -EINVAL if VMID is not available
Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Initially we enforced isolation by reserving a VMID, but that practice
was now removed.
Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Add a fallback mechanism to attempt pipe reset when KCQ reset
fails to recover the ring. After performing the KCQ reset and
queue remapping, test the ring functionality. If the ring test
fails, initiate a pipe reset as an additional recovery step.
v2: fix the typo (Lijo)
v3: try pipeline reset when kiq mapping fails (Lijo)
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
On HL338 ASICs, the Infineon first‑stage firmware is not present and
the reported version is 0. In this case printing a version number is
misleading, as it suggests valid firmware when it does not exist.
Fix this by printing the first‑stage Infineon firmware version only
if the reported value is non‑zero. This avoids confusing or incorrect
log messages on devices where the first stage is not applicable.
Signed-off-by: Pavan S <pavan.sreenivas@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
|
|
Dirty state can occur when the host VM undergoes a reset while the
device does not. In such a case, the driver must reset the device before
it can be used again. As part of this reset, the device capabilities
are zeroed. Therefore, the driver must read the Preboot status again to
learn the Preboot state, capabilities, and security configuration.
Signed-off-by: Konstantin Sinyuk <konstantin.sinyuk@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
|
|
Add a new passthrough type HL_GET_P_STATE to the cpucp generic ioctl
to allow userspace to read the device performance state via firmware.
Signed-off-by: Ariel Aviad <ariel.aviad@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
|
|
Add debugfs files for NVMe Direct I/O (HLDIO) functionality.
This interface allows userspace access to direct SSD ↔ device transfers
through debugfs nodes.
Four debugfs files are created under /sys/kernel/debug/habanalabs/hlN/:
- dio_ssd2hl : trigger SSD-to-device transfers
- dio_hl2ssd : trigger device-to-SSD transfers
(placeholder, not yet implemented)
- dio_stats : show transfer statistics
- dio_reset : reset statistics counters
Usage examples:
# Perform SSD → device transfer
echo "fd=3 va=0x10000 off=0 len=4096" > \
/sys/kernel/debug/habanalabs/hl0/dio_ssd2hl
# View statistics
cat /sys/kernel/debug/habanalabs/hl0/dio_stats
# Reset counters
echo 1 > /sys/kernel/debug/habanalabs/hl0/dio_reset
This interface provides access to HLDIO functionality for validation
and diagnostics.
Signed-off-by: Konstantin Sinyuk <konstantin.sinyuk@intel.com>
Reviewed-by: Farah Kassabri <farah.kassabri@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
|
|
Introduce NVMe Direct I/O (HLDIO) infrastructure to support
peer‑to‑peer DMA in the habanalabs driver. This adds internal helpers
and data structures to enable direct transfers between NVMe storage
and device memory.
The feature is built only when CONFIG_HL_HLDIO is enabled. A debugfs
interface is also provided for functional validation.
Signed-off-by: Konstantin Sinyuk <konstantin.sinyuk@intel.com>
Reviewed-by: Farah Kassabri <farah.kassabri@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
|
|
When IOMMU is enabled, dma_alloc_coherent() with GFP_USER may return
addresses from the vmalloc range. If such an address is mapped without
VM_MIXEDMAP, vm_insert_page() will trigger a BUG_ON due to the
VM_PFNMAP restriction.
Fix this by checking for vmalloc addresses and setting VM_MIXEDMAP
in the VMA before mapping. This ensures safe mapping and avoids kernel
crashes. The memory is still driver-allocated and cannot be accessed
directly by userspace.
Signed-off-by: Moti Haimovski <moti.haimovski@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
|
|
The access_ok() API no longer requires the VERIFY_WRITE argument,
and the use of the old interface with VERIFY_WRITE is deprecated.
Clean up the habanalabs memory manager to use the modern access_ok()
interface consistently. This removes old #ifdef guards and aligns the
driver with current upstream kernel APIs.
Signed-off-by: Ilia Levi <ilia.levi@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
|
|
After CPLD shutdown event the device is not usable anymore. The common
CPLD_SHUTDOWN event handler disables any subsequent device access.
Signed-off-by: Konstantin Sinyuk <konstantin.sinyuk@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
|
|
After a CPLD shutdown event the device becomes unusable. Prevent further
device access once this event is received.
Signed-off-by: Konstantin Sinyuk <konstantin.sinyuk@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
|
|
Fix a typo in TP_printk format string of habanalabs tracepoint:
replace "cms" with "cmd".
Signed-off-by: Tomer Tayar <tomer.tayar@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
|
|
In hl_release_dmabuf(), ctx is dereferenced after calling hl_ctx_put()
to obtain the compute device file.
This is safe because the dma-buf object holds a file reference taken in
export_dmabuf(), and the file release (which drops another ctx reference)
can only happen after we drop that file reference via fput(). Thus, this
hl_ctx_put() call cannot be the last one at this point.
Add a comment explaining this to avoid confusion.
Signed-off-by: Tomer Tayar <tomer.tayar@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
|
|
Add infrastructure for logging the last configuration register accesses
that occur via debugfs read/write operations. At interrupt time, these
log entries can be dumped to dmesg, which helps in diagnosing the cause
of RAZWI and ADDR_DEC interrupts.
The logging is implemented as a ring buffer of access entries, with each
entry recording timestamp and access details. To ensure correctness
under concurrent access, operations are now protected using spinlocks.
Entries are copied under lock and then printed after releasing it, which
minimizes time spent in the critical section.
Signed-off-by: Sharley Calzolari <sharley.calzolari@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
|
|
Print engine/queue names instead of numerical engine/queue IDs to make
logs and debug output more readable.
Signed-off-by: Ariel Suller <ariel.suller@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
|
|
Add a new CPUCP generic message type to retrieve HBM, SRAM and critical
error counters from the device.
Signed-off-by: Vitaly Margolin <vitaly.margolin@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
|
|
Change the BMON_CR register value back to its original state before
enabling, so that BMON does not continue to collect information
after being disabled.
Signed-off-by: Vered Yavniely <vered.yavniely@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
|
|
EFAULT is currently returned if less than requested user pages are
pinned. This value means a "bad address" which might be confusing to
the user, as the address of the given user memory is not necessarily
"bad".
Modify the return value to ENOMEM, as "out of memory" is more suitable
in this case.
Signed-off-by: Tomer Tayar <tomer.tayar@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
|
|
This commit adds support for VCN reset functionality in SMU v13.0.12 by:
1. Adding two new PPSMC messages in smu_v13_0_12_ppsmc.h:
- PPSMC_MSG_ResetVCN (0x5E)
- Updates PPSMC_Message_Count to 0x5F to account for new messages
2. Adding message mapping for ResetVCN in smu_v13_0_12_ppt.c:
- Maps SMU_MSG_ResetVCN to PPSMC_MSG_ResetVCN
These changes enable proper VCN reset handling through the SMU firmware
interface for compatible AMD GPUs.
v2: Added fw version check to support vcn queue reset.
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Sonny Jiang <sonny.jiang@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
This patch moves the initialization of the VCN supported_reset mask from
sw_init to a new late_init function for VCN 5.0.1. The change ensures
that all necessary hardware and firmware initialization is complete
before determining the supported reset types.
Key changes:
- Added vcn_v5_0_1_late_init() function to handle late initialization
- Moved supported_reset mask setup from sw_init to late_init
- Added check for per-queue reset support via amdgpu_dpm_reset_vcn_is_supported()
- Updated ip_funcs to use the new late_init function
This change helps ensure proper reset behavior by waiting until all
dependencies are initialized before determining available reset types.
Reviewed-by: Sonny Jiang <sonny.jiang@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Ruili Ji <ruiliji2@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Implement the ring reset callback for VCN v5.0.1 to properly handle
hardware recovery when encountering GPU hangs. The new functionality:
1. Adds vcn_v5_0_1_ring_reset() function that:
- Prepares for reset using amdgpu_ring_reset_helper_begin()
- Performs VCN instance reset via amdgpu_dpm_reset_vcn()
- Re-initializes hardware through vcn_v5_0_1_hw_init_inst()
- Restarts DPG mode with vcn_v5_0_1_start_dpg_mode()
- Completes reset with amdgpu_ring_reset_helper_end()
2. Hooks the reset function into the unified ring functions via:
- Adding .reset = vcn_v5_0_1_ring_reset to vcn_v5_0_1_unified_ring_vm_funcs
3. Maintains existing behavior for SR-IOV VF cases by checking RRMT status
This provides proper hardware recovery capabilities for VCN 5.0.1 IP block
during fault conditions, matching functionality available in other VCN versions.
v2: Remove the RRMT_ENABLED cap setting in the reset function
and replace adev->vcn.inst[ring->me].indirect_sram with vinst->indirect_sram (Lijo)
Reviewed-by: Sonny Jiang <sonny.jiang@amd.com>
Suggested-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Ruili Ji <ruiliji2@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Split the per-instance initialization code from vcn_v5_0_1_hw_init()
into a new vcn_v5_0_1_hw_init_inst() function. This improves code
organization by:
1. Separating the instance-specific initialization logic
2. Making the main init function more readable
3. Following the pattern used in queue reset
The SR-IOV specific initialization remains in the main function since
it has different requirements.
Reviewed-by: Sonny Jiang <sonny.jiang@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Ruili Ji <ruiliji2@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Add an initial documentation page for user mode queues.
Reviewed-by: Rodrigo Siqueira <siqueira@igalia.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Remove this flag as the driver stopped managing it individually since
commit a4056c2a6344 ("drm/amd/display: use HW hdr mult for brightness
boost"). After some back and forth it was reintroduced as a condition to
`set_output_transfer_func()` in [1]. Without direct management, this
flag only changes value when all surface update flags are set true on
UPDATE_TYPE_FULL with no output TF status meaning.
Fixes: bb622e0c0044 ("drm/amd/display: program output tf when required") [1]
Signed-off-by: Melissa Wen <mwen@igalia.com>
Reviewed-by: Alex Hung <alex.hung@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|