linux/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h, branch v6.12

drm/amdgpu: Disable dpm_enabled flag while VF is in reset

2024-08-13T16:12:52Z

VFs do not perform HW fini/suspend in FLR, so dpm_enabled is incorrectly kept enabled. Add an interface to disable it in the virt_pre_reset call.

v2: Made implementation generic for all asics
v3: Re-order conditionals so PP_MP1_STATE_FLR is only evaluated on VF

Signed-off-by: Victor Skvortsov
Reviewed-by: Lijo Lazar
Signed-off-by: Alex Deucher
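The behaviour described in this message can be illustrated with a small user-space sketch. It is not the driver's code: `struct vf_device`, `enum mp1_state`, and `vf_pre_reset` are hypothetical names invented here, and the only point shown is that the VF check comes first, so the FLR state is evaluated on a VF only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reset-state tag; stands in for values such as PP_MP1_STATE_FLR. */
enum mp1_state { MP1_STATE_NONE, MP1_STATE_FLR };

/* Simplified, hypothetical device state (not the real struct amdgpu_device). */
struct vf_device {
    bool is_vf;        /* true when running as an SR-IOV virtual function */
    bool dpm_enabled;  /* power-management "enabled" flag */
};

/* Hypothetical pre-reset hook: clear dpm_enabled when a VF enters FLR. */
static void vf_pre_reset(struct vf_device *dev, enum mp1_state state)
{
    assert(dev != NULL);
    /* Test is_vf first so the FLR state is only evaluated on a VF. */
    if (dev->is_vf && state == MP1_STATE_FLR)
        dev->dpm_enabled = false;
}
```

In this model a bare-metal device keeps its flag untouched, because its normal fini/suspend path is assumed to clear it.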

drm/amdgpu: process RAS fatal error MB notification

2024-06-27T21:31:37Z

For the RAS error scenario, the VF guest driver checks the mailbox and sets the fed flag to avoid unnecessary HW accesses. Additionally, poll for the reset completion message first to avoid accidentally spamming multiple reset requests to the host.

v2: add another mailbox check for handling case where kfd detects timeout first
v3: set host_flr bit and use wait_for_reset

Signed-off-by: Vignesh Chander
Reviewed-by: Zhigang Luo
Signed-off-by: Alex Deucher
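A rough sketch of the ordering described here, under heavy assumptions: the mailbox is modelled as two flags and a counter, and every name (`struct vf_mailbox`, `struct vf_state`, `vf_handle_ras_fatal`) is hypothetical. It only shows the idea of latching the fed flag and checking for a completion message before asking the host for another reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mailbox model: two pending-message flags and a counter. */
struct vf_mailbox {
    bool fatal_error_pending;         /* host reported a RAS fatal error */
    bool reset_complete_pending;      /* host already finished a reset */
    unsigned int reset_requests_sent; /* requests issued to the host */
};

/* Hypothetical guest-side state; "fed" = fatal error detected. */
struct vf_state {
    bool fed;
};

/*
 * Called when the guest suspects an error.
 * Returns true if a new reset request was sent to the host.
 */
static bool vf_handle_ras_fatal(struct vf_state *st, struct vf_mailbox *mb)
{
    assert(st != NULL && mb != NULL);

    /* Latch the flag so later code can skip unnecessary HW accesses. */
    if (mb->fatal_error_pending) {
        st->fed = true;
        mb->fatal_error_pending = false;
    }

    /* Consume a completion message first: no new request is needed. */
    if (mb->reset_complete_pending) {
        mb->reset_complete_pending = false;
        return false;
    }

    mb->reset_requests_sent++;
    return true;
}
```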

drm/amdgpu: fix sriov host flr handler

2024-06-14T20:15:58Z

We send back the ready to reset message before we stop anything. This is wrong. Move it to when we are actually ready for the FLR to happen.

In the current state, since we take tens of seconds to stop everything, it is very likely that the host would give up waiting and reset the GPU before we send ready, so it would be the same as before. But this gets rid of the hack with reset_domain locking and also lets us tell from the host how slow ready to reset actually is. The ready to reset speed can be improved later.

Signed-off-by: Yunxiang Li
Acked-by: Christian König
Reviewed-by: Emily Deng
Signed-off-by: Alex Deucher

drm/amdgpu: Add lock around VF RLCG interface

2024-05-29T18:48:30Z

flush_gpu_tlb may be called from another thread while device_gpu_recover is running. Both of these threads access registers through the VF RLCG interface during VF Full Access. Add a lock around this interface to prevent race conditions between these threads.

Signed-off-by: Victor Skvortsov
Reviewed-by: Zhigang Luo
Signed-off-by: Alex Deucher
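The locking pattern can be sketched in user space with a POSIX mutex standing in for the kernel lock. This is an illustration only: `struct rlcg_iface`, `rlcg_init`, and `rlcg_reg_rw` are hypothetical names, and the register file is a plain array rather than real hardware. The point is that the whole request/response transaction sits inside one critical section.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RLCG_NUM_REGS 16u  /* hypothetical size of the modelled register file */

/* Hypothetical model of a shared request/response register interface. */
struct rlcg_iface {
    pthread_mutex_t lock;          /* serializes whole transactions */
    uint32_t regs[RLCG_NUM_REGS];  /* stand-in for hardware registers */
};

static void rlcg_init(struct rlcg_iface *r)
{
    assert(r != NULL);
    pthread_mutex_init(&r->lock, NULL);
    memset(r->regs, 0, sizeof(r->regs));
}

/* Write (if requested) and read back one register under the lock. */
static uint32_t rlcg_reg_rw(struct rlcg_iface *r, uint32_t offset,
                            uint32_t value, bool write)
{
    uint32_t ret;

    assert(r != NULL && offset < RLCG_NUM_REGS);

    pthread_mutex_lock(&r->lock);
    if (write)
        r->regs[offset] = value;
    ret = r->regs[offset];
    pthread_mutex_unlock(&r->lock);

    return ret;
}
```

With this shape, two callers (for example a TLB flush and a recovery path) cannot interleave the steps of their transactions, at the cost of serializing all accesses through the interface.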

amd/amdgpu: improve VF recover time

2024-04-10T02:14:30Z

1. Change AMDGPU_VF2PF_UPDATE_MAX_RETRY_LIMIT from 30 to 5.
2. Set the fatal error detected flag.

Signed-off-by: Zhigang Luo
Reviewed-by: Lijo Lazar
Signed-off-by: Alex Deucher

drm/amd/amdgpu: support MES command SET_HW_RESOURCE1 in sriov

2024-04-10T02:08:53Z

Support MES command SET_HW_RESOURCE1 in SR-IOV.

Signed-off-by: chongli2
Reviewed-by: Jingwen Chen
Acked-by: Jingwen Chen
Signed-off-by: Alex Deucher

drm/amdgpu: trigger flr_work if reading pf2vf data failed

2024-03-20T17:38:13Z

If reading pf2vf data fails 30 times in a row, something is wrong, so trigger flr_work to recover. Also use dev_err to print the error message, so that it shows which device has the issue, and add a warning message if waiting for IDH_FLR_NOTIFICATION_CMPL times out.

Signed-off-by: Zhigang Luo
Acked-by: Hawking Zhang
Signed-off-by: Alex Deucher
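The consecutive-failure rule can be sketched as a small counter. This is a simplified model, not the driver's implementation: `struct pf2vf_monitor` and `pf2vf_record_read` are hypothetical names, the threshold is a field so any limit can be modelled, and "triggering flr_work" is reduced to incrementing a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping for periodic pf2vf reads. */
struct pf2vf_monitor {
    unsigned int retry_limit;   /* consecutive failures tolerated, e.g. 30 */
    unsigned int failures;      /* current run of failed reads */
    unsigned int flr_requests;  /* times recovery was triggered */
};

/*
 * Record the outcome of one read attempt.
 * Returns true when the failure run reaches the limit, which in this
 * model stands for scheduling the recovery work.
 */
static bool pf2vf_record_read(struct pf2vf_monitor *m, bool read_ok)
{
    assert(m != NULL && m->retry_limit > 0);

    if (read_ok) {
        m->failures = 0;  /* only consecutive failures count */
        return false;
    }

    if (++m->failures < m->retry_limit)
        return false;

    m->failures = 0;      /* start a fresh run after triggering recovery */
    m->flr_requests++;
    return true;
}
```

A single successful read resets the run, so intermittent failures never reach the threshold in this sketch.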

drm/amdgpu: Improve error checking in amdgpu_virt_rlcg_reg_rw (v2)

2024-02-22T15:27:23Z

The current error detection only looks for a timeout. This should be changed to also check scratch_reg1 for any errors returned from RLCG.

v2: remove new error value

Signed-off-by: Victor Lu
Acked-by: Alex Deucher
Signed-off-by: Alex Deucher
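A minimal sketch of the two-stage check, under stated assumptions: the error bit layout is invented (`HYP_RLCG_ERROR_MASK` is a hypothetical mask, not a value from the driver), and `struct rlcg_result`, `enum rlcg_status`, and `rlcg_check` are illustrative names. It only shows that a completed transaction is still inspected for an error reported in the scratch register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical error bits in scratch_reg1; the real layout may differ. */
#define HYP_RLCG_ERROR_MASK 0xF0000000u

enum rlcg_status {
    RLCG_STATUS_OK,
    RLCG_STATUS_TIMEOUT,
    RLCG_STATUS_ERROR
};

/* Hypothetical snapshot of one finished (or abandoned) transaction. */
struct rlcg_result {
    bool completed;         /* false if the wait loop timed out */
    uint32_t scratch_reg1;  /* value read back after the transaction */
};

static enum rlcg_status rlcg_check(const struct rlcg_result *res)
{
    assert(res != NULL);

    /* The timeout check alone was the previous behaviour. */
    if (!res->completed)
        return RLCG_STATUS_TIMEOUT;

    /* Additionally look for an error reported by the firmware. */
    if (res->scratch_reg1 & HYP_RLCG_ERROR_MASK)
        return RLCG_STATUS_ERROR;

    return RLCG_STATUS_OK;
}
```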

drm/amdgpu: Support passing poison consumption ras block to SRIOV

2024-01-25T19:58:03Z

Support passing poison consumption ras blocks to SRIOV.

Signed-off-by: YiPeng Chai
Reviewed-by: Hawking Zhang
Signed-off-by: Alex Deucher

drm/amdgpu: move kiq_reg_write_reg_wait() out of amdgpu_virt.c

2024-01-15T23:35:36Z

It's used for more than just SR-IOV now, so move it to amdgpu_gmc.c and rename it to better match the functionality. Update the comments in the code paths to better document when each path is used and why. No functional change.

Reviewed-by: Shaoyun.liu
Acked-by: Christian König
Signed-off-by: Alex Deucher
Cc: Shaoyun.Liu@amd.com
Cc: Christian.Koenig@amd.com