linux/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c, branch v6.12

drm/amdgpu: Disable dpm_enabled flag while VF is in reset

2024-08-13T16:12:52Z

VFs do not perform HW fini/suspend during FLR, so the dpm_enabled flag is incorrectly left set. Add an interface to clear it in the virt_pre_reset call.

v2: Made implementation generic for all ASICs
v3: Re-order conditionals so PP_MP1_STATE_FLR is only evaluated on VF

Signed-off-by: Victor Skvortsov
Reviewed-by: Lijo Lazar
Signed-off-by: Alex Deucher

drm/amdgpu/mes: add multiple mes ring instances support

2024-08-13T14:29:25Z

Add multiple mes ring instances in mes structure to support multiple mes pipes.

Signed-off-by: Jack Xiao
Acked-by: Alex Deucher
Reviewed-by: Hawking Zhang
Signed-off-by: Alex Deucher

drm/amdgpu: Set no_hw_access when VF request full GPU fails

2024-07-08T20:46:56Z

[Why]
If a VF requests full GPU access and the request fails, the VF driver can get stuck accessing registers for an extended period during KMS unload.

[How]
Set the no_hw_access flag when the VF's request for full GPU access fails. This prevents further hardware access attempts and avoids the prolonged stuck state.

Signed-off-by: Yifan Zha
Acked-by: Alex Deucher
Signed-off-by: Alex Deucher
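The [How] step can be illustrated with a minimal userspace sketch. This is not the driver's code: `struct toy_dev`, `toy_request_full_gpu` and `toy_rreg` are hypothetical names invented for this example, and only the idea of latching a no_hw_access flag on a failed request comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for device state; not the real struct amdgpu_device. */
struct toy_dev {
    bool no_hw_access;   /* once set, register helpers skip the hardware */
    uint32_t regs[4];    /* fake register file */
    int hw_reads;        /* counts reads that actually reached "hardware" */
};

/*
 * Hypothetical request for full GPU access. host_grants simulates the
 * host's answer. Returns 0 on success, non-zero on failure, and latches
 * no_hw_access on failure so later register accesses are skipped.
 */
static int toy_request_full_gpu(struct toy_dev *dev, bool host_grants)
{
    int r = host_grants ? 0 : -1;

    if (r)
        dev->no_hw_access = true;
    return r;
}

/* Register read helper that honours the flag instead of touching hardware. */
static uint32_t toy_rreg(struct toy_dev *dev, unsigned int idx)
{
    assert(idx < 4);
    if (dev->no_hw_access)
        return 0;
    dev->hw_reads++;
    return dev->regs[idx];
}
```

In this sketch, teardown code that still calls `toy_rreg` after a failed request returns immediately rather than waiting on hardware the VF does not own.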

drm/amdgpu: process RAS fatal error MB notification

2024-06-27T21:31:37Z

In a RAS error scenario, the VF guest driver checks the mailbox and sets the fed flag to avoid unnecessary HW accesses. Additionally, poll for the reset completion message first to avoid accidentally spamming multiple reset requests to the host.

v2: add another mailbox check for handling case where kfd detects timeout first
v3: set host_flr bit and use wait_for_reset

Signed-off-by: Vignesh Chander
Reviewed-by: Zhigang Luo
Signed-off-by: Alex Deucher

drm/amdgpu: fix sriov host flr handler

2024-06-14T20:15:58Z

We send back the ready to reset message before we stop anything. This is wrong. Move it to when we are actually ready for the FLR to happen.

In the current state, since we take tens of seconds to stop everything, it is very likely that the host would give up waiting and reset the GPU before we send ready, so it would be the same as before. But this gets rid of the hack with reset_domain locking and also lets us tell from the host how slow ready to reset actually is. The ready to reset speed can be improved later.

Signed-off-by: Yunxiang Li
Acked-by: Christian König
Reviewed-by: Emily Deng
Signed-off-by: Alex Deucher

drm/amdgpu: add skip_hw_access checks for sriov

2024-06-14T20:15:58Z

Accessing registers via host is missing the check for skip_hw_access and the lockdep check that comes with it.

Signed-off-by: Yunxiang Li
Reviewed-by: Christian König
Signed-off-by: Alex Deucher

drm/amdgpu: fix failure mapping legacy queue when FLR

2024-06-05T15:25:14Z

Flag "mes.ring.sched.ready" is set to true after mes hw init and to false on mes hw fini, to avoid duplicate initialization. But hw fini is not called on function level reset, so mes hw init is skipped during FLR, which causes mapping the legacy queue to fail. Setting this flag to false in post reset fixes this issue.

Signed-off-by: Lin.Cao
Acked-by: Alex Deucher
Signed-off-by: Alex Deucher
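The interaction between the ready flag and FLR can be sketched as a small state model. All names here (`struct toy_ring`, `toy_hw_init`, `toy_hw_fini`, `toy_post_reset`) are hypothetical and only mirror the behaviour described in the message above; the real driver functions differ.

```c
#include <stdbool.h>

/* Hypothetical model of a ring whose init is guarded by a ready flag. */
struct toy_ring {
    bool ready;       /* true once hw init has run */
    int init_count;   /* how many times hardware was really initialised */
};

/* Init is skipped while ready is set, to avoid duplicate initialisation. */
static void toy_hw_init(struct toy_ring *ring)
{
    if (ring->ready)
        return;
    ring->init_count++;
    ring->ready = true;
}

/* Normal teardown clears the flag, but this path is not taken on FLR. */
static void toy_hw_fini(struct toy_ring *ring)
{
    ring->ready = false;
}

/*
 * The fix, as modelled here: clear the flag in the post-reset path so the
 * next toy_hw_init() re-initialises hardware that the reset wiped.
 */
static void toy_post_reset(struct toy_ring *ring)
{
    ring->ready = false;
}
```

Without the `toy_post_reset` step, a second `toy_hw_init` after a reset would return early and leave the model with stale state, which corresponds to the skipped init the commit describes.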

drm/amdgpu: Add lock around VF RLCG interface

2024-05-29T18:48:30Z

flush_gpu_tlb may be called from another thread while device_gpu_recover is running. Both of these threads access registers through the VF RLCG interface during VF Full Access. Add a lock around this interface to prevent race conditions between these threads.

Signed-off-by: Victor Skvortsov
Reviewed-by: Zhigang Luo
Signed-off-by: Alex Deucher
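Why a lock is needed can be seen in a minimal sketch of a multi-step register handshake. This assumes the interface stages an address and a value in shared scratch slots before a "firmware" step consumes them; that layout, the names `struct toy_rlcg`, `toy_fw_service` and `toy_rlcg_wreg`, and the use of a userspace pthread mutex in place of a kernel lock are all assumptions made for this example.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical model of an indirect register interface shared by threads. */
struct toy_rlcg {
    pthread_mutex_t lock;    /* serialises whole transactions */
    uint32_t scratch_addr;   /* staging slot: target register index */
    uint32_t scratch_data;   /* staging slot: value to write */
    uint32_t target[8];      /* registers only the "firmware" may touch */
};

/* Simulated firmware step: consume the staged address and data. */
static void toy_fw_service(struct toy_rlcg *r)
{
    r->target[r->scratch_addr % 8] = r->scratch_data;
}

/*
 * One transaction spans several steps on shared staging slots. Holding
 * the lock across all of them prevents a second thread from overwriting
 * scratch_addr or scratch_data between staging and servicing.
 */
static void toy_rlcg_wreg(struct toy_rlcg *r, uint32_t addr, uint32_t val)
{
    pthread_mutex_lock(&r->lock);
    r->scratch_addr = addr;
    r->scratch_data = val;
    toy_fw_service(r);
    pthread_mutex_unlock(&r->lock);
}
```

A caller would initialise the lock with `PTHREAD_MUTEX_INITIALIZER` and route every access, from any thread, through `toy_rlcg_wreg`.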

drm/amdgpu: Queue KFD reset workitem in VF FED

2024-05-20T20:20:25Z

The guest recovery sequence is buggy on fatal error when both FLR & KFD reset workitems are queued at the same time. In addition, the FLR guest recovery sequence is out of order when PF/VF communication breaks due to a GPU fatal error.

As a temporary workaround, perform a KFD style reset (initiate reset request from the guest) inside the pf2vf thread on FED.

Signed-off-by: Victor Skvortsov
Reviewed-by: Zhigang Luo
Signed-off-by: Alex Deucher

drm/amdgpu: Fix two reset triggered in a row

2024-05-02T19:40:44Z

Sometimes a hung GPU causes multiple reset sources to schedule resets. The second source will be able to trigger an unnecessary reset if it schedules after we call amdgpu_device_stop_pending_resets.

Move amdgpu_device_stop_pending_resets to after the reset is done. Since at this point the GPU is supposedly in a good state, any reset scheduled after this point would be a legitimate reset.

Remove unnecessary and incorrect checks for amdgpu_in_reset that were kind of serving this purpose.

Signed-off-by: Yunxiang Li
Reviewed-by: Lijo Lazar
Signed-off-by: Alex Deucher