linux/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c, branch v5.1

drm/amdgpu: tighten gpu_recover in mailbox_flr to avoid duplicate recover in sriov

2019-02-13T22:50:13Z

The SR-IOV gpu_recover call inside xgpu_ai_mailbox_flr_work could cause a duplicate recovery during TDR, because TDR's gpu_recover is already triggered by amdgpu_job_timedout. Tightening the condition avoids vk-cts failures caused by the unexpected extra recovery. Signed-off-by: Wentao Lou Acked-by: Andrey Grodzovsky Signed-off-by: Alex Deucher
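
The commit text does not spell out the tightened condition. One plausible reading, sketched below as a userspace model, is that the FLR work item only starts a recovery when the job-timeout (TDR) path cannot do so itself. The function name `flr_work_should_recover`, its parameters, and the `JOB_TIMEOUT_DISABLED` constant are hypothetical and are not taken from the driver.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for "no job timeout configured", i.e. TDR is off. */
#define JOB_TIMEOUT_DISABLED LONG_MAX

/*
 * Decide whether the FLR work item should start a recovery on its own.
 * Assumption: when a finite job timeout is set, the timed-out job handler
 * will trigger the recovery, so starting one here would duplicate it.
 */
static bool flr_work_should_recover(bool recovery_allowed, long job_timeout)
{
    assert(job_timeout > 0);
    return recovery_allowed && job_timeout == JOB_TIMEOUT_DISABLED;
}
```

With a finite timeout such as `flr_work_should_recover(true, 10000)` the sketch leaves the recovery to the TDR path and returns false.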

drm/amdgpu/sriov:Correct pfvf exchange logic

2019-01-14T20:04:24Z

The pfvf exchange needs to run in exclusive mode. Also add the pfvf exchange to GPU reset. Signed-off-by: Emily Deng Reviewed-By: Xiangliang Yu Signed-off-by: Alex Deucher

drm/amdgpu: cleanup GPU recovery check a bit (v2)

2018-08-27T16:11:16Z

Check if we should call the function instead of providing the forced flag. v2: rebase on KFD changes (Alex) Signed-off-by: Christian König Acked-by: Andrey Grodzovsky Reviewed-by: Huang Rui Signed-off-by: Alex Deucher

drm/amdgpu/sriov: Need to set in_gpu_reset flag to back after gpu reset

2018-05-15T18:44:06Z

After the host OS resets the GPU, the in_gpu_reset flag needs to be set back to zero. Signed-off-by: Emily Deng Reviewed-by: Monk Liu Signed-off-by: Alex Deucher

drm/amdgpu: fix spelling mistake: "asssert" -> "assert"

2018-03-22T19:43:43Z

Trivial fix to spelling mistake in pr_err error message text Acked-by: Christian König Signed-off-by: Colin Ian King Signed-off-by: Alex Deucher

drm/amdgpu: Move IH clientid defs to separate file

2018-03-14T20:16:35Z

This is preparation for sharing client ID definitions between amdgpu and amdkfd Signed-off-by: Oak Zeng Reviewed-by: Chunming Zhou Acked-by: Alex Deucher Signed-off-by: Alex Deucher

drm/amdgpu: refactoring mailbox to fix TDR handshake bugs(v2)

2018-03-14T19:38:27Z

This patch refactors the mailbox implementation. All of the changes below are needed together to fix the mailbox handshake issues exposed by heavy TDR testing.
1) Refactor all mailbox functions to use byte access for mb_control. This avoids touching unrelated bits when writing the trn/rcv part of mailbox_control, which prevents some incorrect INTRs from being sent to the hypervisor side and fixes a couple of handshake bugs.
2) Reimplement the trans_msg function: add an invalidate step before transmitting a message to make sure the ACK bit is clear. Otherwise the ACK may already be asserted before the message is transmitted, which leads to a fake ACK being polled. (The hypervisor side has some tricks to work around the ACK bit being corrupted by VF FLR, with the side effect that the guest-side ACK bit may be wrongly asserted.) Also clear the TRANS_MSG words after the message has been transferred.
3) Rework mailbox_flr_work: it takes the mutex lock first when invoked, to keep gpu recover from participating too early while the hypervisor side is doing VF FLR. (The hypervisor sends FLR_NOTIFY to the guest before doing VF FLR and sends FLR_COMPLETE after VF FLR is done; FLR_NOTIFY triggers an interrupt in the guest, which leads to mailbox_flr_work being invoked.) This avoids the mailbox trans msg being cleared by the VF FLR.
4) The mailbox_rcv_irq IRQ routine should only peek at the msg and schedule mailbox_flr_work instead of sending an ACK to the hypervisor itself, because the FLR_NOTIFY msg sent from the hypervisor side does not need the VF's ACK. (The VF's ACK would make the hypervisor clear its trans_valid/msg, which causes a handshake bug if trans_valid/msg is cleared by a wrong VF ACK, such as one for FLR_NOTIFY, rather than by a correct one.) This fixes the handshake bug where the guest sometimes never received the "READY_TO_ACCESS_GPU" msg from the hypervisor.
5) Separate the polling time limits:
   POLL ACK costs no more than 500ms
   POLL MSG costs no more than 12000ms
   POLL FLR finish costs no more than 500ms
6) We still need to set adev into in_gpu_reset mode after receiving FLR_NOTIFY from the host side. This prevents an innocent app from wrongly succeeding in opening the amdgpu dri device. FLR_NOTIFY is received because the hypervisor side detected an IDLE hang, which indicates the GPU has already died in this VF.
v2:
   use MACRO as the offset of the mailbox_control register
   don't test for the NOTIFY_CMPL event in rcv_msg since it won't receive that message anymore
Signed-off-by: Monk Liu Reviewed-by: Pixel Ding Signed-off-by: Alex Deucher
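
Points 1, 2 and 5 can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the driver code: the byte layout, the bit values, the `struct mailbox` type, the simulated hosts, and every function name are hypothetical. Only the three poll budgets come from the commit message, and the bounded wait for a stale ACK is a simplification made for the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: one control byte per direction. */
enum { MB_TRN_BYTE = 0, MB_RCV_BYTE = 1 };
enum { MB_VALID = 0x01, MB_ACK = 0x02 };

/* Poll budgets quoted in the commit message, in milliseconds. */
enum { POLL_ACK_MS = 500, POLL_MSG_MS = 12000, POLL_FLR_MS = 500 };

struct mailbox {
    uint8_t control[4];                    /* models the 32-bit control register */
    uint32_t trn_msg;                      /* models the TRANS_MSG words */
    void (*host_tick)(struct mailbox *mb); /* simulated host, run once per "ms" */
};

/* Simulated host: acknowledges a valid message, drops ACK otherwise. */
static void host_responsive(struct mailbox *mb)
{
    if (mb->control[MB_TRN_BYTE] & MB_VALID)
        mb->control[MB_TRN_BYTE] |= MB_ACK;
    else
        mb->control[MB_TRN_BYTE] &= (uint8_t)~MB_ACK;
}

/* Simulated host that never reacts. */
static void host_silent(struct mailbox *mb) { (void)mb; }

/* Change only VALID in the transmit byte; the receive byte is never written. */
static void mb_set_valid(struct mailbox *mb, bool valid)
{
    uint8_t ack = mb->control[MB_TRN_BYTE] & MB_ACK;
    mb->control[MB_TRN_BYTE] = (uint8_t)(ack | (valid ? MB_VALID : 0));
}

static bool mb_peek_ack(const struct mailbox *mb)
{
    return (mb->control[MB_TRN_BYTE] & MB_ACK) != 0;
}

/* Wait until ACK equals `want`, giving the host one tick per simulated ms. */
static int mb_wait_ack(struct mailbox *mb, bool want, int budget_ms)
{
    assert(budget_ms >= 0);
    for (int waited = 0; mb_peek_ack(mb) != want; waited++) {
        if (waited >= budget_ms)
            return -1; /* budget exhausted */
        mb->host_tick(mb);
    }
    return 0;
}

/* Transmit one request word; returns 0 on ACK, -1 on timeout. */
static int mb_trans_msg(struct mailbox *mb, uint32_t req)
{
    int r;

    mb_set_valid(mb, false);                 /* invalidate before sending */
    if (mb_wait_ack(mb, false, POLL_ACK_MS)) /* refuse to send over a stale ACK */
        return -1;
    mb->trn_msg = req;
    mb_set_valid(mb, true);
    r = mb_wait_ack(mb, true, POLL_ACK_MS);
    mb->trn_msg = 0;                         /* clear the message after transfer */
    mb_set_valid(mb, false);
    return r;
}
```

In this model a transmit against `host_responsive` succeeds and leaves the receive byte exactly as it was, while a mailbox that starts with a stale ACK and a `host_silent` host fails after the ACK budget instead of mistaking the stale bit for a real acknowledgement.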

drm/amdgpu: rename amdgpu_gpu_recover

2017-12-18T15:59:58Z

add device to the name for consistency. Acked-by: Christian König Signed-off-by: Alex Deucher

drm/amdgpu: Simplify amdgpu_lockup_timeout usage.

2017-12-15T22:15:00Z

With the introduction of amdgpu_gpu_recovery we no longer need to rely on amdgpu_lockup_timeout == 0 for disabling GPU reset. Signed-off-by: Andrey Grodzovsky Reviewed-by: Christian König Signed-off-by: Alex Deucher

drm/amdgpu: Add gpu_recovery parameter

2017-12-15T22:14:50Z

Add new parameter to control GPU recovery procedure. v2: Add auto logic where reset is disabled for bare metal and enabled for SR-IOV. Allow forced reset from debugfs. Signed-off-by: Andrey Grodzovsky Reviewed-by: Christian König Signed-off-by: Alex Deucher
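
The auto logic described above can be sketched as a small decision function. This is a userspace illustration, not the driver's implementation: the function name `should_recover_gpu`, the enum names, and the -1/0/1 encoding of the parameter are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed parameter encoding: -1 = auto, 0 = disabled, 1 = enabled. */
enum { GPU_RECOVERY_AUTO = -1, GPU_RECOVERY_OFF = 0, GPU_RECOVERY_ON = 1 };

/*
 * Decide whether a GPU recovery should run.
 * forced      - reset explicitly requested (for example from debugfs)
 * is_sriov_vf - true when running as an SR-IOV virtual function
 */
static bool should_recover_gpu(int gpu_recovery, bool is_sriov_vf, bool forced)
{
    assert(gpu_recovery >= GPU_RECOVERY_AUTO && gpu_recovery <= GPU_RECOVERY_ON);

    if (forced)
        return true;           /* a forced reset bypasses the parameter */
    if (gpu_recovery == GPU_RECOVERY_AUTO)
        return is_sriov_vf;    /* auto: off on bare metal, on for SR-IOV */
    return gpu_recovery == GPU_RECOVERY_ON;
}
```

For instance, `should_recover_gpu(GPU_RECOVERY_AUTO, false, false)` models bare metal in auto mode and returns false, while the same call with `is_sriov_vf` set to true returns true.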