linux/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c, branch v6.0

drm/amdgpu: support reset flag set for gpu reset

2022-07-13T15:25:17Z

Move reset_context out of the gpu recover function so that it can be configured for different reset purposes. When a reset is triggered through the gpu_recovery sysfs interface, force the full reset method. Otherwise, try a soft reset by default if the ASIC supports it, and fall back to a full reset if the soft reset fails.

Signed-off-by: Likun Gao
Reviewed-by: Hawking Zhang
Signed-off-by: Alex Deucher
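
The policy described in this message can be sketched in plain C. This is a minimal illustration, not the driver's code: every `sketch_` name, the flag bit, and the `soft_reset_succeeds` field are hypothetical stand-ins, since the log does not show the real identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bit; the driver's real flag names do not appear in this log. */
#define SKETCH_NEED_FULL_RESET (1UL << 0)

enum sketch_reset_method { SKETCH_SOFT_RESET, SKETCH_FULL_RESET };

/* The caller owns the context and configures it before calling recover. */
struct sketch_reset_context {
	unsigned long flags;
};

struct sketch_device {
	bool soft_reset_supported;
	bool soft_reset_succeeds; /* stands in for the hardware outcome */
};

/* Returns the reset method that finally handled the recovery. */
enum sketch_reset_method sketch_gpu_recover(const struct sketch_device *dev,
					    struct sketch_reset_context *ctx)
{
	assert(dev != NULL && ctx != NULL);

	if (!(ctx->flags & SKETCH_NEED_FULL_RESET) && dev->soft_reset_supported) {
		if (dev->soft_reset_succeeds)
			return SKETCH_SOFT_RESET;
		/* Soft reset failed: escalate to a full reset. */
		ctx->flags |= SKETCH_NEED_FULL_RESET;
	}
	return SKETCH_FULL_RESET;
}

/* Builds the context the way each caller would, then recovers. */
enum sketch_reset_method sketch_recover_from(bool via_sysfs, bool soft_supported,
					     bool soft_succeeds)
{
	struct sketch_device dev = { soft_supported, soft_succeeds };
	struct sketch_reset_context ctx = { 0 };

	if (via_sysfs)
		ctx.flags |= SKETCH_NEED_FULL_RESET; /* sysfs path forces a full reset */
	return sketch_gpu_recover(&dev, &ctx);
}
```

Because the context is built by the caller, each entry point can choose its own policy without the recover function knowing who called it.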

drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover

2022-06-10T19:26:12Z

We removed the wrapper that queued the recover function into the reset domain queue and that had been using this name.

Signed-off-by: Andrey Grodzovsky
Reviewed-by: Christian König
Signed-off-by: Alex Deucher

drm/amdgpu: Move in_gpu_reset into reset_domain

2022-02-09T17:17:57Z

We should have a single instance per entire reset domain.

Signed-off-by: Andrey Grodzovsky
Suggested-by: Lijo Lazar
Reviewed-by: Christian König
Link: https://www.spinics.net/lists/amd-gfx/msg74116.html

drm/amdgpu: Move reset sem into reset_domain

2022-02-09T17:17:32Z

We want a single instance of the reset sem across all reset clients because, in the case of XGMI, we should stop cross-device MMIO access, since any of the devices could be in reset at that moment.

Signed-off-by: Andrey Grodzovsky
Reviewed-by: Christian König
Link: https://www.spinics.net/lists/amd-gfx/msg74117.html

drm/amdgpu: Rework reset domain to be refcounted.

2022-02-09T17:17:09Z

The reset domain now contains the register access semaphore, so it needs to be present as long as each device in a hive needs it, and therefore it cannot be bound to the XGMI hive life cycle. Address this by making the reset domain refcounted and pointed to by each member of the hive and by the hive itself.

v4: Fix crash on boot with an XGMI hive by adding a type to reset_domain. XGMI will only create a new reset_domain if the previous one was of single device type, meaning it's the first boot. Otherwise it will take a refcount on the existing reset_domain from the amdgpu device. Add a wrapper around reset_domain->refcount get/put and a wrapper around send to reset wq (Lijo)

Signed-off-by: Andrey Grodzovsky
Acked-by: Christian König
Link: https://www.spinics.net/lists/amd-gfx/msg74121.html
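
The lifetime rule described here can be illustrated with a small user-space sketch. It assumes a plain integer counter and hypothetical `sketch_` names throughout; the driver's own types and get/put wrappers are not shown in this log, and kernel code would normally use a proper refcount primitive instead of an `int`.

```c
#include <assert.h>
#include <stdlib.h>

enum sketch_domain_type { SKETCH_SINGLE_DEVICE, SKETCH_XGMI_HIVE };

struct sketch_reset_domain {
	int refcount; /* plain int: single-threaded sketch only */
	enum sketch_domain_type type;
};

struct sketch_reset_domain *sketch_domain_create(enum sketch_domain_type type)
{
	struct sketch_reset_domain *d = malloc(sizeof(*d));

	if (!d)
		return NULL;
	d->refcount = 1; /* the creator holds the first reference */
	d->type = type;
	return d;
}

void sketch_domain_get(struct sketch_reset_domain *d)
{
	assert(d->refcount > 0);
	d->refcount++;
}

/* Drops one reference; returns 1 if it was the last one and the domain was freed. */
int sketch_domain_put(struct sketch_reset_domain *d)
{
	assert(d->refcount > 0);
	if (--d->refcount > 0)
		return 0;
	free(d);
	return 1;
}

/* Hive setup: create a hive-wide domain only if the device still has a
 * single-device one; otherwise share the existing domain. */
struct sketch_reset_domain *sketch_hive_pick_domain(struct sketch_reset_domain *dev_domain)
{
	if (dev_domain->type != SKETCH_XGMI_HIVE)
		return sketch_domain_create(SKETCH_XGMI_HIVE);
	sketch_domain_get(dev_domain);
	return dev_domain;
}

/* Walk-through; returns 1 if the domain outlives the hive and is freed on the last put. */
int sketch_domain_demo(void)
{
	struct sketch_reset_domain *boot = sketch_domain_create(SKETCH_SINGLE_DEVICE);
	struct sketch_reset_domain *hive, *again;
	int ok;

	if (!boot)
		return 0;
	hive = sketch_hive_pick_domain(boot);  /* first boot: new domain, count 1 (hive) */
	if (!hive) {
		sketch_domain_put(boot);
		return 0;
	}
	ok = (hive != boot) && (hive->type == SKETCH_XGMI_HIVE);
	sketch_domain_put(boot);               /* device drops its single-device domain */
	sketch_domain_get(hive);               /* device now points at the hive domain: 2 */
	again = sketch_hive_pick_domain(hive); /* later lookup shares it: 3 */
	ok &= (again == hive);
	ok &= (sketch_domain_put(again) == 0); /* 2 */
	ok &= (sketch_domain_put(hive) == 0);  /* hive torn down: 1, device still holds it */
	ok &= (sketch_domain_put(hive) == 1);  /* device released: freed */
	return ok;
}
```

The point of the walk-through is the second-to-last put: the hive goes away while a device still holds a reference, and the domain stays valid until that device releases it.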

drm/amd/virt: For SRIOV send GPU reset directly to TDR queue.

2022-02-09T17:16:06Z

No need to trigger another work queue inside the work queue.

v3:
Problem: Extra reset caused by host side FLR notification following guest side triggered reset.
Fix: Prevent queuing flr_work from mailbox irq if the guest is already executing a reset.

Suggested-by: Liu Shaoyun
Signed-off-by: Andrey Grodzovsky
Reviewed-by: Liu Shaoyun
Link: https://www.spinics.net/lists/amd-gfx/msg74114.html
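
The v3 fix amounts to a guard in the mailbox interrupt path. A minimal sketch follows, assuming a C11 atomic flag for the "guest is resetting" state and a counter in place of the real work queue; all `sketch_` names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_vf {
	atomic_bool in_reset; /* set while the guest runs its own reset */
	int flr_work_queued;  /* counts how often flr_work was queued */
};

/* Mailbox interrupt path for a host FLR notification.
 * Returns true if flr_work was queued. */
bool sketch_mailbox_flr_irq(struct sketch_vf *vf)
{
	assert(vf != NULL);
	if (atomic_load(&vf->in_reset))
		return false;  /* guest reset already in flight: skip the extra reset */
	vf->flr_work_queued++; /* stands in for queuing on the reset work queue */
	return true;
}

/* Walk-through; returns how many times flr_work was queued. */
int sketch_flr_demo(void)
{
	struct sketch_vf vf = { .flr_work_queued = 0 };

	atomic_init(&vf.in_reset, false);
	sketch_mailbox_flr_irq(&vf);       /* idle guest: queued */
	atomic_store(&vf.in_reset, true);  /* guest starts its own reset */
	sketch_mailbox_flr_irq(&vf);       /* host notification arrives: skipped */
	atomic_store(&vf.in_reset, false); /* guest reset finished */
	sketch_mailbox_flr_irq(&vf);       /* queued again */
	return vf.flr_work_queued;
}
```

Of the three notifications in the walk-through, only the one that arrives during the guest-triggered reset is dropped.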

drm/amdgpu: add dummy event6 for vega10

2022-01-07T22:19:34Z

[why]
A malicious mailbox event1 fails driver loading on vega10. A dummy event6 prevents the driver from taking the response to a malicious event1 as its own.

[how]
On vega10, send a mailbox event6 before sending event1.

Signed-off-by: James Yao
Reviewed-by: Jingwen Chen
Signed-off-by: Alex Deucher

drm/amdgpu: SRIOV flr_work should use down_write

2021-12-14T21:09:02Z

Host initiated VF FLR may fail if someone else is already holding a read_lock. Change from down_write_trylock to down_write to guarantee the reset goes through.

Signed-off-by: Victor Skvortsov
Reviewed-by: Shaoyun.liu
Signed-off-by: Alex Deucher

drm/amd/amdgpu: Add ready_to_reset resp for vega10

2021-08-30T18:59:33Z

Send a response to the host after receiving the FLR notification from the host. Port the NV change to vega10.

Signed-off-by: YuBiao Wang
Reviewed-by: Jingwen Chen
Signed-off-by: Alex Deucher

drm/amdgpu: SRIOV flr_work should take write_lock

2021-07-13T15:48:09Z

[Why]
If flr_work takes read_lock, then other threads that take read_lock can access hardware while the host is doing VF FLR.

[How]
flr_work should take write_lock to avoid this case.

Signed-off-by: Jingwen Chen
Reviewed-by: Monk Liu
Signed-off-by: Alex Deucher