linux/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c, branch v6.0

drm/amdgpu: support reset flag set for gpu reset

2022-07-13T15:25:17Z

Move reset_context out of gpu recover function to make it configurable for different reset purpose. For the reset way of call gpu_recovery sysfs, force to use full reset method. Otherwise, try soft reset by default if the related ASIC supportted, if soft reset failed, will use full reset. Signed-off-by: Likun Gao Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher

drm/amdgpu: Follow up change to previous drm scheduler change.

2022-06-28T15:24:41Z

Align refcount behaviour for amdgpu_job embedded HW fence with classic pointer style HW fences by increasing refcount each time emit is called so amdgpu code doesn't need to make workarounds using amdgpu_job.job_run_counter to keep the HW fence refcount balanced. Also since in the previous patch we resumed setting s_fence->parent to NULL in drm_sched_stop switch to directly checking if job->hw_fence is signaled to short circuit reset if already signed. Signed-off-by: Andrey Grodzovsky Tested-by: Yiqing Yao Acked-by: Christian König Signed-off-by: Alex Deucher

drm/amdgpu: To flush tlb for MMHUB of RAVEN series

2022-06-23T21:21:53Z

amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:40 vmid:8 pasid:32769, for process test_basic pid 3305 thread test_basic pid 3305) amdgpu: in page starting at address 0x00007ff990003000 from IH client 0x12 (VMC) amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00840051 amdgpu: Faulty UTCL2 client ID: MP1 (0x0) amdgpu: MORE_FAULTS: 0x1 amdgpu: WALKER_ERROR: 0x0 amdgpu: PERMISSION_FAULTS: 0x5 amdgpu: MAPPING_ERROR: 0x0 amdgpu: RW: 0x1 When memory is allocated by kfd, no one triggers the tlb flush for MMHUB0. There is page fault from MMHUB0. v2:fix indentation v3:change subject and fix indentation Signed-off-by: Ruili Ji Reviewed-by: Philip Yang Reviewed-by: Aaron Liu Acked-by: Alex Deucher Signed-off-by: Alex Deucher

drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover

2022-06-10T19:26:12Z

We removed the wrapper that was queueing the recover function into reset domain queue who was using this name. Signed-off-by: Andrey Grodzovsky Reviewed-by: Christian König Signed-off-by: Alex Deucher

drm/amdgpu: Add work_struct for GPU reset from kfd.

2022-06-10T19:26:07Z

We need to have a work_struct to cancel this reset if another already in progress. Signed-off-by: Andrey Grodzovsky Reviewed-by: Christian König Signed-off-by: Alex Deucher

drm/amdkfd: Add KFD support for soc21 v3

2022-05-04T14:43:54Z

Add initial support for soc21 in KFD compute driver (Mukul) - Add new definition for soc21 device. - Add new file for amdgpu-kfd interface for GFX11 family. - Add new file for queue management, interrupt handling, mqd management for GFX11 family in KFD driver. - Related changes/updates for soc21 device in KFD driver. - Repurpose last 2 entries of SDMA MQD for driver use. v2: Add an optional argument into update queue operation (Mukul) v3: Switch to ip version check, replace kgd_dev with amdgpu_device (Hawking) Signed-off-by: Mukul Joshi Signed-off-by: Hawking Zhang Reviewed-by: Oak Zeng Signed-off-by: Alex Deucher

drm/amdkfd: Fix NULL pointer dereference

2022-04-07T20:37:39Z

Check that adev->gfx.ras is valid before using it. Fixes: 6475ae2b742876 ("drm/amdgpu: add UTCL2 RAS poison query for Aldebaran (v2)") CC: Tao Zhou Signed-off-by: Felix Kuehling Reviewed-by: Mukul Joshi Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher

drm/amdgpu: add UTCL2 RAS poison query for Aldebaran (v2)

2022-03-25T16:40:26Z

Add help functions to query and reset RAS UTCL2 poison status. v2: implement it on amdgpu side and kfd only calls it. Signed-off-by: Tao Zhou Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher

drm/amdkfd: remove unused function

2022-01-11T20:44:26Z

Remove unused amdgpu_amdkfd_get_vram_usage() CC: Felix.Kuehling@amd.com Signed-off-by: Nirmoy Das Reviewed-by: Christian König Signed-off-by: Alex Deucher Fixes: dfcbe6d5f4a340 ("drm/amdgpu: Remove unused function pointers")

drm/amdgpu: save error count in RAS poison handler

2021-12-30T13:54:45Z

Otherwise the RAS error count couldn't be queried from sysfs. Signed-off-by: Tao Zhou Reviewed-by: Stanley.Yang Signed-off-by: Alex Deucher