linux/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h, branch v5.10

drm/amdgpu: fix debugfs creation/removal, again

2020-12-09T15:06:39Z

There is still a warning when CONFIG_DEBUG_FS is disabled:

drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1145:13: error: 'amdgpu_ras_debugfs_create_ctrl_node' defined but not used [-Werror=unused-function]
 1145 | static void amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *adev)

Change the code again to make the compiler actually drop this code but not warn about it.

Fixes: ae2bf61ff39e ("drm/amdgpu: guard ras debugfs creation/removal based on CONFIG_DEBUG_FS")
Reviewed-by: Tao Zhou
Signed-off-by: Arnd Bergmann
Signed-off-by: Alex Deucher
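One common way to get "dropped but not warned about" is to keep the helper compiled and guard the call with a compile-time constant instead of an #ifdef. The sketch below illustrates that idiom in plain C; DEMO_DEBUG_FS and the demo_* names are hypothetical stand-ins (for the kernel's IS_ENABLED(CONFIG_DEBUG_FS) and the real functions), and this is not necessarily the exact change made by the commit.

```c
/* Hypothetical stand-in for IS_ENABLED(CONFIG_DEBUG_FS): 0 = off, 1 = on. */
#ifndef DEMO_DEBUG_FS
#define DEMO_DEBUG_FS 0
#endif

/* Counts created nodes so the effect of the guard is observable. */
static int demo_nodes_created;

/* Always parsed and type-checked, even when the feature is off. */
static void demo_debugfs_create_ctrl_node(void)
{
	demo_nodes_created++;
}

/*
 * The call sits behind a compile-time constant. Because the function is
 * still referenced, -Wunused-function stays quiet; because the condition
 * is constant false when the feature is off, the optimizer is free to
 * discard both the branch and the helper.
 */
int demo_ras_late_init(void)
{
	if (DEMO_DEBUG_FS)
		demo_debugfs_create_ctrl_node();
	return demo_nodes_created;
}
```

Building the same file with -DDEMO_DEBUG_FS=1 makes the call live without touching the source, which is the main advantage over wrapping the helper in #ifdef blocks.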

drm/amdgpu: bypass querying ras error count registers

2020-08-14T20:12:22Z

Once ras recovery has been issued by a ras sync flood interrupt or a ras controller interrupt, use this guard to decide whether to bypass or execute the ras error count register harvest of all IPs.

Signed-off-by: Guchun Chen
Reviewed-by: Hawking Zhang
Reviewed-by: Dennis Li
Signed-off-by: Alex Deucher

drm/amdgpu: break GPU recovery once it's in bad state(v4)

2020-08-04T21:26:54Z

When the GPU executes recovery and retrieves a bad GPU tag from the external eeprom device, the recovery is broken off and an error message is printed to make the user aware.

v2: Refine warning message in threshold reaching case, and fix spelling typo.
v3: Fix explicit calling of bad gpu.
v4: Rename function names.

Signed-off-by: Guchun Chen
Reviewed-by: Hawking Zhang
Signed-off-by: Alex Deucher

drm/amdgpu: skip bad page reservation once issuing from eeprom write

2020-08-04T21:26:38Z

Once ras recovery is issued from the eeprom write itself, bad page reservation should be skipped; otherwise, the eeprom write would be called recursively.

Signed-off-by: Guchun Chen
Reviewed-by: Hawking Zhang
Signed-off-by: Alex Deucher
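The recursion described here can be pictured with a small re-entrancy guard. The following is only a sketch under stated assumptions: struct demo_ras_ctx, its in_eeprom_write flag, and all demo_* functions are hypothetical and do not reflect the driver's actual data structures or call chain.

```c
#include <stdbool.h>

/* Hypothetical context; the field names are illustrative only. */
struct demo_ras_ctx {
	bool in_eeprom_write; /* set while an eeprom write is in flight */
	int pages_reserved;   /* how many times reservation ran */
	int eeprom_writes;    /* how many times the eeprom was written */
};

void demo_recover(struct demo_ras_ctx *ctx);

/* An eeprom write that may itself decide to issue a recovery. */
static void demo_eeprom_write(struct demo_ras_ctx *ctx, bool issue_recovery)
{
	ctx->in_eeprom_write = true;
	ctx->eeprom_writes++;
	if (issue_recovery)
		demo_recover(ctx);
	ctx->in_eeprom_write = false;
}

/* Reserving bad pages persists them, which means writing the eeprom. */
static void demo_reserve_bad_pages(struct demo_ras_ctx *ctx)
{
	ctx->pages_reserved++;
	demo_eeprom_write(ctx, true);
}

/* Recovery skips reservation when it was issued from the eeprom write. */
void demo_recover(struct demo_ras_ctx *ctx)
{
	if (ctx->in_eeprom_write)
		return; /* without this check the three functions recurse forever */
	demo_reserve_bad_pages(ctx);
}
```

Calling demo_recover() on a zero-initialized context performs exactly one reservation and one eeprom write; the nested recovery returns early because the flag is set.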

drm/amdgpu: validate bad page threshold in ras(v3)

2020-08-04T21:25:58Z

The bad page threshold value must lie in the range between -1 and the max records length of the eeprom. It is used to determine when the number of saved bad pages exceeds the threshold, so that the corresponding actions can be taken.

v2: When using the default typical value, it should be the min value between the typical value and the eeprom max records length.
v3: Drop the case of setting bad_page_cnt_threshold to 0xFFFFFFFF, as it confuses users.

Signed-off-by: Guchun Chen
Reviewed-by: Hawking Zhang
Signed-off-by: Alex Deucher
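The range check and the v2 rule can be sketched as a small clamp function. This is a minimal illustration, not the driver's code: demo_validate_threshold and its parameters are hypothetical, and treating out-of-range requests the same as -1 (fall back to the default) is an assumption, since the commit message does not say how invalid values are handled.

```c
#include <stdbool.h>
#include <stdint.h>

static uint32_t demo_min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/*
 * requested:   user-supplied threshold; -1 selects the default
 * typical:     the default "typical" threshold value
 * max_records: max records length of the eeprom
 */
uint32_t demo_validate_threshold(int requested, uint32_t typical,
				 uint32_t max_records)
{
	bool in_range = requested >= 0 &&
			(uint32_t)requested <= max_records;

	if (in_range)
		return (uint32_t)requested;

	/*
	 * -1 selects the default, which is capped by the eeprom capacity
	 * (the v2 rule). Assumption: any other out-of-range value is
	 * handled the same way.
	 */
	return demo_min_u32(typical, max_records);
}
```

For example, with typical = 100 and max_records = 50, a request of -1 yields 50, while a request of 10 is returned unchanged.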

drm/amdgpu: RAS emergency restart logic refine

2020-07-15T16:41:47Z

If we are in a RAS triggered situation and BACO isn't supported, an emergency restart is needed. This code is only needed for some specific cases (vega20 with a given smu fw version).

After we add smu mode1 reset for sienna cichlid, we need to share AMD_RESET_METHOD_MODE1 with psp mode1 reset. In amdgpu_device_gpu_recover we would then need to differentiate which mode1 reset we are using, decide whether it is a full reset, and then decide whether an emergency restart is needed, so the logic would become much more complex.

After discussion with Hawking, move the emergency restart logic to an independent function.

Signed-off-by: Likun Gao
Signed-off-by: Wenhui Sheng
Reviewed-by: Hawking Zhang
Signed-off-by: Alex Deucher

drm/amdgpu: disable ras query and iject during gpu reset

2020-04-01T18:44:42Z

Add a flag to the ras context to indicate whether ras query functionality is ready.

Reviewed-by: Hawking Zhang
Signed-off-by: John Clements
Signed-off-by: Alex Deucher

drm/amdgpu: add function to creat all ras debugfs node

2020-03-10T19:55:02Z

Centralize all debugfs creation for ras in one place. This is required to fix ras when the driver does not use the drm load and unload callbacks due to ordering issues with the drm device node.

Signed-off-by: Tao Zhou
Signed-off-by: Stanley.Yang
Reviewed-by: Alex Deucher
Signed-off-by: Alex Deucher

drm/amdgpu: drop useless BACO arg in amdgpu_ras_reset_gpu

2019-12-18T21:09:06Z

The BACO reset mode strategy is determined by a later function when amdgpu_ras_reset_gpu is called. To avoid confusing readers, drop the argument.

Signed-off-by: Guchun Chen
Reviewed-by: Alex Deucher
Signed-off-by: Alex Deucher

drm/amdgpu: clear err_event_athub flag after reset exit

2019-12-05T21:26:11Z

Otherwise the next err_event_athub error cannot trigger a gpu reset. The following resume sequence will also no longer be affected by this flag.

v2: Create a function to clear amdgpu_ras_in_intr for modularity of the ras driver.

Signed-off-by: Le Ma
Reviewed-by: Hawking Zhang
Signed-off-by: Alex Deucher
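Why a stale flag blocks the next reset can be shown with a small userspace sketch using C11 atomics. The demo_* names and the claim-by-compare-exchange scheme are assumptions made for illustration; the kernel uses its own atomic_t API, and only the name amdgpu_ras_in_intr comes from the commit message.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical analogue of amdgpu_ras_in_intr: 0 = idle, 1 = handling. */
static atomic_int demo_ras_in_intr;

/*
 * Returns true only for the caller that moves the flag from 0 to 1,
 * so a single fatal error event starts a single reset.
 */
bool demo_ras_intr_claim(void)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&demo_ras_in_intr,
					      &expected, 1);
}

/* Lets other code paths check whether a fatal error is being handled. */
bool demo_ras_intr_triggered(void)
{
	return atomic_load(&demo_ras_in_intr) != 0;
}

/* Called once the reset has finished; re-arms the flag for the next error. */
void demo_ras_intr_cleared(void)
{
	atomic_store(&demo_ras_in_intr, 0);
}
```

If demo_ras_intr_cleared() is never called, every later demo_ras_intr_claim() fails, which mirrors the situation the commit describes: the next error cannot start a reset until the flag is cleared.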