linux/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h, branch v6.17

drm/amdgpu: Update ta ras block

2025-03-26T21:44:34Z

Update the TA RAS block to keep it in sync with the RAS TA. Signed-off-by: Stanley.Yang Reviewed-by: Tao Zhou Signed-off-by: Alex Deucher

drm/amdgpu: Report generic instead of unknown boot time errors

2025-02-27T21:50:03Z

Change the DMESG reporting of unknown errors to "Boot Controller Generic Error" to align with the RAS SPEC and provide more clarity to customers. Signed-off-by: Xiang Liu Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher

drm/amdgpu: Update usage for bad page threshold

2025-02-13T02:02:59Z

The driver's behavior varies with the value of the amdgpu_bad_page_threshold setting. Signed-off-by: Hawking Zhang Reviewed-by: Tao Zhou Signed-off-by: Alex Deucher

drm/amdgpu: parse legacy RAS bad page mixed with new data in various NPS modes

2024-12-10T15:26:48Z

All legacy RAS bad pages were generated in NPS1 mode, but new bad pages can be generated in any NPS mode. In non-NPS1 mode we therefore cannot use the retired_page value stored in the EEPROM directly, even for legacy data. Different data needs different handling; new data can be distinguished from old data by the UMC_CHANNEL_IDX_V2 flag. Signed-off-by: Tao Zhou Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher
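The flag check described above can be sketched as follows. This is a minimal illustration, not the driver's code: the struct, the helper names, and the bit position chosen for the flag are all assumptions, since the log gives only the flag's name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit position only; the real UMC_CHANNEL_IDX_V2 value is not given in the log. */
#define EXAMPLE_CHANNEL_IDX_V2 (1u << 15)

/* Hypothetical stand-in for a bad page record read back from the EEPROM. */
struct example_bad_page_record {
	uint64_t retired_page; /* page number stored with the record */
	uint32_t channel;      /* channel index, with the V2 flag OR-ed in for new data */
};

/* New-format records carry the V2 flag; legacy records do not. */
static bool example_record_is_new(const struct example_bad_page_record *rec)
{
	return (rec->channel & EXAMPLE_CHANNEL_IDX_V2) != 0;
}

/* Strip the flag to recover the plain channel index. */
static uint32_t example_record_channel(const struct example_bad_page_record *rec)
{
	return rec->channel & ~EXAMPLE_CHANNEL_IDX_V2;
}

/*
 * True only for a legacy record read in NPS1 mode: legacy records were
 * generated in NPS1, so their retired_page is not directly usable elsewhere.
 */
static bool example_legacy_page_usable_directly(const struct example_bad_page_record *rec,
						int nps_mode)
{
	return !example_record_is_new(rec) && nps_mode == 1;
}
```

A caller would branch on example_record_is_new() first and fall back to recomputing the page for the current NPS mode whenever the direct path is not available.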

drm/amdgpu: save UMC global channel index to eeprom

2024-12-10T15:26:46Z

Save the global channel index returned by the RAS TA to the EEPROM. The physical memory address can then be derived from the MCA address and the channel index. Signed-off-by: Tao Zhou Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher

drm/amdgpu: Prefer RAS recovery for scheduler hang

2024-12-10T15:26:46Z

Before scheduling a recovery due to a scheduler/job hang, check whether a RAS error has been detected. If so, choose RAS recovery to handle the situation. A scheduler/job hang could be a side effect of a RAS error. In such cases, it is required to go through the RAS error recovery process. In certain cases a RAS error recovery process could also avoid a full device reset. An error state is maintained in the RAS context to detect the affected block. The fatal error state uses an unused block id. Set the block id when an error is detected. If the interrupt handler detected a poison error, it is not required to look for a fatal error. Skip fatal error checking in such cases. Signed-off-by: Lijo Lazar Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher
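The decision described in this commit can be sketched as a bitmask of affected blocks that is consulted before a hang recovery is scheduled. All names and the block count below are hypothetical; only the idea of a per-block error state with an unused block id reserved for fatal errors comes from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical block count; the spare id just past it marks a fatal error. */
#define EXAMPLE_RAS_BLOCK_COUNT 20u
#define EXAMPLE_RAS_FATAL_ERROR_ID EXAMPLE_RAS_BLOCK_COUNT

enum example_recovery {
	EXAMPLE_RECOVERY_SCHED_HANG, /* ordinary scheduler/job hang recovery */
	EXAMPLE_RECOVERY_RAS,        /* RAS error recovery path */
};

/* One bit per block id; non-zero means a RAS error is pending. */
struct example_ras_err_state {
	uint64_t mask;
};

/* Record the block id when an error is detected (ids must be below 64). */
static void example_ras_set_err(struct example_ras_err_state *s, unsigned int block_id)
{
	s->mask |= 1ULL << block_id;
}

/* A hang may be a side effect of a RAS error, so a pending error wins. */
static enum example_recovery example_choose_recovery(const struct example_ras_err_state *s)
{
	return s->mask ? EXAMPLE_RECOVERY_RAS : EXAMPLE_RECOVERY_SCHED_HANG;
}

/* A poison error found by the interrupt handler makes the fatal check unnecessary. */
static bool example_need_fatal_check(bool poison_detected)
{
	return !poison_detected;
}
```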

drm/amdgpu: Implement virt req_ras_err_count

2024-11-11T16:55:42Z

Enable RAS late init if VF RAS Telemetry is supported. When enabled, the VF can use this interface to query total RAS error counts from the host. The VF FB access may end abruptly due to a fatal error, therefore the VF must cache and sanitize the input. The host allows 15 telemetry messages every 60 seconds, after which it ignores any further incoming telemetry messages. The VF rate limits its messages to once every 5 seconds (12 times in 60 seconds). While the VF is rate limited, it continues to report the last good cached data. v2: Flip generate report & update statistics order for VF Signed-off-by: Victor Skvortsov Acked-by: Tao Zhou Reviewed-by: Zhigang Luo Signed-off-by: Alex Deucher
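The rate limiting and caching described above can be sketched as below. This is an illustration under stated assumptions rather than the driver's implementation: the struct and function names are hypothetical, time is passed in as whole seconds, and input sanitizing is left out. Only the 5 second interval and the "report the last good cached data" behavior come from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* From the commit message: one query per 5 s, i.e. 12 per 60 s, under the host's limit of 15. */
#define EXAMPLE_TELEMETRY_INTERVAL_SEC 5u

struct example_ras_counts {
	uint64_t ce; /* correctable error count */
	uint64_t ue; /* uncorrectable error count */
};

struct example_telemetry_cache {
	struct example_ras_counts last_good; /* reported while rate limited */
	uint64_t last_query_sec;             /* time of the most recent host query */
	bool queried;                        /* false until the first query is sent */
};

/* True if enough time has passed to send another telemetry message. */
static bool example_telemetry_should_query(const struct example_telemetry_cache *c,
					   uint64_t now_sec)
{
	return !c->queried ||
	       now_sec - c->last_query_sec >= EXAMPLE_TELEMETRY_INTERVAL_SEC;
}

/*
 * Record that a query was sent at now_sec. Pass fresh == NULL when the
 * query failed, so that the previous good snapshot is kept.
 */
static void example_telemetry_update(struct example_telemetry_cache *c, uint64_t now_sec,
				     const struct example_ras_counts *fresh)
{
	c->queried = true;
	c->last_query_sec = now_sec;
	if (fresh)
		c->last_good = *fresh;
}
```

A caller checks example_telemetry_should_query() first, sends the message only when it returns true, and reads last_good in every case.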

drm/amdgpu: Add helper to initialize badpage info

2024-09-26T21:06:38Z

Add a separate function to read bad page data during initialization. Reading bad pages needs hardware access and cannot be done during reset. Hence, in cases where the device needs a full reset during init itself, attempting to read will cause a deadlock. Signed-off-by: Lijo Lazar Reviewed-by: Feifei Xu Reviewed-by: Alex Deucher Acked-by: Rajneesh Bhardwaj Tested-by: Rajneesh Bhardwaj Signed-off-by: Alex Deucher

drm/amdgpu: remove RAS unused parameter 'err_addr'

2024-08-06T15:11:01Z

- amdgpu_ras_error_statistic_ue_count()
- amdgpu_ras_error_statistic_ce_count()
- amdgpu_ras_error_statistic_de_count()
The parameter 'err_addr' is no longer used since the following patch. Fixes: a7e8467fbeee ("drm/amdgpu: Remove unused code") Signed-off-by: Yang Wang Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher

drm/amdgpu: create function to check RAS RMA status

2024-08-06T15:11:01Z

For the convenience of calling it globally. Signed-off-by: Tao Zhou Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher