linux/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h, branch v5.14

drm/amdgpu: Return error if no RAS

2021-07-13T15:48:10Z

In amdgpu_ras_query_error_count() return an error if the device doesn't support RAS. This prevents that function from having to always set the values of the integer pointers (if set), and thus prevents function side effects--always to have to set values of integers if integer pointers set, regardless of whether RAS is supported or not--with this change this side effect is mitigated. Also, if no pointers are set, don't count, since we've no way of reporting the counts. Also, give this function a kernel-doc. Cc: Alexander Deucher Cc: John Clements Cc: Hawking Zhang Reported-by: Tom Rix Fixes: a46751fbcde505 ("drm/amdgpu: Fix RAS function interface") Signed-off-by: Luben Tuikov Reviewed-by: Alexander Deucher Signed-off-by: Alex Deucher

drm/amdgpu: Use delayed work to collect RAS error counters

2021-05-27T16:23:06Z

On Context Query2 IOCTL return the correctable and uncorrectable errors in O(1) fashion, from cached values, and schedule a delayed work function to calculate and cache them for the next such IOCTL. v2: Cancel pending delayed work at ras_fini(). v3: Remove conditionals when dealing with delayed work manipulation as they're inherently racy. Cc: Alexander Deucher Cc: Christian König Cc: John Clements Cc: Hawking Zhang Signed-off-by: Luben Tuikov Reviewed-by: Alexander Deucher Reviewed-by: Christian König Signed-off-by: Alex Deucher

drm/amdgpu: Fix RAS function interface

2021-05-27T16:22:54Z

The correctable and uncorrectable errors are calculated at each invocation of this function. Therefore, it is highly inefficient to return just one of them based on a Boolean input. If the caller wants both, twice the work would be done. (And this work is O(n^3) on Vega20.) Fix this "interface" to simply return what it had calculated--both values. Let the caller choose what it wants to record, inspect, use. Cc: Alexander Deucher Cc: John Clements Cc: Hawking Zhang Signed-off-by: Luben Tuikov Reviewed-by: Alexander Deucher Signed-off-by: Alex Deucher

drm/amdgpu: Conditionally reset RAS counters on boot

2021-05-20T02:38:11Z

Only clear RAS error counters if perestent EDC harvesting is not supported Reviewed-by: Hawking Zhang Signed-off-by: John Clements Signed-off-by: Alex Deucher

drm/amdgpu: Rename to ras_*_enabled

2021-05-10T22:08:12Z

Rename, ras_hw_supported --> ras_hw_enabled, and ras_features --> ras_enabled, to show that ras_enabled is a subset of ras_hw_enabled, which itself is a subset of the ASIC capability. Cc: Alexander Deucher Cc: John Clements Cc: Hawking Zhang Signed-off-by: Luben Tuikov Acked-by: Christian König Reviewed-by: John Clements Signed-off-by: Alex Deucher

drm/amdgpu: Move up ras_hw_supported

2021-05-10T22:08:06Z

Move ras_hw_supported into struct amdgpu_dev. The dependency is: struct amdgpu_ras <== struct amdgpu_dev <== ASIC, read as "struct amdgpu_ras depends on struct amdgpu_dev, which depends on the hardware." This can be loosely understood as, "if RAS is supported, which is property of the ASIC (struct amdgpu_dev), then we can access struct amdgpu_ras." v2: Fix a typo: must binary AND in ternary cond in amdgpu_ras.c Cc: Alexander Deucher Cc: John Clements Cc: Hawking Zhang Signed-off-by: Luben Tuikov Acked-by: Christian König Reviewed-by: John Clements Signed-off-by: Alex Deucher

drm/amdgpu: Remove redundant ras->supported

2021-05-10T22:07:56Z

Remove redundant ras->supported, as this value is also stored in adev->ras_features. Use adev->ras_features, as that supercedes "ras", since the latter is its member. The dependency goes like this: ras <== adev->ras_features <== hw_supported, and is read as "ras depends on ras_features, which depends on hw_supported." The arrows show the flow of information, i.e. the dependency update. "hw_supported" should also live in "adev". Cc: Alexander Deucher Cc: John Clements Cc: Hawking Zhang Signed-off-by: Luben Tuikov Acked-by: Christian König Reviewed-by: John Clements Signed-off-by: Alex Deucher

drm/amdgpu: fix send ras disable cmd when asic not support ras

2021-03-24T03:30:12Z

cause: It is necessary to send ras disable command to ras-ta during gfx block ras later init, because the ras capability is disable read from vbios for vega20 gaming, but the ras context is released during ras init process, this will cause send ras disable command to ras-to failed. how: Delay releasing ras context, the ras context will be released after gfx block later init done. Changed from V1: move release_ras_context into ras_resume Changed from V2: check BIT(UMC) is more reasonable before access eeprom table Signed-off-by: Stanley.Yang Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher

drm/amdgpu: harvest edc status when connected to host via xGMI

2021-03-24T03:00:41Z

When connected to a host via xGMI, system fatal errors may trigger warm reset, driver has no change to query edc status before reset. Therefore in this case, driver should harvest previous error loging registers during boot, instead of only resetting them. v2: 1. IP's ras_manager object is created when its ras feature is enabled, so change to query edc status after amdgpu_ras_late_init called 2. change to enable watchdog timer after finishing gfx edc init Signed-off-by: Dennis Li Reivewed-by: Hawking Zhang Signed-off-by: Alex Deucher

drm/amdgpu: remove unnecessary reading for epprom header

2021-02-26T22:23:49Z

If the number of badpage records exceed the threshold, driver has updated both epprom header and control->tbl_hdr.header before gpu reset, therefore GPU recovery thread no need to read epprom header directly. v2: merge amdgpu_ras_check_err_threshold into amdgpu_ras_eeprom_check_err_threshold Signed-off-by: Dennis Li Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher