linux/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c, branch v6.14

drm/admgpu: replace kmalloc() and memcpy() with kmemdup()

2024-12-18T17:39:08Z

The static analyser tool gave the following advice: ./drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c:1266:7-14: WARNING opportunity for kmemdup → 1266 tmp = kmalloc(used_size, GFP_KERNEL); 1267 if (!tmp) 1268 return -ENOMEM; 1269 → 1270 memcpy(tmp, &host_telemetry->body.error_count, used_size); Replacing kmalloc() + memcpy() with kmemdump() doesn't change semantics. Original code works without fault, so this is not a bug fix but proposed improvement. Link: https://lwn.net/Articles/198928/ Fixes: 84a2947ecc85 ("drm/amdgpu: Implement virt req_ras_err_count") Cc: Alex Deucher Cc: "Christian König" Cc: Xinhui Pan Cc: David Airlie Cc: Simona Vetter Cc: Zhigang Luo Cc: Victor Skvortsov Cc: Hawking Zhang Cc: Lijo Lazar Cc: Yunxiang Li Cc: Jack Xiao Cc: Vignesh Chander Cc: Danijel Slivka Cc: amd-gfx@lists.freedesktop.org Cc: dri-devel@lists.freedesktop.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Lijo Lazar Signed-off-by: Mirsad Todorovac Signed-off-by: Alex Deucher

drm/amdgpu: Implement virt req_ras_err_count

2024-11-11T16:55:42Z

Enable RAS late init if VF RAS Telemetry is supported. When enabled, the VF can use this interface to query total RAS error counts from the host. The VF FB access may abruptly end due to a fatal error, therefore the VF must cache and sanitize the input. The Host allows 15 Telemetry messages every 60 seconds, afterwhich the host will ignore any more in-coming telemetry messages. The VF will rate limit its msg calling to once every 5 seconds (12 times in 60 seconds). While the VF is rate limited, it will continue to report the last good cached data. v2: Flip generate report & update statistics order for VF Signed-off-by: Victor Skvortsov Acked-by: Tao Zhou Reviewed-by: Zhigang Luo Signed-off-by: Alex Deucher

drm/amdgpu: VF Query RAS Caps from Host if supported

2024-11-11T16:55:36Z

If VF RAS Capability support is enabled, guest is able to retrieve the real RAS support from the host. Signed-off-by: Victor Skvortsov Reviewed-by: Zhigang Luo Signed-off-by: Alex Deucher

drm/amdgpu: Disable dpm_enabled flag while VF is in reset

2024-08-13T16:12:52Z

VFs do not perform HW fini/suspend in FLR, so the dpm_enabled is incorrectly kept enabled. Add interface to disable it in virt_pre_reset call. v2: Made implementation generic for all asics v3: Re-order conditionals so PP_MP1_STATE_FLR is only evaluated on VF Signed-off-by: Victor Skvortsov Reviewed-by: Lijo Lazar Signed-off-by: Alex Deucher

drm/amdgpu/mes: add multiple mes ring instances support

2024-08-13T14:29:25Z

Add multiple mes ring instances in mes structure to support multiple mes pipes. Signed-off-by: Jack Xiao Acked-by: Alex Deucher Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher

drm/amdgpu: Set no_hw_access when VF request full GPU fails

2024-07-08T20:46:56Z

[Why] If VF request full GPU access and the request failed, the VF driver can get stuck accessing registers for an extended period during the unload of KMS. [How] Set no_hw_access flag when VF request for full GPU access fails This prevents further hardware access attempts, avoiding the prolonged stuck state. Signed-off-by: Yifan Zha Acked-by: Alex Deucher Signed-off-by: Alex Deucher

drm/amdgpu: process RAS fatal error MB notification

2024-06-27T21:31:37Z

For RAS error scenario, VF guest driver will check mailbox and set fed flag to avoid unnecessary HW accesses. additionally, poll for reset completion message first to avoid accidentally spamming multiple reset requests to host. v2: add another mailbox check for handling case where kfd detects timeout first v3: set host_flr bit and use wait_for_reset Signed-off-by: Vignesh Chander Reviewed-by: Zhigang Luo Signed-off-by: Alex Deucher

drm/amdgpu: fix sriov host flr handler

2024-06-14T20:15:58Z

We send back the ready to reset message before we stop anything. This is wrong. Move it to when we are actually ready for the FLR to happen. In the current state since we take tens of seconds to stop everything, it is very likely that host would give up waiting and reset the GPU before we send ready, so it would be the same as before. But this gets rid of the hack with reset_domain locking and also let us tell how slow ready to reset actually is from the host. The ready to reset speed can be improved later. Signed-off-by: Yunxiang Li Acked-by: Christian König Reviewed-by: Emily Deng Signed-off-by: Alex Deucher

drm/amdgpu: add skip_hw_access checks for sriov

2024-06-14T20:15:58Z

Accessing registers via host is missing the check for skip_hw_access and the lockdep check that comes with it. Signed-off-by: Yunxiang Li Reviewed-by: Christian König Signed-off-by: Alex Deucher

drm/amdgpu: fix failure mapping legacy queue when FLR

2024-06-05T15:25:14Z

Flag "mes.ring.shced.ready" will be set as true after mes hw init and set as false when mes hw fini to avoid duplicate initialization. But hw fini will not be called when function level reset, which will cause mes hw init be skipped during FLR, which will leads to mapping legacy queue fail. Set this flag as false when post reset will fix this issue. Signed-off-by: Lin.Cao Acked-by: Alex Deucher Signed-off-by: Alex Deucher