linux/drivers/gpu/drm/amd/amdkfd, branch master

drm/amdkfd: Check if there are kfd porcesses using adev by kfd_processes_count

2026-05-05T14:18:23Z

During gpu hot-unplug need check if there are kfd porcesses still using the being removed gpu before clean resources of the device. Current driver checks if kfd_processes_table is empty. kfd processes are not terminated after removed from kfd_processes_table immediately. They are still alive and may access the device until kfd_process_wq work queue got ran. Check kfd->kfd_processes_count value that is updated after kfd process got uninitialized when its ref becomes zero. Fixes: 6cca686dfce7 ("drm/amdkfd: kfd driver supports hot unplug/replug amdgpu devices") Signed-off-by: Xiaogang Chen Reviewed-by: Felix Kuehling Signed-off-by: Alex Deucher (cherry picked from commit d12d05c4bc4c15585130af43e897923ff292df7b)

drm/amdkfd: Make all TLB-flushes heavy-weight

2026-05-05T14:13:54Z

With only one sequence number we cannot track the need for legacy vs heavy-weight flushes reliably. Always use heavy-weight. Signed-off-by: Felix Kuehling Reviewed-by: Philip Yang Signed-off-by: Alex Deucher (cherry picked from commit c1a3ff1d327820cd9a52bc1056b98681fc088949) Cc: stable@vger.kernel.org

drm/amdkfd: check if vm ready in svm map and unmap to gpu

2026-04-24T15:12:07Z

Don't map or unmap svm range to gpu if vm is not ready for updates. Why: DRM entity may already be killed when the svm worker try to update gpu vm. Signed-off-by: YuanShang Reviewed-by: Philip Yang Signed-off-by: Alex Deucher (cherry picked from commit 55f8e366c326980174a4f2b9501b524d8eb25135)

drm/amdkfd: validate SVM ioctl nattr against buffer size

2026-04-24T15:11:29Z

Validate nattr field against the buffer size, preventing out-of-bounds buffer access via user-controlled attribute count. Reviewed-by: Amir Shetaia Signed-off-by: Alysa Liu Signed-off-by: Alex Deucher (cherry picked from commit 5eca8bfdfa456c3304ca77523718fe24254c172f) Cc: stable@vger.kernel.org

drm/amdkfd: Add upper bound check for num_of_nodes

2026-04-23T16:54:45Z

drm/amdkfd: Add upper bound check for num_of_nodes in kfd_ioctl_get_process_apertures_new. Reviewed-by: Harish Kasiviswanathan Signed-off-by: Alysa Liu Signed-off-by: Alex Deucher (cherry picked from commit 98ff46a5ea090c14d2cdb4f5b993b05d74f3949f) Cc: stable@vger.kernel.org

amd/amdkfd: add WQ_UNBOUND to alloc_workqueue users

2026-04-03T17:52:33Z

This continues the effort to refactor workqueue APIs, which began with the introduction of new workqueues and a new alloc_workqueue flag in: commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq") commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag") The refactoring is going to alter the default behavior of alloc_workqueue() to be unbound by default. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND must now use WQ_PERCPU. For more details see the Link tag below. This specific workload has no benefit being per-cpu, so its behavior has been changed using explicitly WQ_UNBOUND. Suggested-by: Tejun Heo Signed-off-by: Marco Crivellari Signed-off-by: Alex Deucher

drm/amdkfd: Fix queue preemption/eviction failures by aligning control stack size to GPU page size

2026-03-30T19:16:39Z

The control stack size is calculated based on the number of CUs and waves, and is then aligned to PAGE_SIZE. When the resulting control stack size is aligned to 64 KB, GPU hangs and queue preemption failures are observed while running RCCL unit tests on systems with more than two GPUs. amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for queue with doorbell_id: 80030008 amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues amdgpu 0048:0f:00.0: amdgpu: GPU reset begin!. Source: 4 amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for queue with doorbell_id: 80030008 amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues amdgpu 0048:0f:00.0: amdgpu: Failed to restore process queues This issue is observed on both 4 KB and 64 KB system page-size configurations. This patch fixes the issue by aligning the control stack size to AMDGPU_GPU_PAGE_SIZE instead of PAGE_SIZE, so the control stack size will not be 64 KB on systems with a 64 KB page size and queue preemption works correctly. Additionally, In the current code, wg_data_size is aligned to PAGE_SIZE, which can waste memory if the system page size is large. In this patch, wg_data_size is aligned to AMDGPU_GPU_PAGE_SIZE. The cwsr_size, calculated from wg_data_size and the control stack size, is aligned to PAGE_SIZE. Reviewed-by: Felix Kuehling Signed-off-by: Donet Tom Signed-off-by: Alex Deucher

drm/amdgpu: use TTM_NUM_MOVE_FENCES when reserving fences

2026-03-30T19:16:33Z

Use TTM_NUM_MOVE_FENCES as an upperbound of how many fences ttm might need to deal with moves/evictions. Signed-off-by: Pierre-Eric Pelloux-Prayer Acked-by: Felix Kuehling Reviewed-by: Christian König Signed-off-by: Alex Deucher

drm/amdgpu: allocate move entities dynamically

2026-03-30T19:16:15Z

No functional change for now, as we always allocate a single entity. Signed-off-by: Pierre-Eric Pelloux-Prayer Acked-by: Felix Kuehling Reviewed-by: Christian König Signed-off-by: Alex Deucher

drm/amdkfd: fix kernel crash on releasing NULL sysfs entry

2026-03-30T19:14:51Z

there is an abnormal case that When a process re-opens kfd with different mm_struct(execve() called by user), the allocated p->kobj will be freed, but missed setting it to NULL, that will cause sysfs/kernel crash with NULL pointers in p->kobj on kfd_process_remove_sysfs() when releasing process, and the similar error on kfd_procfs_del_queue() as well. Signed-off-by: Eric Huang Reviewed-by: Kent Russell Signed-off-by: Alex Deucher