<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/drivers/gpu/drm/amd/amdkfd, branch master</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=master</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=master'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2026-05-05T14:18:23Z</updated>
<entry>
<title>drm/amdkfd: Check if there are kfd porcesses using adev by kfd_processes_count</title>
<updated>2026-05-05T14:18:23Z</updated>
<author>
<name>Xiaogang Chen</name>
<email>xiaogang.chen@amd.com</email>
</author>
<published>2026-04-24T18:47:01Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=81665e35f143d93adef654f3be1360def9196e72'/>
<id>urn:sha1:81665e35f143d93adef654f3be1360def9196e72</id>
<content type='text'>
During gpu hot-unplug need check if there are kfd porcesses still using the
being removed gpu before clean resources of the device. Current driver checks
if kfd_processes_table is empty. kfd processes are not terminated after
removed from kfd_processes_table immediately. They are still alive and may
access the device until kfd_process_wq work queue got ran.

Check kfd-&gt;kfd_processes_count value that is updated after kfd process got
uninitialized when its ref becomes zero.

Fixes: 6cca686dfce7 ("drm/amdkfd: kfd driver supports hot unplug/replug amdgpu devices")
Signed-off-by: Xiaogang Chen &lt;xiaogang.chen@amd.com&gt;
Reviewed-by: Felix Kuehling &lt;felix.kuehling@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
(cherry picked from commit d12d05c4bc4c15585130af43e897923ff292df7b)
</content>
</entry>
<entry>
<title>drm/amdkfd: Make all TLB-flushes heavy-weight</title>
<updated>2026-05-05T14:13:54Z</updated>
<author>
<name>Felix Kuehling</name>
<email>felix.kuehling@amd.com</email>
</author>
<published>2026-04-20T15:55:57Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=9b4e3495d1bd2469bf94b74930c153c2d534ddb7'/>
<id>urn:sha1:9b4e3495d1bd2469bf94b74930c153c2d534ddb7</id>
<content type='text'>
With only one sequence number we cannot track the need for legacy vs
heavy-weight flushes reliably. Always use heavy-weight.

Signed-off-by: Felix Kuehling &lt;felix.kuehling@amd.com&gt;
Reviewed-by: Philip Yang &lt;philip.yang@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
(cherry picked from commit c1a3ff1d327820cd9a52bc1056b98681fc088949)
Cc: stable@vger.kernel.org
</content>
</entry>
<entry>
<title>drm/amdkfd: check if vm ready in svm map and unmap to gpu</title>
<updated>2026-04-24T15:12:07Z</updated>
<author>
<name>YuanShang</name>
<email>YuanShang.Mao@amd.com</email>
</author>
<published>2026-03-26T10:27:30Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=d0f5711fa14a09c010537375cf34893cd33bc2ee'/>
<id>urn:sha1:d0f5711fa14a09c010537375cf34893cd33bc2ee</id>
<content type='text'>
Don't map or unmap svm range to gpu if vm is not ready for updates.

Why: DRM entity may already be killed when the svm worker try to
update gpu vm.

Signed-off-by: YuanShang &lt;YuanShang.Mao@amd.com&gt;
Reviewed-by: Philip Yang &lt;philip.yang@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
(cherry picked from commit 55f8e366c326980174a4f2b9501b524d8eb25135)
</content>
</entry>
<entry>
<title>drm/amdkfd: validate SVM ioctl nattr against buffer size</title>
<updated>2026-04-24T15:11:29Z</updated>
<author>
<name>Alysa Liu</name>
<email>Alysa.Liu@amd.com</email>
</author>
<published>2026-04-21T14:18:28Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=045e0ff208f0838a246c10204105126611b267a1'/>
<id>urn:sha1:045e0ff208f0838a246c10204105126611b267a1</id>
<content type='text'>
Validate nattr field against the buffer size, preventing
out-of-bounds buffer access via user-controlled attribute count.

Reviewed-by: Amir Shetaia &lt;Amir.Shetaia@amd.com&gt;
Signed-off-by: Alysa Liu &lt;Alysa.Liu@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
(cherry picked from commit 5eca8bfdfa456c3304ca77523718fe24254c172f)
Cc: stable@vger.kernel.org
</content>
</entry>
<entry>
<title>drm/amdkfd: Add upper bound check for num_of_nodes</title>
<updated>2026-04-23T16:54:45Z</updated>
<author>
<name>Alysa Liu</name>
<email>Alysa.Liu@amd.com</email>
</author>
<published>2026-03-30T14:50:07Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=74b73fa56a395d46745e4f245225963e9f8be7f1'/>
<id>urn:sha1:74b73fa56a395d46745e4f245225963e9f8be7f1</id>
<content type='text'>
drm/amdkfd: Add upper bound check for num_of_nodes
in kfd_ioctl_get_process_apertures_new.

Reviewed-by: Harish Kasiviswanathan &lt;Harish.Kasiviswanathan@amd.com&gt;
Signed-off-by: Alysa Liu &lt;Alysa.Liu@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
(cherry picked from commit 98ff46a5ea090c14d2cdb4f5b993b05d74f3949f)
Cc: stable@vger.kernel.org
</content>
</entry>
<entry>
<title>amd/amdkfd: add WQ_UNBOUND to alloc_workqueue users</title>
<updated>2026-04-03T17:52:33Z</updated>
<author>
<name>Marco Crivellari</name>
<email>marco.crivellari@suse.com</email>
</author>
<published>2025-12-24T14:47:07Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=505b1c7342aed9cc3c35c8043fe6bc5ccbeb6b0b'/>
<id>urn:sha1:505b1c7342aed9cc3c35c8043fe6bc5ccbeb6b0b</id>
<content type='text'>
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:

   commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
   commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")

The refactoring is going to alter the default behavior of
alloc_workqueue() to be unbound by default.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU. For more details see the Link tag below.

This specific workload has no benefit being per-cpu, so its behavior has
been changed using explicitly WQ_UNBOUND.

Suggested-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Marco Crivellari &lt;marco.crivellari@suse.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</content>
</entry>
<entry>
<title>drm/amdkfd: Fix queue preemption/eviction failures by aligning control stack size to GPU page size</title>
<updated>2026-03-30T19:16:39Z</updated>
<author>
<name>Donet Tom</name>
<email>donettom@linux.ibm.com</email>
</author>
<published>2026-03-23T04:28:39Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=a3e14436304392fbada359edd0f1d1659850c9b7'/>
<id>urn:sha1:a3e14436304392fbada359edd0f1d1659850c9b7</id>
<content type='text'>
The control stack size is calculated based on the number of CUs and
waves, and is then aligned to PAGE_SIZE. When the resulting control
stack size is aligned to 64 KB, GPU hangs and queue preemption
failures are observed while running RCCL unit tests on systems with
more than two GPUs.

amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for queue with
doorbell_id: 80030008
amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues
amdgpu 0048:0f:00.0: amdgpu: GPU reset begin!. Source: 4
amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for queue with
doorbell_id: 80030008
amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues
amdgpu 0048:0f:00.0: amdgpu: Failed to restore process queues

This issue is observed on both 4 KB and 64 KB system page-size
configurations.

This patch fixes the issue by aligning the control stack size to
AMDGPU_GPU_PAGE_SIZE instead of PAGE_SIZE, so the control stack size
will not be 64 KB on systems with a 64 KB page size and queue
preemption works correctly.

Additionally, In the current code, wg_data_size is aligned to PAGE_SIZE,
which can waste memory if the system page size is large. In this patch,
wg_data_size is aligned to AMDGPU_GPU_PAGE_SIZE. The cwsr_size, calculated
from wg_data_size and the control stack size, is aligned to PAGE_SIZE.

Reviewed-by: Felix Kuehling &lt;felix.kuehling@amd.com&gt;
Signed-off-by: Donet Tom &lt;donettom@linux.ibm.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</content>
</entry>
<entry>
<title>drm/amdgpu: use TTM_NUM_MOVE_FENCES when reserving fences</title>
<updated>2026-03-30T19:16:33Z</updated>
<author>
<name>Pierre-Eric Pelloux-Prayer</name>
<email>pierre-eric.pelloux-prayer@amd.com</email>
</author>
<published>2026-02-03T10:22:11Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=6ab4054fda924dabc88e78b13c76043d55d257e0'/>
<id>urn:sha1:6ab4054fda924dabc88e78b13c76043d55d257e0</id>
<content type='text'>
Use TTM_NUM_MOVE_FENCES as an upperbound of how many fences
ttm might need to deal with moves/evictions.

Signed-off-by: Pierre-Eric Pelloux-Prayer &lt;pierre-eric.pelloux-prayer@amd.com&gt;
Acked-by: Felix Kuehling &lt;felix.kuehling@amd.com&gt;
Reviewed-by: Christian König &lt;christian.koenig@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</content>
</entry>
<entry>
<title>drm/amdgpu: allocate move entities dynamically</title>
<updated>2026-03-30T19:16:15Z</updated>
<author>
<name>Pierre-Eric Pelloux-Prayer</name>
<email>pierre-eric.pelloux-prayer@amd.com</email>
</author>
<published>2026-02-03T10:22:09Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=ab5dd4dcc57effc8ab8b4185740a0e3441754f13'/>
<id>urn:sha1:ab5dd4dcc57effc8ab8b4185740a0e3441754f13</id>
<content type='text'>
No functional change for now, as we always allocate a single entity.

Signed-off-by: Pierre-Eric Pelloux-Prayer &lt;pierre-eric.pelloux-prayer@amd.com&gt;
Acked-by: Felix Kuehling &lt;felix.kuehling@amd.com&gt;
Reviewed-by: Christian König &lt;christian.koenig@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</content>
</entry>
<entry>
<title>drm/amdkfd: fix kernel crash on releasing NULL sysfs entry</title>
<updated>2026-03-30T19:14:51Z</updated>
<author>
<name>Eric Huang</name>
<email>jinhuieric.huang@amd.com</email>
</author>
<published>2026-03-27T13:46:42Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=4ea64d482fc2cc85009fce5abdf4780ece00c31c'/>
<id>urn:sha1:4ea64d482fc2cc85009fce5abdf4780ece00c31c</id>
<content type='text'>
there is an abnormal case that When a process re-opens kfd
with different mm_struct(execve() called by user), the
allocated p-&gt;kobj will be freed, but missed setting it to NULL,
that will cause sysfs/kernel crash with NULL pointers in p-&gt;kobj
on kfd_process_remove_sysfs() when releasing process, and the
similar error on kfd_procfs_del_queue() as well.

Signed-off-by: Eric Huang &lt;jinhuieric.huang@amd.com&gt;
Reviewed-by: Kent Russell &lt;kent.russell@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</content>
</entry>
</feed>
