<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h, branch v6.17</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
</subtitle>
<id>https://git.shady.money/linux/atom?h=v6.17</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=v6.17'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2025-03-26T21:44:34Z</updated>
<entry>
<title>drm/amdgpu: Update ta ras block</title>
<updated>2025-03-26T21:44:34Z</updated>
<author>
<name>Stanley.Yang</name>
<email>Stanley.Yang@amd.com</email>
</author>
<published>2025-03-25T03:10:43Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=cc11dffc14bd2322d92a896a639cc42145401515'/>
<id>urn:sha1:cc11dffc14bd2322d92a896a639cc42145401515</id>
<content type='text'>
Update ta ra block to keep sync with RAS TA.

Signed-off-by: Stanley.Yang &lt;Stanley.Yang@amd.com&gt;
Reviewed-by: Tao Zhou &lt;tao.zhou1@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</content>
</entry>
<entry>
<title>drm/amdgpu: Report generic instead of unknown boot time errors</title>
<updated>2025-02-27T21:50:03Z</updated>
<author>
<name>Xiang Liu</name>
<email>xiang.liu@amd.com</email>
</author>
<published>2025-02-26T06:27:27Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=d4bd7a50ca7c6199438cf19063464b4d6327a6c1'/>
<id>urn:sha1:d4bd7a50ca7c6199438cf19063464b4d6327a6c1</id>
<content type='text'>
Change the DMESG reporting of unknown errors to "Boot Controller
Generic Error" to align with the RAS SPEC and provide more clarity
to customers.

Signed-off-by: Xiang Liu &lt;xiang.liu@amd.com&gt;
Reviewed-by: Hawking Zhang &lt;Hawking.Zhang@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</content>
</entry>
<entry>
<title>drm/amdgpu: Update usage for bad page threshold</title>
<updated>2025-02-13T02:02:59Z</updated>
<author>
<name>Hawking Zhang</name>
<email>Hawking.Zhang@amd.com</email>
</author>
<published>2025-01-22T11:34:33Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=16b85a0942c0b0f1611bcaa42cc98f020e34b1cf'/>
<id>urn:sha1:16b85a0942c0b0f1611bcaa42cc98f020e34b1cf</id>
<content type='text'>
The driver's behavior varies based on
the configuration of amdgpu_bad_page_threshold setting

Signed-off-by: Hawking Zhang &lt;Hawking.Zhang@amd.com&gt;
Reviewed-by: Tao Zhou &lt;tao.zhou1@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</content>
</entry>
<entry>
<title>drm/amdgpu: parse legacy RAS bad page mixed with new data in various NPS modes</title>
<updated>2024-12-10T15:26:48Z</updated>
<author>
<name>Tao Zhou</name>
<email>tao.zhou1@amd.com</email>
</author>
<published>2024-10-31T07:48:10Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=a8d133e625ceb147a173b6cafc862a9bd4312894'/>
<id>urn:sha1:a8d133e625ceb147a173b6cafc862a9bd4312894</id>
<content type='text'>
All legacy RAS bad pages are generated in NPS1 mode, but new bad page
can be generated in any NPS mode, so we can't use retired_page stored
on eeprom directly in non-nps1 mode even for legacy data. We need to
take different actions for different data, new data can be identified
from old data by UMC_CHANNEL_IDX_V2 flag.

Signed-off-by: Tao Zhou &lt;tao.zhou1@amd.com&gt;
Reviewed-by: Hawking Zhang &lt;Hawking.Zhang@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</content>
</entry>
<entry>
<title>drm/amdgpu: save UMC global channel index to eeprom</title>
<updated>2024-12-10T15:26:46Z</updated>
<author>
<name>Tao Zhou</name>
<email>tao.zhou1@amd.com</email>
</author>
<published>2024-10-29T11:46:44Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=71a0e9630027f77d7646c5b750593c9ecfaa27d3'/>
<id>urn:sha1:71a0e9630027f77d7646c5b750593c9ecfaa27d3</id>
<content type='text'>
Save the global channel index returned by RAS TA to eeprom.
We can get memory physical address by MCA address and channel index.

Signed-off-by: Tao Zhou &lt;tao.zhou1@amd.com&gt;
Reviewed-by: Hawking Zhang &lt;Hawking.Zhang@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</content>
</entry>
<entry>
<title>drm/amdgpu: Prefer RAS recovery for scheduler hang</title>
<updated>2024-12-10T15:26:46Z</updated>
<author>
<name>Lijo Lazar</name>
<email>lijo.lazar@amd.com</email>
</author>
<published>2024-10-24T05:31:57Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=e1ee2111ca48169a9fdc5075f7863f5d4d591e2f'/>
<id>urn:sha1:e1ee2111ca48169a9fdc5075f7863f5d4d591e2f</id>
<content type='text'>
Before scheduling a recovery due to scheduler/job hang, check if a RAS
error is detected. If so, choose RAS recovery to handle the situation. A
scheduler/job hang could be the side effect of a RAS error. In such
cases, it is required to go through the RAS error recovery process. A
RAS error recovery process in certains cases also could avoid a full
device device reset.

An error state is maintained in RAS context to detect the block
affected. Fatal Error state uses unused block id. Set the block id when
error is detected. If the interrupt handler detected a poison error,
it's not required to look for a fatal error. Skip fatal error checking
in such cases.

Signed-off-by: Lijo Lazar &lt;lijo.lazar@amd.com&gt;
Reviewed-by: Hawking Zhang &lt;Hawking.Zhang@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</content>
</entry>
<entry>
<title>drm/amdgpu: Implement virt req_ras_err_count</title>
<updated>2024-11-11T16:55:42Z</updated>
<author>
<name>Victor Skvortsov</name>
<email>victor.skvortsov@amd.com</email>
</author>
<published>2024-10-30T14:18:00Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=84a2947ecc85c67f433f2cc2186e54cdb9047b61'/>
<id>urn:sha1:84a2947ecc85c67f433f2cc2186e54cdb9047b61</id>
<content type='text'>
Enable RAS late init  if VF RAS Telemetry is supported.

When enabled, the VF can use this interface to query total
RAS error counts from the host.

The VF FB access may abruptly end due to a fatal error,
therefore the VF must cache and sanitize the input.

The Host allows 15 Telemetry messages every 60 seconds, afterwhich
the host will ignore any more in-coming telemetry messages. The VF will
rate limit its msg calling to once every 5 seconds (12 times in 60 seconds).
While the VF is rate limited, it will continue to report the last
good cached data.

v2: Flip generate report &amp; update statistics order for VF

Signed-off-by: Victor Skvortsov &lt;victor.skvortsov@amd.com&gt;
Acked-by: Tao Zhou &lt;tao.zhou1@amd.com&gt;
Reviewed-by: Zhigang Luo &lt;zhigang.luo@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</content>
</entry>
<entry>
<title>drm/amdgpu: Add helper to initialize badpage info</title>
<updated>2024-09-26T21:06:38Z</updated>
<author>
<name>Lijo Lazar</name>
<email>lijo.lazar@amd.com</email>
</author>
<published>2024-08-30T05:51:43Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=b17f87329d49860130a524ab424ecefd3332600f'/>
<id>urn:sha1:b17f87329d49860130a524ab424ecefd3332600f</id>
<content type='text'>
Add a separate function to read badpage data during initialization.
Reading bad pages will need hardware access and cannot be done during
reset. Hence in cases where device needs a full reset during
init itself, attempting to read will cause a deadlock.

Signed-off-by: Lijo Lazar &lt;lijo.lazar@amd.com&gt;
Reviewed-by: Feifei Xu &lt;Feifei.Xu@amd.com&gt;
Reviewed-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
Acked-by: Rajneesh Bhardwaj &lt;rajneesh.bhardwaj@amd.com&gt;
Tested-by: Rajneesh Bhardwaj &lt;rajneesh.bhardwaj@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</content>
</entry>
<entry>
<title>drm/amdgpu: remove RAS unused paramter 'err_addr'</title>
<updated>2024-08-06T15:11:01Z</updated>
<author>
<name>Yang Wang</name>
<email>kevinyang.wang@amd.com</email>
</author>
<published>2024-08-02T02:11:37Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=671af06690e7f79db51b475a35c3b2619f345abc'/>
<id>urn:sha1:671af06690e7f79db51b475a35c3b2619f345abc</id>
<content type='text'>
- amdgpu_ras_error_statistic_ue_count()
- amdgpu_ras_error_statistic_ce_count()
- amdgpu_ras_error_statistic_de_count()

The parameter 'err_addr' is no longer used since following patch.

Fixes: a7e8467fbeee ("drm/amdgpu: Remove unused code")
Signed-off-by: Yang Wang &lt;kevinyang.wang@amd.com&gt;
Reviewed-by: Hawking Zhang &lt;Hawking.Zhang@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</content>
</entry>
<entry>
<title>drm/amdgpu: create function to check RAS RMA status</title>
<updated>2024-08-06T15:11:01Z</updated>
<author>
<name>Tao Zhou</name>
<email>tao.zhou1@amd.com</email>
</author>
<published>2024-08-01T06:26:19Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=792be2e23ac69821db7860ba4ba94592101f0b07'/>
<id>urn:sha1:792be2e23ac69821db7860ba4ba94592101f0b07</id>
<content type='text'>
In the convenience of calling it globally.

Signed-off-by: Tao Zhou &lt;tao.zhou1@amd.com&gt;
Reviewed-by: Hawking Zhang &lt;Hawking.Zhang@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</content>
</entry>
</feed>
