summaryrefslogtreecommitdiffstats
path: root/drivers/gpu/drm/amd/ras
AgeCommit message (Collapse)AuthorLines
2026-03-23drm/amd/ras: Add input pointer validation in ras core helpersSrinivasan Shanmugam-0/+9
Add NULL checks for helper input/output pointers that are directly dereferenced, such as tm, seqno, dev_info and init_config. Cc: Tao Zhou <tao.zhou1@amd.com> Cc: YiPeng Chai <YiPeng.Chai@amd.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: YiPeng Chai <YiPeng.Chai@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-23drm/amd/ras: Add NULL checks for ras_core sys_fn callbacksSrinivasan Shanmugam-0/+13
Some ras core helper functions access ras_core and its callback table (sys_fn) without validating them first. Cc: Tao Zhou <tao.zhou1@amd.com> Cc: YiPeng Chai <YiPeng.Chai@amd.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: YiPeng Chai <YiPeng.Chai@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-23drm/amd/ras: Remove redundant NULL check in pending bad-bank list iterationSrinivasan Shanmugam-1/+1
ras_umc_log_pending_bad_bank() walks through a list of pending ECC bad-bank entries. These entries are saved when a bad-bank error cannot be processed immediately, for example during a GPU reset. Later, this function iterates over the pending list and retries logging each bad-bank error. If logging succeeds, the entry is removed from the list and the memory for that node is freed. The loop uses list_for_each_entry_safe(), which already guarantees that ecc_node points to a valid list entry while the loop body is executing. Checking "ecc_node &&" inside the loop is therefore unnecessary and redundant. Fixes the below: drivers/gpu/drm/amd/amdgpu/../ras/rascore/ras_umc.c:225 ras_umc_log_pending_bad_bank() warn: variable dereferenced before check 'ecc_node' (see line 223) Fixes: 7a3f9c0992c4 ("drm/amd/ras: Add umc common ras functions") Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: YiPeng Chai <YiPeng.Chai@amd.com> Cc: Tao Zhou <tao.zhou1@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: YiPeng Chai <YiPeng.Chai@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-17drm/amd/ras: Fix NULL deref in ras_core_get_utc_second_timestamp()Srinivasan Shanmugam-2/+5
ras_core_get_utc_second_timestamp() retrieves the current UTC timestamp (in seconds since the Unix epoch) through a platform-specific RAS system callback and is used for timestamping RAS error events. The function checks ras_core in the conditional statement before calling the sys_fn callback. However, when the condition fails, the function prints an error message using ras_core->dev. If ras_core is NULL, this can lead to a potential NULL pointer dereference when accessing ras_core->dev. Add an early NULL check for ras_core at the beginning of the function and return 0 when the pointer is not valid. This prevents the dereference and makes the control flow clearer. Fixes: 13c91b5b4378 ("drm/amd/ras: Add rascore unified interface function") Cc: YiPeng Chai <YiPeng.Chai@amd.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Tao Zhou <tao.zhou1@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: YiPeng Chai <YiPeng.Chai@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-17drm/amd/ras: Fix NULL deref in ras_core_ras_interrupt_detected()Srinivasan Shanmugam-1/+3
Fixes a NULL pointer dereference when ras_core is NULL and ras_core->dev is accessed in the error path. Fixes: 13c91b5b4378 ("drm/amd/ras: Add rascore unified interface function") Reported by: Dan Carpenter <dan.carpenter@linaro.org> Cc: YiPeng Chai <YiPeng.Chai@amd.com> Cc: Tao Zhou <tao.zhou1@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-17drm/amd/ras: Pass ras poison consumption message to sriov hostYiPeng Chai-0/+10
Pass ras poison consumption message to sriov host. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-17drm/amd/ras: Add unified interface to handle ras interruptsYiPeng Chai-0/+32
Add unified interface to handle ras interrupts, some redundant interrupt function interfaces will be removed later. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-04drm/amd/ras: adapt sync info func for pmfw eepromGangliang Xie-1/+19
adapt sync info func for pmfw eeprom Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-04drm/amd/ras: add check func for pmfw eepromGangliang Xie-9/+67
add check func for pmfw eeprom Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-04drm/amd/ras: add initialization func for pmfw eepromGangliang Xie-3/+98
add initialization func for pmfw eeprom Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-04drm/amd/ras: adapt page retirement process for pmfw eepromGangliang Xie-10/+66
read bad page data from pmfw eeprom when retirement is triggered, use timestamp read from eeprom Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-04drm/amd/ras: add read func for pmfw eepromGangliang Xie-9/+101
add read func for pmfw eeprom, and adapt address converting for bad pages loaded from pmfw eeprom v2: change label 'Out' to 'out' Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-04drm/amd/ras: make MCA IPID parse globalTao Zhou-0/+16
add a new IPID parse interface for umc, so we can implement it for each ASIC, and so we can call it in other blocks Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-04drm/amd/ras: add append func for pmfw eepromGangliang Xie-3/+48
add append func for pmfw eeprom Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-04drm/amd/ras: add check safety watermark func for pmfw eepromGangliang Xie-0/+37
add check safety watermark func for pmfw eeprom Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-02drm/amd/ras: Add table reset func for pmfw eepromGangliang Xie-2/+63
add table reset func for pmfw eeprom, add smu eeprom control structure Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-02drm/amd/ras: add wrapper funcs for pmfw eepromGangliang Xie-0/+141
add wrapper funcs for pmfw eeprom interface to make them easier to be called Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-02drm/amd/ras: add uniras smu feature flag init funcGangliang Xie-1/+74
add flag to indicate if pmfw eeprom is supported or not, and initialize it v2: change copyright from 2025 to 2026 Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-02drm/amd/ras: add pmfw eeprom smu interfacesGangliang Xie-0/+64
add smu interfaces and its data structures for pmfw eeprom in uniras v2: add 'const' to smu messages array, and specify index for each member when initializing. Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-02drm/amdgpu: Fix static assertion failure issueYiPeng Chai-2/+2
Since the PAGE_SIZE is 8KB on sparc64, the size of structure amdsriov_ras_telemetry will exceed 64KB, so use absolute value to fix the buffer size. Fixes the issue: drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h:522:2: error: static assertion failed due to requirement 'sizeof(struct amdsriov_ras_telemetry) <= 64 << 10': amdsriov_ras_telemetry must be 64 KB | sizeof(struct amdsriov_ras_telemetry) <= AMD_SRIOV_MSG_RAS_TELEMETRY_SIZE_KB_V1 << 10, drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h:522:40: note: expression evaluates to '115616 <= 65536' | sizeof(struct amdsriov_ras_telemetry) <= AMD_SRIOV_MSG_RAS_TELEMETRY_SIZE_KB_V1 << 10, Fixes: cb48a6b2b61d ("drm/amd/ras: use dedicated memory as vf ras command buffer") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202602261700.rVOLIw4l-lkp@intel.com/ Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-02drm/amd/ras: fix kernel-doc warning for ras_eeprom_append()Yujie Liu-2/+2
Warning: drivers/gpu/drm/amd/amdgpu/../ras/rascore/ras_eeprom.c:845 function parameter 'ras_core' not described in 'ras_eeprom_append' Warning: drivers/gpu/drm/amd/amdgpu/../ras/rascore/ras_eeprom.c:845 expecting prototype for ras_core_eeprom_append(). Prototype was for ras_eeprom_append() instead Fixes: 5c3be5defc92 ("drm/amd/ras: Add eeprom ras functions") Signed-off-by: Yujie Liu <yujie.liu@intel.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-02drm/amd/ras: Fix type size of remainder argumentKees Cook-2/+4
Forcing an int to be dereferenced at uint64_t for div64_u64_rem() runs the risk of endian confusion and stack overflowing writes. Seen while preparing to enable -Warray-bounds globally: In file included from ../arch/x86/include/asm/processor.h:35, from ../include/linux/sched.h:13, from ../include/linux/ratelimit.h:6, from ../include/linux/dev_printk.h:16, from ../drivers/gpu/drm/amd/amdgpu/../ras/ras_mgr/ras_sys.h:29, from ../drivers/gpu/drm/amd/amdgpu/../ras/rascore/ras.h:27, from ../drivers/gpu/drm/amd/amdgpu/../ras/rascore/ras_core.c:24: In function 'div64_u64_rem', inlined from 'ras_core_convert_timestamp_to_time' at ../drivers/gpu/drm/amd/amdgpu/../ras/rascore/ras_core.c:72:9: ../include/linux/math64.h:56:20: error: array subscript 'u64 {aka long long unsigned int}[0]' is partly outside array bounds of 'int[1]' [-Werror=array-bounds=] 56 | *remainder = dividend % divisor; | ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~ ../drivers/gpu/drm/amd/amdgpu/../ras/rascore/ras_core.c: In function 'ras_core_convert_timestamp_to_time': ../drivers/gpu/drm/amd/amdgpu/../ras/rascore/ras_core.c:70:19: note: object 'remaining_seconds' of size 4 70 | int days, remaining_seconds; | ^~~~~~~~~~~~~~~~~ Use a 64-bit type for the remainder calculation, but leave remaining_seconds as 32-bit to avoid 64-bit division later. The value of remainder will always be less than seconds_per_day, so there's no truncation risk. Fixes: ace232eff50e ("drm/amdgpu: Add ras module files into amdgpu") Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-02-25drm/amd/ras: Add function to convert retired addressJinzhou Su-0/+29
Add function to convert retired address in SR-IOV guest. Signed-off-by: Jinzhou Su <jinzhou.su@amd.com> Reviewed-by: YiPeng Chai <YiPeng.Chai@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-02-25drm/amd/ras: Handle address check in SR-IOV guestJinzhou Su-0/+22
Handle address check validity command in SR-IOV guest Signed-off-by: Jinzhou Su <jinzhou.su@amd.com> Reviewed-by: YiPeng Chai <YiPeng.Chai@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-02-25drm/amd/ras: Add convert retired address structureJinzhou Su-0/+14
Add convert retired address command and structure for uniras. Signed-off-by: Jinzhou Su <jinzhou.su@amd.com> Reviewed-by: YiPeng Chai <YiPeng.Chai@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-02-25drm/amd/ras: Add address check structureJinzhou Su-0/+15
Add address check command and data structure for uniras. Signed-off-by: Jinzhou Su <jinzhou.su@amd.com> Reviewed-by: YiPeng Chai <YiPeng.Chai@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-02-25drm/amd/ras: use dedicated memory as vf ras command bufferYiPeng Chai-34/+96
Use dedicated memory as vf ras command buffer. V2: Add lock to ensure serialization of sending vf ras commands. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Jinzhou Su <jinzhou.su@amd.com> Tested-by: Jinzhou Su <jinzhou.su@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-02-21Convert more 'alloc_obj' cases to default GFP_KERNEL argumentsLinus Torvalds-2/+1
This converts some of the visually simpler cases that have been split over multiple lines. I only did the ones that are easy to verify the resulting diff by having just that final GFP_KERNEL argument on the next line. Somebody should probably do a proper coccinelle script for this, but for me the trivial script actually resulted in an assertion failure in the middle of the script. I probably had made it a bit _too_ trivial. So after fighting that far a while I decided to just do some of the syntactically simpler cases with variations of the previous 'sed' scripts. The more syntactically complex multi-line cases would mostly really want whitespace cleanup anyway. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21Convert 'alloc_obj' family to use the new default GFP_KERNEL argumentLinus Torvalds-9/+9
This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21treewide: Replace kmalloc with kmalloc_obj for non-scalar typesKees Cook-11/+11
This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-05drm/amd/ras: statistic xgmi training error countStanley.Yang-1/+1
Report xgmi training error uncorrectable error count. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-05drm/amd/ras: Replace NPS flags in ras moduleJinzhou Su-1/+1
Replace AMDGPU_NPS8_PARTITION_MODE with UMC_MEMORY_PARTITION_MODE_NPS8 to pass sriov compilation. Signed-off-by: Jinzhou Su <jinzhou.su@amd.com> Reviewed-by: YiPeng Chai <YiPeng.Chai@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-05drm/amd/ras: Support physical address convertJinzhou Su-11/+68
Support physical address convert to current NPS pages in uniras. Signed-off-by: Jinzhou Su <jinzhou.su@amd.com> Reviewed-by: YiPeng Chai <YiPeng.Chai@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-12-10drm/amd/ras: Add vram_type to ras_ta_init_flagsCandice Li-0/+4
Add vram_type to ras_ta_init_flags. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-12-10drm/amd/ras: Reduce stack usage in amdgpu_virt_ras_get_cper_records()Srinivasan Shanmugam-4/+13
amdgpu_virt_ras_get_cper_records() was using a large stack array of ras_log_info pointers. This contributed to the frame size warning on this function. Replace the fixed-size stack array: struct ras_log_info *trace[MAX_RECORD_PER_BATCH]; with a heap-allocated array using kcalloc(). We free the trace buffer together with out_buf on all exit paths. If allocation of trace or out_buf fails, we return a generic RAS error code. This reduces stack usage and keeps the runtime behaviour unchanged. Fixes: stack frame size: 1112 bytes (limit: 1024) Cc: Tao Zhou <tao.zhou1@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-12-08drm/amd/ras: Reduce stack usage in ras_umc_handle_bad_pages()Srinivasan Shanmugam-8/+21
ras_umc_handle_bad_pages() function used a large local array: struct eeprom_umc_record records[MAX_ECC_NUM_PER_RETIREMENT]; Move this array off the stack by allocating it with kcalloc() and freeing it before return. This reduces the stack frame size of ras_umc_handle_bad_pages() and avoids the frame size warning. Fixes the below: drivers/gpu/drm/amd/amdgpu/../ras/rascore/ras_umc.c:498:5: warning: stack frame size (1208) exceeds limit (1024) in 'ras_umc_handle_bad_pages' [-Wframe-larger-than] v2: Removed the duplicate ras_umc_get_new_records() invocation. (Lijo) Cc: Tao Zhou <tao.zhou1@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-12-08drm/amd/ras: Compatible with legacy sriov hostYiPeng Chai-0/+36
If sriov host is legacy, the guest uniras will be disabled. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-12-08drm/amd/ras: Add sriov ras preprocessing before gpu resetYiPeng Chai-1/+23
Sriov host may clear all VF commands registered to auto update list during VF reset, set ecc.auto_uUpdate block to false before VF reset, and after VF reset is complete, RAS_CMD__GET_ALL_BLOCK_ECC_STATUS command will be re-registered to auto update list of sriov host. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-12-08drm/amd/ras: Support high-frequency querying sriov ras block error countYiPeng Chai-0/+154
Support high-frequency querying sriov ras block error count: 1. Create shared memory and fills it with RAS_CMD__GET_LAL_LOC_STATUS ras command. 2. The RAS_CMD_GET_ALL_BLOCK_ECC_STATUS command and shared memory are registered to sriov host ras auto-update list via RAS_CMD_SET_CMD_AUTO_UPDATE command. 3. Once sriov host detects ras error, it will automatically execute RAS_CMD__GET_ALL_BLOCK_ECC_STATUS command and write the result to shared memory. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-12-08drm/amd/ras: Add ras command to retrieve cper data from sriov hostYiPeng Chai-1/+173
In order to reduce the number of interactions with sriov host and the amount of data exchanged, a set of ras commands is first used to obtain the raw data used to generate cper from the host, then, guest driver generates cper based on the obtained raw data. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-12-08drm/amd/ras: sriov supports handling VF ras commands.YiPeng Chai-10/+207
Add basic framework code to sriov to handle VF ras commands. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-06drm/amd/ras: ras supports i2c eeprom for mp1 v13_0_12YiPeng Chai-0/+1
ras supports i2c eeprom for mp1 v13_0_12. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-04drm/amdgpu: suspend ras module before gpu resetYiPeng Chai-0/+110
During gpu reset, all GPU-related resources are inaccessible. To avoid affecting ras functionality, suspend ras module before gpu reset and resume it after gpu reset is complete. V2: Rename functions to avoid misunderstanding. V3: Move flush_delayed_work to amdgpu_ras_process_pause, Move schedule_delayed_work to amdgpu_ras_process_unpause. V4: Rename functions. V5: Move the function to amdgpu_ras.c. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Acked-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-04drm/amd/ras: Add ras support for umc v12_5_0YiPeng Chai-1/+3
Add ras support for umc v12_5_0. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-04drm/amd/ras: Add ras support for nbio v7_9_1YiPeng Chai-1/+3
Add ras support for nbio v7_9_1. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-04drm/amd/ras: Increase ras switch control rangeYiPeng Chai-6/+19
Increase ras switch control range. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-04drm/amd/ras: Fix format truncationXiang Liu-2/+2
../ras/rascore/ras_cper.c: In function ‘cper_generate_fatal_record.isra’: ../ras/rascore/ras_cper.c:75:36: error: ‘%llX’ directive output may be truncated writing between 1 and 14 bytes into a region of size between 0 and 7 [-Werror=format-truncation=] 75 | snprintf(record_id, 9, "%d:%llX", dev_info.socket_id, | ^~~~ ../ras/rascore/ras_cper.c:75:32: note: directive argument in the range [0, 72057594037927935] 75 | snprintf(record_id, 9, "%d:%llX", dev_info.socket_id, | ^~~~~~~~~ ../ras/rascore/ras_cper.c:75:9: note: ‘snprintf’ output between 4 and 27 bytes into a destination of size 9 75 | snprintf(record_id, 9, "%d:%llX", dev_info.socket_id, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 76 | RAS_LOG_SEQNO_TO_BATCH_IDX(trace->seqno)); | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ../ras/rascore/ras_cper.c: In function ‘cper_generate_runtime_record.isra’: ../ras/rascore/ras_cper.c:75:36: error: ‘%llX’ directive output may be truncated writing between 1 and 14 bytes into a region of size between 0 and 7 [-Werror=format-truncation=] 75 | snprintf(record_id, 9, "%d:%llX", dev_info.socket_id, | ^~~~ ../ras/rascore/ras_cper.c:75:32: note: directive argument in the range [0, 72057594037927935] 75 | snprintf(record_id, 9, "%d:%llX", dev_info.socket_id, | ^~~~~~~~~ ../ras/rascore/ras_cper.c:75:9: note: ‘snprintf’ output between 4 and 27 bytes into a destination of size 9 75 | snprintf(record_id, 9, "%d:%llX", dev_info.socket_id, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 76 | RAS_LOG_SEQNO_TO_BATCH_IDX(trace->seqno)); | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ cc1: all warnings being treated as errors Signed-off-by: Xiang Liu <xiang.liu@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-04drm/amd/ras: Use correct severity for BP threshold exceed eventXiang Liu-2/+2
The severity of CPER for BP threshold exceed event should be set as FATAL to match the OOB implementation. Signed-off-by: Xiang Liu <xiang.liu@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-04drm/amd/ras: Correct info field of bad page threshold exceed CPERXiang Liu-3/+8
Correct valid_bits and ms_chk_bits of section info field for bad page threshold exceed CPER to match OOB's behavior. Signed-off-by: Xiang Liu <xiang.liu@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-04drm/amd/ras: Update IPID value for bad page threshold CPERXiang Liu-1/+8
The IPID register value for bad page threshold CPER holds socket_id info now according to the latest definition. Signed-off-by: Xiang Liu <xiang.liu@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>