<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/lib/crc/arm64, branch master</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/</subtitle>
<id>https://git.shady.money/linux/atom?h=master</id>
<link rel='self' href='https://git.shady.money/linux/atom?h=master'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/'/>
<updated>2026-04-02T23:14:53Z</updated>
<entry>
<title>lib/crc: arm64: Simplify intrinsics implementation</title>
<updated>2026-04-02T23:14:53Z</updated>
<author>
<name>Ard Biesheuvel</name>
<email>ardb@kernel.org</email>
</author>
<published>2026-03-30T14:46:35Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=8fdef85d601db670e9c178314eedffe7bbb07e52'/>
<id>urn:sha1:8fdef85d601db670e9c178314eedffe7bbb07e52</id>
<content type='text'>
NEON intrinsics are useful because they remove the need for manual
register allocation, and because the resulting code can be recompiled and
optimized for different micro-architectures, as well as shared between
arm64 and 32-bit ARM.

However, the strong typing of the vector variables can lead to
incomprehensible gibberish, as is the case with the new CRC64
implementation. To address this, let's repaint all variables as
uint64x2_t to minimize the number of vreinterpretq_xxx() calls, and to
be able to rely on the ^ operator for exclusive OR operations. This
makes the code much more concise and readable.

While at it, wrap the calls to vmull_p64() et al. to give them a more
consistent calling convention, and encapsulate the vreinterpret() calls
that are still needed.
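
As an illustration of the intended style, a minimal sketch (the wrapper
name clmul_lo() is made up for this example and is not the exact helper
from the patch):

  #include &lt;arm_neon.h&gt;

  /* Carryless-multiply the low 64-bit lanes of a and b, returning the
   * 128-bit product as uint64x2_t so callers can combine results with ^. */
  static inline uint64x2_t clmul_lo(uint64x2_t a, uint64x2_t b)
  {
          return vreinterpretq_u64_p128(
                  vmull_p64((poly64_t)vgetq_lane_u64(a, 0),
                            (poly64_t)vgetq_lane_u64(b, 0)));
  }

  /* e.g.:  acc = clmul_lo(x, fold_consts) ^ acc;  */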

Signed-off-by: Ard Biesheuvel &lt;ardb@kernel.org&gt;
Link: https://lore.kernel.org/r/20260330144630.33026-11-ardb@kernel.org
Signed-off-by: Eric Biggers &lt;ebiggers@kernel.org&gt;
</content>
</entry>
<entry>
<title>lib/crc: arm64: Drop unnecessary chunking logic from crc64</title>
<updated>2026-04-02T23:14:53Z</updated>
<author>
<name>Ard Biesheuvel</name>
<email>ardb@kernel.org</email>
</author>
<published>2026-03-30T14:46:32Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=e0718ed60d60299840cfc2a408eb26042a20d186'/>
<id>urn:sha1:e0718ed60d60299840cfc2a408eb26042a20d186</id>
<content type='text'>
On arm64, kernel mode NEON executes with preemption enabled, so there is
no need to chunk the input by hand.
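
Roughly, the simplification looks like this (illustrative sketch; the
helper name crc64_nvme_pmull() is a stand-in, not the exact symbol):

  /* before: enter and leave the SIMD guard once per 4 KiB chunk */
  do {
          size_t chunk = min(len, SZ_4K);

          scoped_ksimd()
                  crc = crc64_nvme_pmull(crc, p, chunk);
          p += chunk;
          len -= chunk;
  } while (len);

  /* after: preemption stays enabled, so one guard covers the whole buffer */
  scoped_ksimd()
          crc = crc64_nvme_pmull(crc, p, len);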

Signed-off-by: Ard Biesheuvel &lt;ardb@kernel.org&gt;
Link: https://lore.kernel.org/r/20260330144630.33026-8-ardb@kernel.org
Signed-off-by: Eric Biggers &lt;ebiggers@kernel.org&gt;
</content>
</entry>
<entry>
<title>lib/crc: arm64: Assume a little-endian kernel</title>
<updated>2026-04-02T23:13:18Z</updated>
<author>
<name>Eric Biggers</name>
<email>ebiggers@kernel.org</email>
</author>
<published>2026-04-01T00:44:31Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=5276ea17a23c829d4e4417569abff71a1c8342d9'/>
<id>urn:sha1:5276ea17a23c829d4e4417569abff71a1c8342d9</id>
<content type='text'>
Since support for big-endian arm64 kernels was removed, the CPU_LE()
macro now unconditionally emits the code it is passed, and the CPU_BE()
macro now unconditionally discards the code it is passed.

Simplify the assembly code in lib/crc/arm64/ accordingly.
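
For context, the two macros now behave roughly as follows (simplified
sketch of their little-endian-only form, not the verbatim definitions):

  #define CPU_LE(code...)  code  /* always emitted */
  #define CPU_BE(code...)        /* always discarded */

  /* so CPU_LE( rev64 v0.16b, v0.16b ) can be written as plain
   * rev64 v0.16b, v0.16b, and CPU_BE(...) lines can be dropped. */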

Reviewed-by: Ard Biesheuvel &lt;ardb@kernel.org&gt;
Link: https://lore.kernel.org/r/20260401004431.151432-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers &lt;ebiggers@kernel.org&gt;
</content>
</entry>
<entry>
<title>lib/crc: arm64: add NEON accelerated CRC64-NVMe implementation</title>
<updated>2026-03-29T20:22:13Z</updated>
<author>
<name>Demian Shulhan</name>
<email>demyansh@gmail.com</email>
</author>
<published>2026-03-29T07:43:38Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=63432fd625372a0e79fb00a4009af204f4edc013'/>
<id>urn:sha1:63432fd625372a0e79fb00a4009af204f4edc013</id>
<content type='text'>
Implement an optimized CRC64 (NVMe) algorithm for arm64 using NEON
Polynomial Multiply Long (PMULL) instructions. The generic software
implementation is slow, which creates a bottleneck in NVMe and other
storage subsystems.

The acceleration is implemented using C intrinsics (&lt;arm_neon.h&gt;) rather
than raw assembly for better readability and maintainability.

Key highlights of this implementation:
- Uses 4KB chunking inside scoped_ksimd() to avoid preemption latency
  spikes on large buffers.
- Pre-calculates and loads fold constants via vld1q_u64() to minimize
  register spilling.
- Benchmarks show the break-even point against the generic implementation
  is around 128 bytes. The PMULL path is enabled only for len &gt;= 128.

Performance results (kunit crc_benchmark on Cortex-A72):
- Generic (len=4096): ~268 MB/s
- PMULL (len=4096): ~1556 MB/s (nearly 6x improvement)
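
For reference, the dispatch described above boils down to roughly this
shape (names such as crc64_nvme_pmull() are illustrative, not taken from
the patch):

  static inline u64 crc64_nvme_arch(u64 crc, const u8 *p, size_t len)
  {
          if (len &lt; 128 || !may_use_simd())
                  return crc64_nvme_generic(crc, p, len);

          /* crc64_nvme_pmull() handles the 4 KiB chunking and the
           * scoped_ksimd() sections described above, and loads the
           * precomputed fold constants with vld1q_u64() on entry. */
          return crc64_nvme_pmull(crc, p, len);
  }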

Signed-off-by: Demian Shulhan &lt;demyansh@gmail.com&gt;
Link: https://lore.kernel.org/r/20260329074338.1053550-1-demyansh@gmail.com
Signed-off-by: Eric Biggers &lt;ebiggers@kernel.org&gt;
</content>
</entry>
<entry>
<title>lib/crc: Switch ARM and arm64 to 'ksimd' scoped guard API</title>
<updated>2025-11-12T08:52:01Z</updated>
<author>
<name>Ard Biesheuvel</name>
<email>ardb@kernel.org</email>
</author>
<published>2025-10-01T11:27:08Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=4fb623074ea537524d06598acbb5517f027f3b53'/>
<id>urn:sha1:4fb623074ea537524d06598acbb5517f027f3b53</id>
<content type='text'>
Before modifying the prototypes of kernel_neon_begin() and
kernel_neon_end() to accommodate kernel mode FP/SIMD state buffers
allocated on the stack, move arm64 to the new 'ksimd' scoped guard API,
which encapsulates the calls to those functions.

For symmetry, do the same for 32-bit ARM too.
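
The conversion itself follows one simple pattern; a minimal sketch (the
PMULL helper name is illustrative):

  /* before */
  kernel_neon_begin();
  crc = crc_t10dif_pmull(crc, data, length);
  kernel_neon_end();

  /* after */
  scoped_ksimd()
          crc = crc_t10dif_pmull(crc, data, length);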

Reviewed-by: Eric Biggers &lt;ebiggers@kernel.org&gt;
Reviewed-by: Jonathan Cameron &lt;jonathan.cameron@huawei.com&gt;
Acked-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Signed-off-by: Ard Biesheuvel &lt;ardb@kernel.org&gt;
</content>
</entry>
<entry>
<title>lib/crc: Drop inline from all *_mod_init_arch() functions</title>
<updated>2025-08-16T02:06:08Z</updated>
<author>
<name>Eric Biggers</name>
<email>ebiggers@kernel.org</email>
</author>
<published>2025-08-16T02:02:40Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=5ff74f5f71f83cce3c920cd17940df0fe0401865'/>
<id>urn:sha1:5ff74f5f71f83cce3c920cd17940df0fe0401865</id>
<content type='text'>
Drop 'inline' from all the *_mod_init_arch() functions so that the
compiler will warn about any bugs where they are unused due to not being
wired up properly.  (There are no such bugs currently, so this just
establishes a more robust convention for the future.  Of course, these
functions also tend to get inlined anyway, regardless of the keyword.)
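
To illustrate the effect (sketch only; the function name and empty body
are placeholders):

  /* 'static inline': an unused definition is silently discarded */
  static inline void crc32_mod_init_arch(void) { }

  /* plain 'static': the same unused definition now triggers
   * -Wunused-function, flagging a missing call from module init */
  static void crc32_mod_init_arch(void) { }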

Link: https://lore.kernel.org/r/20250816020240.431545-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers &lt;ebiggers@kernel.org&gt;
</content>
</entry>
<entry>
<title>lib/crc: Use underlying functions instead of crypto_simd_usable()</title>
<updated>2025-08-11T18:28:00Z</updated>
<author>
<name>Eric Biggers</name>
<email>ebiggers@kernel.org</email>
</author>
<published>2025-08-11T18:26:31Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=c2a0c5156a40c40edb0cce80ce11c97ab39c67e3'/>
<id>urn:sha1:c2a0c5156a40c40edb0cce80ce11c97ab39c67e3</id>
<content type='text'>
Since crc_kunit now tests the fallback code paths without using
crypto_simd_disabled_for_test, make the CRC code use the underlying
may_use_simd() and irq_fpu_usable() functions directly instead of
crypto_simd_usable().  This eliminates an unnecessary layer of
indirection.

Take the opportunity to add likely() and unlikely() annotations as well.
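
On arm64, the resulting pattern is roughly (sketch; PMULL_MIN_LEN and the
helper names are illustrative):

  if (likely(length &gt;= PMULL_MIN_LEN &amp;&amp; may_use_simd())) {
          scoped_ksimd()
                  crc = crc_t10dif_pmull(crc, data, length);
  } else {
          crc = crc_t10dif_generic(crc, data, length);
  }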

Link: https://lore.kernel.org/r/20250811182631.376302-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers &lt;ebiggers@kernel.org&gt;
</content>
</entry>
<entry>
<title>lib/crc: arm64: Migrate optimized CRC code into lib/crc/</title>
<updated>2025-06-30T16:31:57Z</updated>
<author>
<name>Eric Biggers</name>
<email>ebiggers@kernel.org</email>
</author>
<published>2025-06-07T20:04:46Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/linux/commit/?id=2b7531b2a2037959ac81ff2c95e4557b30cfd253'/>
<id>urn:sha1:2b7531b2a2037959ac81ff2c95e4557b30cfd253</id>
<content type='text'>
Move the arm64-optimized CRC code from arch/arm64/lib/crc* into its new
location in lib/crc/arm64/, and wire it up in the new way.  This new way
of organizing the CRC code eliminates the need to artificially split the
code for each CRC variant into separate arch and generic modules,
enabling better inlining and dead code elimination.  For more details,
see "lib/crc: Prepare for arch-optimized code in subdirs of lib/crc/".

Reviewed-by: "Martin K. Petersen" &lt;martin.petersen@oracle.com&gt;
Acked-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Acked-by: "Jason A. Donenfeld" &lt;Jason@zx2c4.com&gt;
Link: https://lore.kernel.org/r/20250607200454.73587-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers &lt;ebiggers@kernel.org&gt;
</content>
</entry>
</feed>
