Accelerating Iterative SpMV for Discrete Logarithm Problem Using GPUs
In the context of cryptanalysis, computing discrete logarithms in large cyclic groups using index-calculus methods such as the number field sieve or the function field sieve requires solving large sparse systems of linear equations modulo the group order. Most of the fast algorithms used to solve such systems (e.g., conjugate gradient, Lanczos, and Wiedemann) repeatedly multiply the sparse matrix by a vector (SpMV). This central operation can be accelerated on GPUs using specific computing models and addressing patterns that increase arithmetic intensity while reducing irregular memory accesses. In this work, we investigate the implementation of SpMV kernels on NVIDIA GPUs for several in-memory representations of the sparse matrix. We explore the use of Residue Number System (RNS) arithmetic to accelerate modular operations. We target linear systems arising from attacks on the discrete logarithm problem in groups of size 100 to 1000 bits, which covers the relevant range for current cryptanalytic computations. The proposed SpMV implementation contributed to solving the discrete logarithm problem in GF($2^{619}$) and GF($2^{809}$) using the FFS algorithm.
💡 Research Summary
The paper addresses a critical bottleneck in modern cryptanalysis: solving the large sparse linear systems that arise in index‑calculus attacks on the discrete logarithm problem (DLP), particularly for groups of size 100 to 1000 bits. Such systems are solved by iterative algorithms (conjugate gradient, Lanczos, Wiedemann) whose core operation is a sparse matrix‑vector multiplication (SpMV). On conventional CPUs, SpMV is limited by irregular memory accesses and low arithmetic intensity, making the overall DLP computation prohibitively expensive.
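The core operation the summary describes, multiplying a sparse matrix by a vector modulo the group order, can be sketched in a few lines of plain Python. This is an illustrative sketch using the conventional CSR array names (`row_ptr`, `col_idx`, `values`), not the authors' CUDA code:

```python
def spmv_csr_mod(row_ptr, col_idx, values, x, p):
    """Compute y = A @ x (mod p) for a matrix A stored in CSR form."""
    n = len(row_ptr) - 1
    y = [0] * n
    for i in range(n):
        acc = 0
        # nonzeros of row i live in values[row_ptr[i]:row_ptr[i+1]]
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc % p  # one modular reduction per row
    return y

# Tiny 2x3 example: A = [[1, 0, 2], [0, 3, 0]]
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
values  = [1, 2, 3]
x = [5, 7, 11]
print(spmv_csr_mod(row_ptr, col_idx, values, x, 13))  # → [1, 8]
```

On a GPU, the inner loop over `col_idx[k]` is exactly the irregular (gather) access pattern that the storage formats and memory optimizations below are designed to tame.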
To overcome these limitations, the authors design and evaluate SpMV kernels for NVIDIA GPUs. They explore three matrix storage schemes—Compressed Sparse Row (CSR), ELLPACK, and a hybrid CSR/ELLPACK format—analyzing how each impacts warp‑level alignment, memory coalescing, and padding overhead. CSR offers compact storage but suffers from warp divergence; ELLPACK guarantees regular memory access at the cost of padding; the hybrid approach combines the strengths of both, reducing divergence while keeping padding modest.
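The padding trade-off of ELLPACK can be made concrete by converting a CSR matrix: every row is widened to the length of the longest row, which makes accesses regular but wastes storage on short rows. A minimal sketch (names and padding convention are my own, not the paper's):

```python
def csr_to_ellpack(row_ptr, col_idx, values, pad_col=0):
    """Convert CSR arrays to ELLPACK: pad every row to the length of the
    longest row. Padded entries get value 0, so they contribute nothing
    to the matrix-vector product."""
    n = len(row_ptr) - 1
    width = max(row_ptr[i + 1] - row_ptr[i] for i in range(n))
    ell_cols = [[pad_col] * width for _ in range(n)]
    ell_vals = [[0] * width for _ in range(n)]
    for i in range(n):
        for j, k in enumerate(range(row_ptr[i], row_ptr[i + 1])):
            ell_cols[i][j] = col_idx[k]
            ell_vals[i][j] = values[k]
    return ell_cols, ell_vals

# Same 2x3 matrix as before: A = [[1, 0, 2], [0, 3, 0]]
ell_cols, ell_vals = csr_to_ellpack([0, 2, 3], [0, 2, 1], [1, 2, 3])
print(ell_vals)  # → [[1, 2], [3, 0]]  (second row padded with one zero)
```

The hybrid scheme described above amounts to keeping most rows in an ELLPACK-like layout and diverting the few exceptionally long rows to a CSR-like fallback, so `width` stays small.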
Beyond data layout, the authors restructure memory access patterns. Matrix data are placed in global memory while frequently accessed vector elements are prefetched into shared memory on a per‑warp basis. This hierarchy reduces global‑memory latency and enables coalesced loads. They also reorder index arrays so that each warp processes contiguous rows, further improving alignment.
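One common way to realize the per-warp regularity described above is to slice the matrix into warp-sized groups of contiguous rows and pad each slice only to its own maximum row length, rather than the global maximum. The sketch below is a host-side illustration of that idea (the paper's exact layout may differ; `slice_rows` is a hypothetical helper and the warp size of 32 is the usual NVIDIA value):

```python
WARP = 32  # threads per warp on NVIDIA GPUs

def slice_rows(row_lengths, warp=WARP):
    """Group contiguous rows into warp-sized slices.

    Each slice is padded only to the longest row *within that slice*,
    so all lanes of a warp execute the same number of iterations
    (no divergence) without paying global-maximum padding.
    Returns a list of (first_row, last_row_exclusive, padded_width).
    """
    slices = []
    for start in range(0, len(row_lengths), warp):
        chunk = row_lengths[start:start + warp]
        slices.append((start, start + len(chunk), max(chunk)))
    return slices

# With warp=2 for readability: rows of lengths 1,3,2,2,5
print(slice_rows([1, 3, 2, 2, 5], warp=2))
# → [(0, 2, 3), (2, 4, 2), (4, 5, 5)]
```

Keeping rows of a warp contiguous also means their index arrays can be stored interleaved, so consecutive lanes load consecutive addresses, which is what makes the loads coalesced.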
A major contribution is the use of the Residue Number System (RNS) to accelerate modular arithmetic. The large prime modulus p (the group order) is decomposed into several small, pairwise‑coprime residues (e.g., four 16‑bit moduli for a 64‑bit p). All modular additions and multiplications are then performed as independent integer operations on these residues, eliminating carry propagation and allowing the GPU’s fast 32‑bit integer pipelines to be fully utilized. The implementation leverages CUDA’s warp‑shuffle intrinsics to efficiently combine partial results across lanes.
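The RNS idea can be illustrated in Python: an integer is represented by its residues modulo several pairwise-coprime moduli, additions and multiplications act on each residue lane independently (no carries between lanes), and the Chinese Remainder Theorem recovers the integer. The small prime moduli below are illustrative, not the paper's bases, and the full reduction modulo the large prime p inside RNS (base extension, etc.) is more involved and omitted here:

```python
from math import prod

def to_rns(x, moduli):
    return [x % m for m in moduli]

def rns_add(a, b, moduli):
    # lane-wise sums; no carry crosses between lanes
    return [(ai + bi) % m for ai, bi, m in zip(a, b, moduli)]

def rns_mul(a, b, moduli):
    # lane-wise products; each fits in a machine word if moduli are small
    return [(ai * bi) % m for ai, bi, m in zip(a, b, moduli)]

def from_rns(residues, moduli):
    """Chinese Remainder Theorem reconstruction (valid mod prod(moduli))."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)  # pow(..., -1, m): Python 3.8+ inverse
    return x % M

moduli = [251, 241, 239, 233]       # pairwise-coprime, illustrative choice
a, b = to_rns(1234, moduli), to_rns(5678, moduli)
print(from_rns(rns_mul(a, b, moduli), moduli))  # → 7006652 == 1234 * 5678
```

Because each lane is an independent small-word operation, the lanes map naturally onto GPU integer units, which is what removes the carry chains that serialize large-integer arithmetic.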
Performance evaluation on V100 and RTX 3090 GPUs shows speed‑ups of 12× to 18× over the best CPU‑based CSR implementation, with memory consumption reduced by roughly 30%. When integrated into the Function Field Sieve (FFS) pipeline, the accelerated SpMV reduces the total time to solve DLP instances in GF(2^619) and GF(2^809) by 30% and 25% respectively, directly contributing to the first successful GPU‑accelerated attacks at these bit lengths.
The authors discuss scalability to larger fields (>2000 bits) and portability to other GPU architectures (e.g., AMD Instinct) via OpenCL. Future work includes exploring more advanced compression formats such as CSR5, dynamic warp scheduling, and automated selection of optimal RNS bases for arbitrary moduli. They also propose extending the technique to other index‑calculus methods like the Number Field Sieve, thereby broadening its impact on cryptanalytic practice.
In summary, by carefully aligning data structures, optimizing memory traffic, and replacing costly modular reductions with RNS‑based integer arithmetic, the paper demonstrates that GPU‑accelerated SpMV can dramatically speed up the linear algebra phase of DLP attacks, pushing the practical limits of discrete logarithm computations forward.