Analysis of RSA algorithm using GPU programming

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Modern computer security relies heavily on cryptography to protect the data on which we have become increasingly reliant. The main research in the computer security domain is how to increase the speed of the RSA algorithm. The computing capability of the Graphics Processing Unit (GPU), used as a co-processor of the CPU, can be leveraged for massive parallelism. This paper presents a novel algorithm for calculating modulo values that can process large powers of numbers which are otherwise not supported by built-in data types. First, the traditional algorithm is studied. Second, the parallelized RSA algorithm is designed using the CUDA framework. Third, the designed algorithm is realized for both small and large prime numbers. As a result, the fundamental problems of the RSA algorithm, namely its speed and the use of poor or small prime numbers, which has led to significant security holes despite the algorithm's mathematical soundness, can be alleviated by this algorithm.

💡 Research Summary

The paper addresses the well‑known performance bottleneck of RSA encryption and decryption, which stems from the need to compute modular exponentiation with very large integers. While many prior works have focused on mathematical optimizations (e.g., Montgomery multiplication, Chinese Remainder Theorem) or on dedicated hardware such as FPGAs and ASICs, this study investigates how a commodity graphics processing unit (GPU) can be leveraged as a co‑processor to accelerate RSA operations using the CUDA programming model.

The authors first review the classic square‑and‑multiply algorithm and its limitations on a CPU, especially the inability of native data types to hold numbers larger than 64 bits. To overcome this, they design a custom multi‑precision representation that splits a large integer into 32‑bit words. Each word is stored in GPU global memory, while intermediate results are kept in shared memory and registers to minimize latency. The core contribution is a parallel modular exponentiation kernel: the exponentiation loop is unrolled across CUDA threads, each thread handling a subset of the squaring and multiplication steps. Montgomery reduction is adapted to the GPU context so that after each multiplication the result is immediately reduced modulo the RSA modulus, eliminating the need for costly division operations and reducing memory traffic.
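To make the two ideas in this paragraph concrete, the following is a minimal sequential sketch in Python, not the authors' CUDA kernel. It assumes an odd modulus and a Montgomery radix R = 2^(32k), where k is the number of 32-bit words in the modulus. All function names (`to_limbs`, `from_limbs`, `mont_mul`, `mont_pow`) are hypothetical, and Python's built-in big integers stand in for the word-level arithmetic a GPU implementation would have to perform explicitly.

```python
def to_limbs(x, k):
    """Split a non-negative integer into k little-endian 32-bit words."""
    return [(x >> (32 * i)) & 0xFFFFFFFF for i in range(k)]


def from_limbs(limbs):
    """Reassemble an integer from little-endian 32-bit words."""
    return sum(w << (32 * i) for i, w in enumerate(limbs))


def montgomery_setup(n):
    """Precompute Montgomery constants for an odd modulus n (assumption: n is odd)."""
    k = (n.bit_length() + 31) // 32      # number of 32-bit words in n
    r_bits = 32 * k                      # R = 2**r_bits
    r = 1 << r_bits
    n_prime = (-pow(n, -1, r)) % r       # n * n_prime == -1 (mod R); needs Python 3.8+
    return r_bits, n_prime


def mont_mul(a, b, n, r_bits, n_prime):
    """Return a * b * R^-1 mod n using only shifts and masks, no division by n."""
    mask = (1 << r_bits) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask    # chosen so that t + m*n is divisible by R
    u = (t + m * n) >> r_bits
    return u - n if u >= n else u        # single conditional subtraction


def mont_pow(base, exp, n):
    """Left-to-right square-and-multiply with Montgomery reduction after each product."""
    r_bits, n_prime = montgomery_setup(n)
    r = 1 << r_bits
    x = (base % n) * r % n               # base converted to Montgomery form
    acc = r % n                          # the value 1 in Montgomery form
    for bit in bin(exp)[2:]:
        acc = mont_mul(acc, acc, n, r_bits, n_prime)      # square
        if bit == "1":
            acc = mont_mul(acc, x, n, r_bits, n_prime)    # multiply
    return mont_mul(acc, 1, n, r_bits, n_prime)           # convert back


# Toy RSA parameters (p = 61, q = 53), far too small for real use.
n, e, d = 3233, 17, 2753
cipher = mont_pow(65, e, n)
assert mont_pow(cipher, d, n) == 65
```

The conditional subtraction at the end of `mont_mul` is data dependent, so a sketch like this is not constant time. The squarings in the loop also depend on one another, which is why an implementation would more plausibly parallelize the word-level work inside each multiplication or process many messages at once.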

Two experimental scenarios are presented. In the first, the authors use small primes (resulting in 512‑bit RSA keys) to validate correctness and to compare raw speed against a single‑core CPU implementation. The GPU version achieves roughly a ten‑fold speedup for both encryption and decryption. In the second scenario, they scale to larger primes, testing 1024‑bit and 2048‑bit keys. Here the speedup drops to a more modest 3–5×, but the absolute execution times are still dramatically lower than the CPU baseline, especially for the encryption phase where the exponent is public and can be pre‑processed. The authors argue that this performance gain makes it feasible to employ larger moduli in practice, thereby mitigating the security risks associated with using small primes.

Despite these promising results, the paper has several notable shortcomings. First, the experimental methodology lacks detailed hardware specifications (GPU model, memory bandwidth, clock rates) and does not report standard deviation or confidence intervals, making reproducibility difficult. Second, the implementation appears to ignore the overhead of carry propagation across word boundaries, a critical issue in multi‑precision arithmetic that can affect both correctness and timing consistency. Third, the memory footprint grows linearly with the number of words, which may exceed the shared memory limits of current GPUs for keys larger than 4096 bits. Finally, the security discussion is superficial: while larger keys are indeed more resistant to factorization attacks, the paper does not address side‑channel vulnerabilities (e.g., timing or power analysis) that can be exacerbated by irregular GPU execution patterns.
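The carry-propagation concern can be illustrated with a small sketch of schoolbook addition and multiplication on little-endian 32-bit words. This is a sequential Python illustration under assumed names (`limbs_add`, `limbs_mul`, `MASK32`), not code from the paper. It shows that each word's result depends on the carry out of the word below it, which is the dependency a parallel kernel has to resolve.

```python
MASK32 = 0xFFFFFFFF


def limbs_add(a, b):
    """Add two equal-length little-endian limb lists; the result has one extra limb."""
    out = []
    carry = 0
    for x, y in zip(a, b):
        s = x + y + carry
        out.append(s & MASK32)   # low 32 bits stay in this word
        carry = s >> 32          # overflow moves to the next word
    out.append(carry)
    return out


def limbs_mul(a, b):
    """Schoolbook product of two limb lists; the result has len(a) + len(b) limbs."""
    out = [0] * (len(a) + len(b))
    for i, x in enumerate(a):
        carry = 0
        for j, y in enumerate(b):
            t = out[i + j] + x * y + carry   # at most 2**64 - 1, fits in 64 bits
            out[i + j] = t & MASK32
            carry = t >> 32
        out[i + len(b)] = carry              # top word of this row
    return out


# A carry that ripples through every word: (2**64 - 1) + 1 == 2**64
print(limbs_add([0xFFFFFFFF, 0xFFFFFFFF], [1, 0]))  # prints [0, 0, 1]
```

In the worst case shown in the last line, a carry generated in the lowest word travels through every higher word. If each word were assigned to its own thread with no carry-resolution step, the result would be wrong, and if the carries were resolved sequentially, the running time would depend on the operands.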

The authors conclude by outlining future work: integrating more sophisticated carry‑handling schemes, exploring hierarchical memory layouts to support ultra‑large keys, and adding countermeasures against side‑channel attacks. Overall, the study provides a valuable proof‑of‑concept that CUDA‑enabled GPUs can substantially accelerate RSA modular exponentiation, offering a practical path toward higher‑throughput cryptographic services without sacrificing the mathematical security guarantees of RSA.
