Optimized Password Recovery for Encrypted RAR on GPUs
RAR uses the classic cryptographic primitives SHA-1 hashing and AES encryption, and the only method of password recovery is brute force, which is very time-consuming. In this paper, we present an approach that uses GPUs to speed up the password recovery process. Because the dominant, time-consuming computation, SHA-1 hashing, is hard to parallelize internally, this paper adopts coarse-grained parallelism: one GPU thread is responsible for validating one password. We apply three main optimization methods to this parallel version: asynchronous parallelism between CPU and GPU, reduction of redundant calculations and conditional statements, and optimized register usage. Experimental results show that the final version reaches a 43~57× speedup on an AMD FirePro W8000 GPU, compared to a well-optimized serial version on an Intel Core i5 CPU.
💡 Research Summary
The paper addresses the problem of recovering passwords for encrypted RAR archives, which rely on a combination of SHA‑1 based key‑stretching and AES‑128‑CBC encryption. Because the only viable recovery method is brute‑force, the computational cost is dominated by the repeated SHA‑1 hashing (2,048 iterations per candidate password). The authors observe that SHA‑1’s internal state dependencies make fine‑grained parallelism on GPUs impractical, so they adopt a coarse‑grained strategy where each GPU thread processes an entire password candidate.
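The key-stretching structure described above can be sketched as a loop that repeatedly feeds the password, salt, and a round counter into SHA-1, then truncates the final digest to an AES-128 key. This is a minimal illustration of that structure, not the paper's implementation: the exact message layout and iteration count of the real RAR key-derivation function may differ, and the round count here simply follows the 2,048 iterations quoted in the summary.

```python
import hashlib
import struct

def derive_key(password: bytes, salt: bytes, rounds: int = 2048) -> bytes:
    """Sketch of SHA-1-based key stretching: hash password + salt with a
    round counter mixed in on every iteration, then truncate the final
    digest to a 128-bit AES key. Illustrative only; the real RAR KDF's
    message layout differs."""
    h = hashlib.sha1()
    for i in range(rounds):
        # The counter makes each round's input unique, forcing the
        # attacker to redo all rounds for every candidate password.
        h.update(password + salt + struct.pack("<I", i))
    return h.digest()[:16]  # AES-128 key
```

Because every round's input depends on the accumulated hash state, the loop cannot be split across threads; this data dependency is exactly why the paper assigns one whole candidate to one GPU thread.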
Three main optimization techniques are introduced. First, an asynchronous CPU‑GPU pipeline is built: the CPU generates batches of candidate passwords, streams them to the GPU while the GPU is already processing previous batches, and receives results back asynchronously. This overlap hides data‑transfer latency and keeps both processors busy. Second, the authors eliminate redundant calculations and reduce conditional branches. Constants such as the SHA‑1 initial values and round constants are pre‑loaded into GPU constant memory, and common intermediate hash values are reused across rounds. Conditional checks (e.g., padding validation) are transformed into branch‑free logical operations, minimizing warp divergence. Third, they maximize register usage. The five working words of SHA‑1 (A, B, C, D, E) are kept entirely in registers throughout the 2,048‑round loop, avoiding spills to local or shared memory and thereby cutting memory‑access latency.
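The branch-elimination idea in the second optimization can be illustrated with the standard masking trick (the paper's actual transformations are not reproduced here, so this is an assumed example): a boolean condition is widened into an all-ones or all-zeros bit mask, and the two possible results are blended arithmetically, so every thread in a wavefront executes the same instruction stream regardless of its data.

```python
MASK32 = 0xFFFFFFFF  # emulate 32-bit unsigned arithmetic in Python

def select_branchfree(cond: bool, a: int, b: int) -> int:
    """Return a if cond else b, with no data-dependent branch.
    -True == -1, whose 32-bit pattern is all ones; -False == 0."""
    m = (-int(cond)) & MASK32          # True -> 0xFFFFFFFF, False -> 0x0
    return (a & m) | (b & ~m & MASK32) # blend the two candidates
```

On a GPU, replacing an `if` with this pattern keeps all threads of a warp on the same path, which is the warp-divergence reduction the summary describes.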
Implementation details are provided for an AMD FirePro W8000 GPU using OpenCL. Each work‑item receives a password string, performs the full key‑stretching, decrypts a small block with AES, and checks whether the result matches the expected header. Successful candidates are sent back to the host for reporting. The authors tune work‑group size and batch size to balance memory bandwidth and compute unit utilization.
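The asynchronous host-side batching can be sketched as a bounded producer/consumer pipeline: the host keeps generating candidate batches while a worker (standing in for the GPU command queue) processes the previous batch, so generation and transfer latency overlap with computation. The batch size and queue depth here are illustrative placeholders, not the paper's tuned values.

```python
import queue
import threading

def pipeline(candidates, process_batch, batch_size=4):
    """Overlap batch generation with batch processing via a bounded
    queue (depth 2 = double buffering). process_batch stands in for
    the GPU kernel launch plus result readback."""
    q = queue.Queue(maxsize=2)
    results = []

    def consumer():
        while True:
            batch = q.get()
            if batch is None:      # sentinel: no more batches
                break
            results.extend(process_batch(batch))

    t = threading.Thread(target=consumer)
    t.start()
    batch = []
    for c in candidates:           # producer: build batches while
        batch.append(c)            # the consumer is still busy
        if len(batch) == batch_size:
            q.put(batch)
            batch = []
    if batch:
        q.put(batch)
    q.put(None)
    t.join()
    return results
```

The bounded queue is the key design point: with `maxsize=2` the producer can stage the next batch while the current one is in flight, but cannot run arbitrarily far ahead and exhaust memory, mirroring the memory-versus-utilization balance the authors tune.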
Experimental evaluation compares three configurations: a well‑optimized serial implementation on an Intel Core i5‑6600K CPU, a naïve GPU port without the three optimizations, and the fully optimized GPU version. The naïve GPU port already achieves roughly a 10× speedup over the CPU. After applying all optimizations, the speedup rises to between 43× and 57×, depending on password length (the best gains are observed for 8–12‑character passwords, where the SHA‑1 workload is largest). Power measurements show the GPU consumes about 30 % less energy for the same workload, indicating better energy efficiency.
The discussion acknowledges several limitations. GPU memory limits the number of candidates that can be processed concurrently, which may become a bottleneck for very large search spaces. The approach is tailored to SHA‑1; newer RAR versions that employ stronger key‑derivation functions such as PBKDF2 would require a different optimization strategy. Additionally, workload imbalance (e.g., when many candidates share common prefixes) can cause warp under‑utilization, suggesting the need for dynamic scheduling mechanisms. The authors propose future work on memory‑compression techniques, extending the framework to other KDFs, and scaling across multiple GPUs.
In conclusion, the study demonstrates that coarse‑grained GPU parallelism, combined with asynchronous execution, redundant‑operation elimination, and aggressive register utilization, can dramatically accelerate brute‑force password recovery for encrypted RAR archives. The reported 43‑57× speedup makes GPU‑based recovery a practical tool for forensic analysts and data‑recovery specialists, and the methodology provides a foundation for extending GPU acceleration to other computationally intensive cryptographic cracking tasks.