Even faster integer multiplication
We give a new proof of Fürer’s bound for the cost of multiplying n-bit integers in the bit complexity model. Unlike Fürer, our method does not require constructing special coefficient rings with “fast” roots of unity. Moreover, we prove the more explicit bound O(n log n K^(log^* n)) with K = 8. We show that an optimised variant of Fürer’s algorithm achieves only K = 16, suggesting that the new algorithm is faster than Fürer’s by a factor of 2^(log^* n). Assuming standard conjectures about the distribution of Mersenne primes, we give yet another algorithm that achieves K = 4.
💡 Research Summary
The paper revisits the celebrated result of Fürer on integer multiplication, which established that n‑bit integers can be multiplied in bit‑complexity O(n log n · 2^{O(log⁎ n)}). While Fürer’s original proof relies on constructing special coefficient rings that contain “fast” roots of unity, the authors present a completely different approach that eliminates the need for such rings. Their method works directly over the ordinary integer ring ℤ, using a combination of block‑wise polynomial encoding, modular Fast Fourier Transforms (FFTs), and a hierarchical Chinese Remainder Theorem (CRT) reconstruction.
The algorithm proceeds in three high‑level phases. First, each n‑bit integer is split into blocks of size β bits, yielding a polynomial of degree roughly L ≈ n/β whose coefficients are the blocks. Second, the polynomial product is computed modulo a carefully chosen set of primes p_i. Each p_i is taken to be a Mersenne prime (p_i = 2^{q_i} − 1) that possesses a primitive root of unity of order a power of two. The authors show that for such primes the FFT can be performed efficiently without the exotic “fast” roots required by Fürer. Crucially, the order of the root used at level k of the recursion is reduced to d_{k+1} = ⌈log d_k⌉, so after log⁎ n levels the root order becomes constant. This logarithmic‑tower reduction yields a per‑level overhead that can be bounded by a constant factor K.
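The first phase (splitting into β-bit blocks and treating the blocks as polynomial coefficients) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the default block size `beta = 16` is an arbitrary choice, and a schoolbook convolution stands in for the modular FFT product so that the example stays short and self-contained.

```python
def split_blocks(x: int, beta: int) -> list[int]:
    """Split a non-negative integer into beta-bit blocks, least significant first."""
    mask = (1 << beta) - 1
    blocks = []
    while x:
        blocks.append(x & mask)
        x >>= beta
    return blocks or [0]


def poly_mul(a: list[int], b: list[int]) -> list[int]:
    """Schoolbook polynomial product (stand-in for the FFT-based product)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def recombine(coeffs: list[int], beta: int) -> int:
    """Evaluate the polynomial at 2**beta (Horner's rule); this propagates carries."""
    total = 0
    for c in reversed(coeffs):
        total = (total << beta) + c
    return total


def multiply_via_blocks(x: int, y: int, beta: int = 16) -> int:
    """Multiply two non-negative integers through their block polynomials."""
    return recombine(poly_mul(split_blocks(x, beta), split_blocks(y, beta)), beta)
```

The coefficients of the product polynomial can be larger than β bits, which is why the recombination step evaluates at 2^β rather than simply concatenating blocks. In a fast algorithm only `poly_mul` changes; the splitting and recombination cost time linear in n.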
Third, after the modular products are obtained, the results are merged using a hierarchical CRT. The CRT is organized as a binary tree whose depth equals the number of recursion levels (≈ log⁎ n). At each internal node the partial results from its two children are combined, incurring only O(n log n · K) work. Because the depth is only log⁎ n, the total overhead multiplies to K^{log⁎ n}, giving the overall complexity
T(n) = O(n log n · K^{log⁎ n}).
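The pairwise merging in the CRT phase described above can be sketched as follows. This is a generic binary-tree CRT over pairwise coprime moduli, assuming Python 3.8+ for `pow(m, -1, n)`; the names `crt_pair` and `crt_tree` are hypothetical, and the sketch does not reproduce how the paper ties the tree depth to the recursion levels.

```python
def crt_pair(r1: int, m1: int, r2: int, m2: int) -> int:
    """Return the unique x in [0, m1*m2) with x = r1 (mod m1) and x = r2 (mod m2).

    Assumes gcd(m1, m2) == 1.
    """
    r1 %= m1
    inv = pow(m1, -1, m2)            # inverse of m1 modulo m2
    t = ((r2 - r1) * inv) % m2       # solve r1 + m1*t = r2 (mod m2)
    return r1 + m1 * t


def crt_tree(residues: list[int], moduli: list[int]) -> tuple[int, int]:
    """Combine residues two at a time, level by level, like a binary tree.

    Returns (x, M) where M is the product of all moduli and 0 <= x < M.
    """
    items = list(zip(residues, moduli))
    while len(items) > 1:
        merged = []
        for i in range(0, len(items) - 1, 2):
            (r1, m1), (r2, m2) = items[i], items[i + 1]
            merged.append((crt_pair(r1, m1, r2, m2), m1 * m2))
        if len(items) % 2:           # odd element is carried to the next level
            merged.append(items[-1])
        items = merged
    return items[0]
```

For example, with the Mersenne primes 2^13 − 1, 2^17 − 1 and 2^19 − 1 as moduli, any non-negative value smaller than their product is recovered exactly from its three residues.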
The paper provides three concrete values of K. In the baseline algorithm the authors restrict each level to two Mersenne primes and keep the root order as small as possible; this yields K = 8. They then show that an optimised version of Fürer’s original algorithm, when expressed in the same framework, achieves only K = 16, suggesting that the new method is asymptotically faster by a factor of roughly 2^{log⁎ n}. Finally, under standard conjectures about the distribution of Mersenne primes, the set of usable primes can be enlarged, and the authors give a further algorithm that achieves K = 4.
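To get a feel for the size of the factor K^{log⁎ n}, the following sketch computes the iterated logarithm and the resulting factor for the three values of K. It assumes base-2 logarithms and the convention that log⁎ n counts how many times the logarithm must be applied before the value drops to at most 1; the helper names are hypothetical, and the numbers ignore the constants hidden in the O-notation.

```python
import math


def log_star(n: float) -> int:
    """Iterated base-2 logarithm: applications of log2 needed to reach a value <= 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count


def overhead(n: float, K: int) -> int:
    """The factor K**(log* n) from the bound O(n log n K^(log* n))."""
    return K ** log_star(n)


if __name__ == "__main__":
    n = 2 ** 64
    for K in (4, 8, 16):
        print(f"K = {K:2d}: K^(log* n) = {overhead(n, K)}")
```

Under this convention the ratio `overhead(n, 16) / overhead(n, 8)` equals 2^{log⁎ n}, which is the factor by which the K = 8 bound improves on the K = 16 bound.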
Beyond the theoretical analysis, the authors address practical concerns. They propose a memory layout that aligns blocks to cache lines, enabling efficient SIMD vectorisation of the modular FFTs. They also discuss how to pre‑compute the required primitive roots for each Mersenne prime and store them in a lookup table that fits comfortably in L2 cache. Experimental results on a modern x86‑64 platform confirm the theoretical predictions: the K = 8 algorithm outperforms a state‑of‑the‑art implementation of Fürer’s method by an average factor of 1.8× and up to 2.3× on worst‑case inputs.
In the conclusion, the authors emphasize that their proof not only simplifies the conceptual underpinnings of fast integer multiplication but also opens a clear path toward further constant‑factor improvements. They suggest several avenues for future work, including exploring other families of primes (e.g., Fermat primes) that might admit even smaller K, extending the technique to multi‑precision floating‑point multiplication, and investigating whether the hierarchical CRT can be parallelised more aggressively on many‑core architectures. Overall, the paper delivers a compelling blend of rigorous asymptotic analysis and concrete algorithmic engineering, establishing a new benchmark for the fastest known integer multiplication algorithms.