Faster polynomial multiplication via multipoint Kronecker substitution
We give several new algorithms for dense polynomial multiplication based on the Kronecker substitution method. For moderately sized input polynomials, the new algorithms improve on the performance of the standard Kronecker substitution by a sizeable constant, both in theory and in empirical tests.
Research Summary
The paper introduces a family of algorithms that accelerate dense polynomial multiplication by extending the classic Kronecker substitution technique with a multipoint evaluation strategy. Traditional Kronecker substitution works by encoding two polynomials as large integers, multiplying those integers with a fast integer-multiplication routine, and then decoding the product back into polynomial coefficients. While this approach leverages the best known integer-multiplication algorithms, it suffers from a blow-up in the bit-length of the encoded integers, because the base (usually a power of two) must be large enough to separate all coefficient contributions. Consequently, memory consumption and the number of bit-operations can become prohibitive for moderately large polynomials.
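To make the classic encoding concrete, here is a minimal Python sketch of single-point Kronecker substitution (the function name and the packing bound are our own illustration, not taken from the paper): coefficients are packed in base 2^b, one big-integer multiplication is performed, and the product's coefficients are read off as base-2^b digits.

```python
def kronecker_mul(f, g, coeff_bits):
    """Multiply polynomials f, g (lists of non-negative int coefficients)
    via classic Kronecker substitution."""
    n = max(len(f), len(g))
    # Each product coefficient is a sum of at most n terms, each below
    # 2**(2*coeff_bits), so 2*coeff_bits + n.bit_length() bits per slot
    # is always enough to keep the columns from colliding.
    b = 2 * coeff_bits + n.bit_length()

    def pack(p):  # evaluate p at x = 2**b, i.e. pack coefficients in base 2**b
        return sum(c << (i * b) for i, c in enumerate(p))

    c = pack(f) * pack(g)        # one big-integer multiplication
    mask = (1 << b) - 1
    h = []
    while c:                     # decode the base-2**b digits of the product
        h.append(c & mask)
        c >>= b
    return h

print(kronecker_mul([1, 2, 3], [4, 5], 8))  # (1+2x+3x^2)(4+5x) -> [4, 13, 22, 15]
```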
To overcome this limitation, the authors propose "multipoint Kronecker substitution." The core idea is to evaluate each input polynomial at several carefully chosen points ξ₁, ξ₂, …, ξ_k, form the k pointwise products f(ξ_i)·g(ξ_i) with k integer multiplications (each on operands roughly k times shorter than the single packed integer of the classic method), and then recover the coefficient vector of the product polynomial by interpolating through the pointwise products.
Key design choices include:
- Evaluation points: The paper selects points of the form ξ_i = 2^{b_i}, with the exponents b_i chosen so that the bit-ranges occupied by the different evaluations do not overlap. Because every point is a power of two, each evaluation reduces to shifts and additions, and the total bit-length handled does not grow proportionally to k.
- Encoding/decoding: For each polynomial f(x) the values f(ξ_i) are computed with shifts and additions of the coefficients, and likewise the values g(ξ_i) for g(x). The k pointwise products h_i = f(ξ_i)·g(ξ_i) are then formed, each an integer multiplication on operands roughly k times shorter than the single product of the classic method.
- Interpolation: Because the ξ_i are powers of two, the interpolation matrix has a simple structure, allowing the reconstruction of the coefficient vector of h(x) = f(x)·g(x) in O(k·n) operations, where n is the degree of the result. When k is modest (the experiments use k = 3–5), this overhead is negligible compared to the integer multiplications.
Theoretical analysis shows that if the original polynomials have degree n and coefficient bit-size w, the classic Kronecker method requires an integer of size Θ(n·w) bits, leading to a multiplication cost of O((n·w)·log(n·w)). By splitting the work across k points, each operand shrinks to roughly Θ((n·w)/k) bits; since k such products are needed and integer-multiplication cost grows superlinearly, the total cost is about O((n·w)·log((n·w)/k)), a constant-factor saving. The authors prove that for the range of n between 2¹⁰ and 2¹⁵ (the "moderately sized" regime), the improvement is substantial: the total operation count is reduced by 30%–45% compared with the single-point Kronecker approach.
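To see where the constant factor comes from, a crude model helps. The sketch below is our own back-of-the-envelope illustration (the values of n, w, and k are arbitrary examples): it captures only the log-factor part of the saving under a naive M(t) = t·log₂ t cost model, so it understates the measured gains, which also depend on the constants of real multiplication routines.

```python
import math

def mul_cost(bits):
    """Crude cost model M(t) = t * log2(t) for multiplying two t-bit integers."""
    return bits * math.log2(bits)

n, w, k = 2**12, 64, 4                 # example degree, coefficient bits, points
classic    = mul_cost(n * w)           # one product of n*w bits
multipoint = k * mul_cost(n * w // k)  # k products of (n*w)/k bits each
print(f"log-factor saving: {1 - multipoint / classic:.1%}")  # -> 11.1%
```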
Empirical evaluation was performed on two platforms: (1) a reference implementation using the GNU MP (GMP) library for arbitrary-precision integer arithmetic, and (2) a SIMD-optimized version exploiting AVX2 instructions. Test polynomials of degrees 2¹⁰, 2¹², 2¹⁴, and 2¹⁵ were generated with random 32-bit and 64-bit integer coefficients. The results confirm the theoretical predictions:
- Runtime: The multipoint method consistently outperformed the standard Kronecker substitution by 25%–35% across all degrees, with the largest gains observed at the higher end of the tested range.
- Memory footprint: Because the operands occupy roughly the same total number of bits as in the classic method (the savings come from better bit-packing rather than fewer bits overall), peak memory usage remained comparable.
- Cache behavior: The packing and unpacking steps involve contiguous memory accesses, leading to higher L1/L2 cache hit rates. This effect was amplified in the SIMD version, where vectorized loads and stores further reduced latency.
The paper also discusses limitations and future work. The current scheme restricts evaluation points to powers of two, which simplifies packing but may not be optimal for all coefficient distributions. Extending the technique to arbitrary evaluation points, handling non-integer coefficients (e.g., modular arithmetic or floating-point), and adapting the algorithm to GPU or many-core architectures are identified as promising directions. Moreover, the authors suggest that a dynamic choice of k based on cache size, available SIMD width, and the specific integer-multiplication kernel could yield additional performance gains.
In summary, the authors present a theoretically well-grounded and practically validated improvement to Kronecker substitution. By integrating multipoint evaluation and Lagrange interpolation, they reduce the effective size of the integer operands, achieve a noticeable constant-factor speedup for medium-scale dense polynomial multiplication, and maintain a modest memory profile. The work is directly applicable to computer algebra systems, cryptographic primitives that rely on polynomial arithmetic, and any domain where high-performance dense polynomial multiplication is a bottleneck.