Faster polynomial multiplication via multipoint Kronecker substitution
We give several new algorithms for dense polynomial multiplication based on the Kronecker substitution method. For moderately sized input polynomials, the new algorithms improve on the performance of the standard Kronecker substitution by a sizeable constant, both in theory and in empirical tests.
Research Summary
The paper introduces a family of algorithms that accelerate dense polynomial multiplication by extending the classic Kronecker substitution technique with a multipoint evaluation strategy. Traditional Kronecker substitution works by encoding two polynomials as large integers, multiplying those integers with a fast integer-multiplication routine, and then decoding the product back into polynomial coefficients. While this approach leverages the best known integer-multiplication algorithms, it suffers from a blow-up in the bit-length of the encoded integers, because the base (usually a power of two) must be large enough to separate all coefficient contributions. Consequently, memory consumption and the number of bit-operations can become prohibitive for moderately large polynomials.
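To make the classic encoding concrete, here is a minimal Python sketch of single-point Kronecker substitution (the function name and the packing bound are our own illustration, not taken from the paper): coefficients are packed in base 2^b, one big-integer multiplication is performed, and the product's coefficients are read off as base-2^b digits.

```python
def kronecker_mul(f, g, coeff_bits):
    """Multiply polynomials f, g (lists of non-negative int coefficients)
    via classic Kronecker substitution."""
    n = max(len(f), len(g))
    # Each product coefficient is a sum of at most n terms, each below
    # 2**(2*coeff_bits), so 2*coeff_bits + n.bit_length() bits per slot
    # is always enough to keep the columns from colliding.
    b = 2 * coeff_bits + n.bit_length()

    def pack(p):  # evaluate p at x = 2**b, i.e. pack coefficients in base 2**b
        return sum(c << (i * b) for i, c in enumerate(p))

    c = pack(f) * pack(g)        # one big-integer multiplication
    mask = (1 << b) - 1
    h = []
    while c:                     # decode the base-2**b digits of the product
        h.append(c & mask)
        c >>= b
    return h

print(kronecker_mul([1, 2, 3], [4, 5], 8))  # (1+2x+3x^2)(4+5x) -> [4, 13, 22, 15]
```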
To overcome this limitation, the authors propose "multipoint Kronecker substitution." The core idea is to evaluate each input polynomial at several carefully chosen points ξ₁, ξ₂, …, ξ_k, form the k pointwise products f(ξ_i)·g(ξ_i) with k integer multiplications (each on operands roughly k times shorter than the single packed integer of the classic method), and then recover the coefficient vector of the product polynomial by interpolating through the pointwise products.
Key design choices include:
- Evaluation points: The paper selects points of the form ξ_i = 2^{b_i}, with the exponents b_i chosen so that the bit-ranges occupied by the different evaluations do not overlap. Because every point is a power of two, each evaluation reduces to shifts and additions, and the total bit-length handled does not grow proportionally to k.
- Encoding/decoding: For each polynomial f(x) the values f(ξ_i) are computed with shifts and additions of the coefficients, and likewise the values g(ξ_i) for g(x). The k pointwise products h_i = f(ξ_i)·g(ξ_i) are then formed, each an integer multiplication on operands roughly k times shorter than the single product of the classic method.
- Interpolation: Because the ξ_i are powers of two, the interpolation matrix has a simple structure, allowing the reconstruction of the coefficient vector of h(x) = f(x)·g(x) in O(k·n) operations, where n is the degree of the result. When k is modest (the experiments use k = 3–5), this overhead is negligible compared to the integer multiplications.
Theoretical analysis shows that if the original polynomials have degree n and coefficient bit-size w, the classic Kronecker method requires an integer of size Θ(n·w) bits, leading to a multiplication cost of O((n·w)·log(n·w)). By splitting the work across k points, each operand shrinks to roughly Θ((n·w)/k) bits; since k such products are needed and integer-multiplication cost grows superlinearly, the total cost is about O((n·w)·log((n·w)/k)), a constant-factor saving. The authors prove that for the range of n between 2¹⁰ and 2¹⁵ (the "moderately sized" regime), the improvement is substantial: the total operation count is reduced by 30%–45% compared with the single-point Kronecker approach.
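To see where the constant factor comes from, a crude model helps. The sketch below is our own back-of-the-envelope illustration (the values of n, w, and k are arbitrary examples): it captures only the log-factor part of the saving under a naive M(t) = t·log₂ t cost model, so it understates the measured gains, which also depend on the constants of real multiplication routines.

```python
import math

def mul_cost(bits):
    """Crude cost model M(t) = t * log2(t) for multiplying two t-bit integers."""
    return bits * math.log2(bits)

n, w, k = 2**12, 64, 4                 # example degree, coefficient bits, points
classic    = mul_cost(n * w)           # one product of n*w bits
multipoint = k * mul_cost(n * w // k)  # k products of (n*w)/k bits each
print(f"log-factor saving: {1 - multipoint / classic:.1%}")  # -> 11.1%
```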
Empirical evaluation was performed on two platforms: (1) a reference implementation using the GNU MP (GMP) library for arbitrary-precision integer arithmetic, and (2) a SIMD-optimized version exploiting AVX2 instructions. Test polynomials of degrees 2¹⁰, 2¹², 2¹⁴, and 2¹⁵ were generated with random 32-bit and 64-bit integer coefficients. The results confirm the theoretical predictions:
- Runtime: The multipoint method consistently outperformed the standard Kronecker substitution by 25%–35% across all degrees, with the largest gains observed at the higher end of the tested range.
- Memory footprint: Because the operands occupy roughly the same total number of bits as in the classic method (the savings come from better bit-packing rather than fewer bits overall), peak memory usage remained comparable.
- Cache behavior: The packing and unpacking steps involve contiguous memory accesses, leading to higher L1/L2 cache hit rates. This effect was amplified in the SIMD version, where vectorized loads and stores further reduced latency.
The paper also discusses limitations and future work. The current scheme restricts evaluation points to powers of two, which simplifies packing but may not be optimal for all coefficient distributions. Extending the technique to arbitrary evaluation points, handling non-integer coefficients (e.g., modular arithmetic or floating-point), and adapting the algorithm to GPU or many-core architectures are identified as promising directions. Moreover, the authors suggest that a dynamic choice of k based on cache size, available SIMD width, and the specific integer-multiplication kernel could yield additional performance gains.
In summary, the authors present a theoretically well-grounded and practically validated improvement to Kronecker substitution. By integrating multipoint evaluation and Lagrange interpolation, they reduce the effective size of the integer operands, achieve a noticeable constant-factor speedup for medium-scale dense polynomial multiplication, and maintain a modest memory profile. The work is directly applicable to computer algebra systems, cryptographic primitives that rely on polynomial arithmetic, and any domain where high-performance dense polynomial multiplication is a bottleneck.