We present an algorithm performing a simultaneous modular reduction of several residues. This algorithm is applied to fast modular polynomial multiplication. The idea is to convert the $X$-adic representation of modular polynomials, with $X$ an indeterminate, to a $q$-adic representation where $q$ is an integer larger than the field characteristic. With some control on the sizes involved, it is then possible to perform some of the $q$-adic arithmetic directly with machine integers or floating point numbers. Depending on the number of numerical operations performed, one can then convert back to the $q$-adic or $X$-adic representation and finally reduce the high residues. In this note we present a new version of both conversions: more tabulations and a way to reduce the number of divisions involved in the process. The polynomial multiplication is then applied to arithmetic in small finite field extensions.
Q-adic Transform Revisited
The FFLAS/FFPACK project has demonstrated the usefulness of wrapping cache-aware routines for efficient small finite field linear algebra [4,5].
A conversion between a modular representation of prime fields and, e.g., floating point numbers used exactly is natural: it uses the homomorphism to the integers. For extension fields (isomorphic to polynomials over a prime field), however, such a conversion is not direct. In [4] we proposed to transform the polynomials into a q-adic representation, where q is an integer larger than the field characteristic.
We call this transformation DQT, for Discrete Q-adic Transform; it is a form of Kronecker substitution [7, §8.4]. With some care, in particular on the size of q, it is possible to map the operations in the extension field into the floating point arithmetic realization of this q-adic representation and convert back using an inverse DQT.
In this note we propose some implementation improvements: we use a tabulated discrete logarithm for the DQT and we give a trick to reduce the number of machine divisions involved in the inverse transform. This gives rise to an improved DQT, which we thus call the FQT (Fast Q-adic Transform). The FQT uses a simultaneous reduction of several residues, called REDQ, together with some table lookups.
We therefore recall in section 2 the previous conversion algorithm and discuss in section 3 a floating point implementation of modular reduction. This implementation will be used throughout the paper to get fast reductions. We then present our new simultaneous reduction in section 4 and show in section 5 how a time-memory trade-off can make this reduction very fast. This fast reduction is then applied to modular polynomial multiplication over small prime fields in section 6. It is also applied to small extension field arithmetic and fast matrix multiplication over those fields in section 7.
We follow here the presentation of [4] of the idea of [12]: polynomial arithmetic is performed in a q-adic way, with q a sufficiently large prime or prime power.
Suppose that $a = \sum_{i=0}^{k-1} \alpha_i X^i$ and $b = \sum_{i=0}^{k-1} \beta_i X^i$ are two polynomials in $\mathbb{Z}/p\mathbb{Z}[X]$. One can perform the polynomial multiplication $ab$ via $q$-adic numbers. Indeed, by setting $\tilde{a} = \sum_{i=0}^{k-1} \alpha_i q^i$ and $\tilde{b} = \sum_{i=0}^{k-1} \beta_i q^i$, the product is computed in the following manner (we suppose that $\alpha_i = \beta_i = 0$ for $i > k-1$):
$$\tilde{a}\tilde{b} = \sum_{i=0}^{2k-2} \left( \sum_{j=0}^{i} \alpha_j \beta_{i-j} \right) q^i.$$
Now if $q$ is large enough, the coefficient of $q^i$ will not exceed $q-1$. In this case, it is possible to evaluate $a$ and $b$ as machine numbers (e.g. floating point or machine integers), compute the product of these evaluations, and convert back to polynomials by radix computations (see e.g. [7, Algorithm 9.14]). It then only remains to perform modulo $p$ reductions on every coefficient, as shown in Example 1.
Example 1. For instance, to multiply $a = X+1$ by $b = X+2$ in $\mathbb{Z}/3\mathbb{Z}[X]$ one can use the substitution $X = q = 100$: compute $101 \times 102 = 10302$, use radix conversion to write $10302 = q^2 + 3q + 2$, and reduce modulo 3 to get $a \times b = X^2 + 2$.
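Example 1 can be checked with a short script (a sketch in Python with our own illustrative function names, not the paper's floating point implementation):

```python
# Multiply a = X + 1 by b = X + 2 over Z/3Z via the substitution X = q = 100.
q, p = 100, 3

def to_qadic(coeffs, q):
    """Evaluate a polynomial (coefficient list, lowest degree first) at q."""
    value = 0
    for c in reversed(coeffs):  # Horner's scheme
        value = value * q + c
    return value

a = [1, 1]  # 1 + X
b = [2, 1]  # 2 + X
product = to_qadic(a, q) * to_qadic(b, q)  # 101 * 102 = 10302

# Radix conversion back to coefficients, then reduce each one modulo p.
coeffs = []
while product:
    product, digit = divmod(product, q)
    coeffs.append(digit % p)
print(coeffs)  # 10302 = q^2 + 3q + 2, reduced mod 3: X^2 + 2, i.e. [2, 0, 1]
```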
We call DQT the evaluation at $q$ of polynomials modulo $p$, and inverse DQT the radix conversion of a $q$-adic development followed by a modular reduction, as shown in Algorithm 1.
Algorithm 1 Polynomial multiplication by DQT
Input: two polynomials $v_1$ and $v_2$ in $\mathbb{Z}/p\mathbb{Z}[X]$ of degree less than $k$.
Input: a sufficiently large integer $q$.
Output: $R \in \mathbb{Z}/p\mathbb{Z}[X]$ with $R = v_1 \cdot v_2$.
{Polynomial to $q$-adic conversion}
1: Set $\tilde{v_1}$ and $\tilde{v_2}$ to the floating point evaluations at $q$ of $v_1$ and $v_2$. {Using e.g. Horner's formula}
{One computation}
2: Compute $\tilde{r} = \tilde{v_1} \cdot \tilde{v_2}$.
{Building the solution}
3: Write $\tilde{r} = \sum \tilde{\mu}_i q^i$ by radix conversion and set $R = \sum (\tilde{\mu}_i \bmod p) X^i$.
Depending on the size of $q$, the results can still remain exact, and we obtain the following bounds generalizing those of [7, §8.4]:

Theorem 1. [4] Let $m$ be the number of available mantissa bits within the machine numbers and $n_q$ be the number of polynomial products $v_1 \cdot v_2$ of degree $k$ accumulated before the re-conversion. If
$$q > n_q k (p-1)^2 \quad\text{and}\quad (2k-1)\log_2(q) < m,$$
then Algorithm 1 is correct.
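The whole of Algorithm 1, with $q$ chosen just above the Theorem 1 bound, can be sketched as follows. This is an illustrative Python version using exact integers (the function name `dqt_multiply` and the explicit `k`, `n_q`, `m` parameters are ours); the paper's implementation works with floating point numbers instead, which is why the mantissa condition is mirrored here as a check:

```python
import math

def dqt_multiply(v1, v2, p, k, n_q=1, m=53):
    """Product of two polynomials over Z/pZ via one q-adic multiplication."""
    # Choose q just above the Theorem 1 bound n_q * k * (p-1)^2.
    q = n_q * k * (p - 1) ** 2 + 1
    # Mirror the exactness condition (2k-1) * log2(q) < m of Theorem 1.
    if (2 * k - 1) * math.log2(q) >= m:
        raise ValueError("q too large for the available mantissa bits")

    def eval_at_q(v):
        acc = 0
        for c in reversed(v):  # Horner evaluation at q
            acc = acc * q + c
        return acc

    r = eval_at_q(v1) * eval_at_q(v2)  # the single machine multiplication

    # Radix conversion of the q-adic development, then reduction mod p.
    res = []
    for _ in range(2 * k - 1):
        r, digit = divmod(r, q)
        res.append(digit % p)
    return res

# (X + 1)(X + 2) over Z/3Z, coefficients lowest degree first.
print(dqt_multiply([1, 1], [2, 1], p=3, k=2))  # prints [2, 0, 1], i.e. X^2 + 2
```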
Note that the integer q can be chosen to be a power of 2. Then the Horner-like evaluation (line 1 of Algorithm 1) of the polynomials at q is just a left shift. One can then compute this shift with exponent manipulations in floating point arithmetic, or use the native shift operator (e.g. the << operator in C) as soon as values are within the 32-bit (or 64-bit when available) range.
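With $q = 2^s$ a power of 2, both the evaluation and the radix conversion reduce to shifts and masks. A minimal sketch (the names `eval_shift` and `s` are ours; $q = 2^4 = 16$ satisfies the Theorem 1 bound $q > 1 \cdot 2 \cdot (3-1)^2 = 8$ for one product of polynomials of degree less than 2 over $\mathbb{Z}/3\mathbb{Z}$):

```python
def eval_shift(coeffs, s):
    """Evaluate at q = 2**s using only shifts and ORs (the << operator)."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc << s) | c  # coefficients fit in s bits, so | acts as +
    return acc

s = 4                      # q = 2^4 = 16 > n_q * k * (p-1)^2 = 8
a, b = [1, 1], [2, 1]      # X + 1 and X + 2 over Z/3Z
r = eval_shift(a, s) * eval_shift(b, s)
# Radix conversion is just masking and shifting out s-bit digits.
digits = [(r >> (s * i)) & (2**s - 1) for i in range(3)]
print([d % 3 for d in digits])  # prints [2, 0, 1], i.e. X^2 + 2
```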
It is shown in [4, Figures 5 & 6] that this wrapping is already a pretty good way to obtain high-speed linear algebra over some small extension fields. Indeed, we were able to reach peak performance quite close to that obtained with prime fields, namely 420 million field operations per second (Mop/s) on a 735 MHz Pentium III, and more than 500 Mop/s on a 64-bit 500 MHz DEC Alpha. This was roughly 20 percent below the pure floating point performance and 15 percent below the prime field implementation.
In the implementations of the subsequent algorithms, we will make extensive use of Euclidean division in exact arithmetic. Unfortunately, exact division
…(Full text truncated)…