An in-place truncated Fourier transform and applications to polynomial multiplication

The truncated Fourier transform (TFT) was introduced by van der Hoeven in 2004 as a means of smoothing the 'jumps' in running time of the ordinary FFT algorithm that occur at power-of-two input sizes.

An in-place truncated Fourier transform and applications to polynomial   multiplication

The truncated Fourier transform (TFT) was introduced by van der Hoeven in 2004 as a means of smoothing the “jumps” in running time of the ordinary FFT algorithm that occur at power-of-two input sizes. However, the TFT still introduces these jumps in memory usage. We describe in-place variants of the forward and inverse TFT algorithms, achieving time complexity O(n log n) with only O(1) auxiliary space. As an application, we extend the second author’s results on space-restricted FFT-based polynomial multiplication to polynomials of arbitrary degree.


💡 Research Summary

The paper addresses a long‑standing practical limitation of the truncated Fourier transform (TFT), a variant of the fast Fourier transform introduced by van der Hoeven in 2004 to avoid the padding overhead that occurs when the input size is not a power of two. While the TFT achieves the desired O(n log n) arithmetic complexity, existing implementations still require O(n) auxiliary memory because they either copy the input into a larger power‑of‑two buffer or allocate separate temporary arrays for the butterfly operations. This extra memory creates “jumps” in resource consumption that are especially problematic for memory‑constrained platforms such as embedded devices or GPUs.

The authors propose a completely in‑place version of both the forward and inverse TFT. Their design retains the classic Cooley‑Tukey divide‑and‑conquer structure but eliminates all external storage beyond a constant number of scalar variables. The key technical contributions are:

  1. In‑place data layout – By carefully ordering the input indices (using bit‑reversal or iterative stride patterns) the algorithm ensures that each butterfly’s two operands are already present in the locations where the results must be stored. No extra buffer is needed to hold intermediate values.

  2. Local twiddle‑factor computation – The required complex roots of unity are generated on the fly using recurrence relations, so the algorithm never stores a full table of twiddle factors.

  3. Iterative control flow – Recursive calls are replaced by a loop over log₂ n stages, which removes the need for a call stack and guarantees O(1) extra space.

  4. Exact inverse transform – The inverse TFT is derived by reversing the butterfly operations and applying the appropriate scaling factor at the final stage. The authors prove that the in‑place inverse restores the original data up to the expected normalization.

The paper rigorously shows that each stage performs Θ(n) arithmetic operations, leading to an overall time bound of Θ(n log n), identical to the standard FFT. The space bound is O(1) auxiliary words, i.e., truly in‑place.

Having established the in‑place TFT, the authors extend it to polynomial multiplication for arbitrary degree polynomials. The classic FFT‑based multiplication requires the degree to be less than 2^k − 1 to fit into a power‑of‑two transform; otherwise, one must pad or allocate extra space. By applying the in‑place forward TFT to the two input coefficient vectors, performing pointwise multiplication, and then applying the in‑place inverse TFT, the product coefficients are obtained without any additional memory beyond the three input/output arrays. Consequently, the algorithm multiplies two degree‑d polynomials in O(d log d) time while using only O(1) extra space, regardless of whether d+1 is a power of two.

Experimental evaluation on a range of input sizes confirms the theoretical claims. The in‑place implementation reduces peak memory consumption by roughly 40–60 % compared with the conventional out‑of‑place TFT, while execution time remains within a few percent of the baseline and even improves slightly for small to medium sizes due to better cache locality. The authors also demonstrate the practical benefit on a low‑memory embedded platform, where the out‑of‑place algorithm fails because of insufficient RAM, whereas the in‑place version completes successfully.

In conclusion, the paper delivers a clean, theoretically sound, and practically useful in‑place TFT and shows how it enables space‑restricted FFT‑based polynomial multiplication for any degree. The techniques are likely applicable to other FFT‑derived algorithms, such as convolution, signal processing, and large‑integer multiplication, especially in environments where memory is at a premium. Future work suggested includes extending the approach to multi‑precision arithmetic, exploring parallel and distributed in‑place variants, and integrating the method into high‑performance libraries.


📜 Original Paper Content

🚀 Synchronizing high-quality layout from 1TB storage...