Memory-Efficient Training with In-Place FFT Implementation


Fast Fourier transforms (FFTs) are widely used to reduce memory and computational costs in deep learning. However, existing implementations, including the standard FFT and the real FFT (rFFT), cannot achieve true in-place computation. In particular, rFFT maps an input of size n to a complex output of size n/2 + 1, causing a dimensional mismatch and requiring additional memory allocation. We propose the first real-domain, fully in-place FFT framework (rdFFT) that preserves input-output memory consistency. By leveraging the symmetry of butterfly operations and conjugate properties in the frequency domain, we design an implicit complex encoding scheme that eliminates intermediate cache usage entirely. Experiments on multiple natural language understanding tasks demonstrate the method's effectiveness in reducing training memory cost, offering a promising direction for frequency-domain lightweight adaptation.


💡 Research Summary

The paper addresses a critical bottleneck in training large neural networks: the memory overhead of Fourier transform operations. While the fast Fourier transform (FFT) and its real‑valued variant (rFFT) are widely used to accelerate models that rely on circulant or block‑circulant weight structures, existing libraries cannot perform a truly in‑place transformation. rFFT reduces the number of stored coefficients by exploiting Hermitian symmetry, but its output still occupies N + 2 real slots for an input of length N, forcing allocation of a larger buffer and breaking the in‑place paradigm.
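The size mismatch is easy to verify with NumPy's `rfft` (used here only as a stand-in; the paper targets GPU training, but the size arithmetic is the same):

```python
import numpy as np

n = 8
x = np.random.rand(n)        # n float64 values: n * 8 = 64 bytes

y = np.fft.rfft(x)           # n//2 + 1 = 5 complex128 coefficients
assert y.shape == (n // 2 + 1,)
assert y.nbytes == (n + 2) * 8   # 80 bytes: larger than the input,
                                 # so the result cannot overwrite x in place
```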

To solve this, the authors propose rdFFT (real‑domain fully in‑place FFT), a novel algorithm that keeps the entire computation within the original N‑element real buffer. The key insight is that every sub‑FFT generated by the Cooley‑Tukey recursion inherits the conjugate symmetry of the original real input. By arranging each complex coefficient yₖ (1 ≤ k < N/2) so that its real part resides at index k and its imaginary part at the symmetric index N − k, the full spectrum can be stored in exactly N real numbers. The special coefficients y₀ and y_{N/2} are purely real and occupy single slots. This “squeeze N + 2 into N” layout eliminates the need for any auxiliary complex buffer.
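The packed layout can be sketched as follows. This illustration uses a temporary complex transform purely to show the indexing; rdFFT itself never materializes the complex buffer. The ordering resembles FFTW's "halfcomplex" (r2hc) format:

```python
import numpy as np

def pack_spectrum(x):
    """Pack the rFFT of a real length-N signal (N even) into N real slots:
    slot 0 holds y_0 (purely real), slot N/2 holds y_{N/2} (purely real),
    and for 1 <= k < N/2, slot k holds Re(y_k), slot N-k holds Im(y_k)."""
    n = len(x)
    y = np.fft.rfft(x)               # illustration only: rdFFT avoids this buffer
    out = np.empty(n)
    out[0] = y[0].real
    out[n // 2] = y[n // 2].real
    for k in range(1, n // 2):
        out[k] = y[k].real
        out[n - k] = y[k].imag
    return out

def unpack_spectrum(packed):
    """Recover the n//2 + 1 complex coefficients from the packed layout."""
    n = len(packed)
    y = np.empty(n // 2 + 1, dtype=complex)
    y[0] = packed[0]
    y[n // 2] = packed[n // 2]
    for k in range(1, n // 2):
        y[k] = packed[k] + 1j * packed[n - k]
    return y

# Round trip: pack, unpack, invert -- the original signal is recovered exactly
x = np.random.rand(16)
recovered = np.fft.irfft(unpack_spectrum(pack_spectrum(x)), n=16)
assert np.allclose(recovered, x)
```

The round trip confirms that the full spectrum fits losslessly in N real numbers, because y₀ and y_{N/2} have zero imaginary parts.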

The algorithm rewrites the classic butterfly operation to work exclusively with real‑real arithmetic. Twiddle factors are split into separate real and imaginary components, and the four‑element symmetric groups (two conjugate pairs) are updated simultaneously, preserving conjugate symmetry without ever materializing complex numbers. Because the butterfly is inherently in‑place, the transformed data overwrites the input directly, achieving zero‑allocation intermediate tensors.
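The split-twiddle idea can be illustrated with a single radix-2 butterfly written entirely in real arithmetic (a minimal sketch, not the paper's full four-element symmetric-group update):

```python
import numpy as np

def real_butterfly(ar, ai, br, bi, wr, wi):
    """One radix-2 butterfly using only real arithmetic.
    Inputs are the real/imag parts of coefficients a, b and twiddle w.
    Returns the real/imag parts of a + w*b and a - w*b."""
    tr = wr * br - wi * bi   # Re(w * b)
    ti = wr * bi + wi * br   # Im(w * b)
    return (ar + tr, ai + ti), (ar - tr, ai - ti)

# Cross-check against ordinary complex arithmetic
a, b, w = 1 + 2j, 3 - 1j, np.exp(-2j * np.pi / 8)
(ur, ui), (vr, vi) = real_butterfly(a.real, a.imag, b.real, b.imag, w.real, w.imag)
assert np.isclose(ur + 1j * ui, a + w * b)
assert np.isclose(vr + 1j * vi, a - w * b)
```

Because every quantity is a plain real number, the update can read and write the packed slots of the N-element buffer directly, with no complex temporaries.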

A further contribution is full support for half‑precision (float16) data. Existing high‑performance FFT libraries (FFTW, cuFFT) lack native float16 complex handling, limiting their memory‑saving potential. rdFFT’s real‑only design sidesteps this limitation, allowing modern mixed‑precision training pipelines to benefit from the same memory reduction.
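A rough illustration of the storage argument in NumPy, which, like FFTW and cuFFT, offers no 16-bit complex dtype:

```python
import numpy as np

n = 1024
x16 = np.random.rand(n).astype(np.float16)      # n * 2 = 2048 bytes

# Without a complex32 dtype, storing an rFFT result forces at least
# complex64, quadrupling the per-element footprint.
y = np.fft.rfft(x16.astype(np.float32)).astype(np.complex64)
assert y.nbytes == (n // 2 + 1) * 8             # 4104 bytes

# A packed real-domain spectrum stays in n float16 slots:
packed = np.empty(n, dtype=np.float16)
assert packed.nbytes == x16.nbytes              # 2048 bytes: true in-place footprint
```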

The authors integrate rdFFT into circulant‑based adapters for several transformer models (BERT‑Base, RoBERTa‑Large, DeBERTa‑V3) and evaluate on standard natural‑language‑understanding benchmarks. Compared with a baseline using standard rFFT, rdFFT yields identical or negligibly different accuracy (≤ 0.1 % drop) while reducing peak GPU memory consumption by 12 %–18 % in FP32 mode. In FP16 mode the reduction exceeds 20 %, enabling larger batch sizes or deeper models on the same hardware. Profiling shows that intermediate tensors for FFT/IFFT are no longer allocated, confirming the true in‑place nature of the implementation.

Limitations are acknowledged: the current design handles only 1‑D transforms, so extending to 2‑D FFTs (common in vision tasks) requires additional layout strategies. The custom memory layout and butterfly reformulation introduce implementation complexity, potentially raising the barrier for adoption in frameworks that rely on black‑box FFT primitives. For extremely large N (e.g., 2²⁰ and above) the index‑mapping overhead may become non‑trivial, though still far smaller than the memory saved.

Future work includes generalizing the approach to multi‑dimensional FFTs, integrating the operator into automatic‑differentiation graph optimizers, and exploring hardware‑specific implementations (ASIC, FPGA) that can exploit the reduced data movement.

In summary, rdFFT delivers a mathematically exact, fully in‑place real‑domain FFT that aligns input and output memory, supports half‑precision, and demonstrably cuts training memory usage for transformer‑style models without sacrificing performance. This makes it a valuable tool for researchers and engineers working with memory‑constrained large‑scale deep learning systems.

