LLaVA-FA: Learning Fourier Approximation for Compressing Large Multimodal Models
Large multimodal models (LMMs) have achieved impressive performance on a wide range of vision-language tasks, but their substantial computational and memory costs hinder practical deployment. Existing compression methods often decouple low-rank decomposition from quantization, leading to compounded reconstruction errors, especially in multimodal architectures with cross-modal redundancy. To address this issue, we propose LLaVA-FA, a novel efficient LMM that performs joint low-rank-plus-quantization approximation in the frequency domain. By leveraging the decorrelation and conjugate-symmetry properties of the Fourier transform, LLaVA-FA achieves more compact and accurate weight representations. Furthermore, we introduce PolarQuant, a polar-coordinate quantization method tailored to complex matrices, and an optional diagonal calibration (ODC) scheme that eliminates the need for large-scale calibration data. Extensive experimental results demonstrate that LLaVA-FA outperforms existing efficient multimodal models across multiple benchmarks while keeping the number of activated parameters and the computational cost low, validating it as a practical solution for compressing LMMs.
💡 Research Summary
LLaVA‑FA introduces a novel compression framework for large multimodal models (LMMs) that jointly performs low‑rank approximation and quantization in the frequency domain. The authors observe that weight matrices of LMMs exhibit strong correlations in the spatial domain, which can be effectively decorrelated by applying a 2‑D Fourier transform. In the frequency domain, singular values decay more rapidly, and, because the original weights are real-valued, the resulting complex coefficients possess conjugate symmetry, so only half the spectrum needs to be stored without any loss of information. Leveraging these properties, LLaVA‑FA approximates each weight matrix W as the sum of a low‑rank component L₁L₂ and a quantized residual Q, all represented as complex matrices after the Fourier transform.
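The half-spectrum storage and the frequency-domain low-rank-plus-residual split can be sketched with NumPy's real-input FFT. The shapes, the rank `r`, and the plain SVD truncation below are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))  # toy real-valued "weight matrix"

# 2-D FFT of a real matrix: rfft2 keeps only the non-redundant half-spectrum,
# since conjugate symmetry F[-u, -v] = conj(F[u, v]) makes the rest recoverable.
F_half = np.fft.rfft2(W)                 # complex, shape (64, 65): ~half the spectrum
W_back = np.fft.irfft2(F_half, s=W.shape)
assert np.allclose(W, W_back)            # lossless round trip

# Frequency-domain low-rank + residual split: F(W) ~ L1 @ L2 + Q
F_full = np.fft.fft2(W)
U, S, Vh = np.linalg.svd(F_full, full_matrices=False)
r = 8                                    # illustrative target rank
L1 = U[:, :r] * S[:r]                    # scale columns by top singular values
L2 = Vh[:r]
Q = F_full - L1 @ L2                     # residual that would then be quantized
```

The residual `Q` carries only the energy not captured by the rank-`r` factors, which is what makes a low-bit quantizer adequate for it.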
The low‑rank factorization is obtained via Fourier‑SVD, optionally enhanced by a diagonal calibration (ODC) step that approximates the Hessian using row and column means, thereby eliminating the need for large calibration datasets. The residual Q is quantized with the newly proposed PolarQuant codec, which quantizes amplitude and phase with independent bit‑widths (b_r for amplitude, b_θ for phase). This polar‑coordinate quantization preserves the complex structure and reduces quantization error, while conjugate symmetry further halves the storage requirement.
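A minimal sketch of polar-coordinate quantization, assuming simple uniform codebooks for amplitude and phase; the actual PolarQuant codec may use different grids or calibration:

```python
import numpy as np

def polar_quant(Z, b_r=4, b_theta=4):
    """Quantize a complex matrix in polar form: amplitude and phase get
    independent uniform codebooks of b_r and b_theta bits (illustrative
    sketch; PolarQuant's exact codebooks may differ)."""
    r = np.abs(Z)
    theta = np.angle(Z)                          # phase in [-pi, pi]
    # Uniform amplitude grid over [0, r_max]
    r_max = r.max() + 1e-12
    levels_r = 2 ** b_r - 1
    r_q = np.round(r / r_max * levels_r) / levels_r * r_max
    # Uniform phase grid with 2**b_theta bins over the full circle
    step = 2 * np.pi / 2 ** b_theta
    theta_q = np.round(theta / step) * step
    return r_q * np.exp(1j * theta_q)

rng = np.random.default_rng(1)
Z = rng.standard_normal((32, 32)) + 1j * rng.standard_normal((32, 32))
Z_hat = polar_quant(Z, b_r=6, b_theta=6)
rel_err = np.linalg.norm(Z - Z_hat) / np.linalg.norm(Z)
```

Quantizing amplitude and phase separately keeps the angular structure of the complex coefficients, whereas quantizing real and imaginary parts independently would distort both jointly.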
A theoretical analysis (Lemma 3.1) proves that, for a given rank r, the Frobenius error of the frequency‑domain truncation is strictly smaller than that of a spatial‑domain truncation, assuming the singular values decay faster after Fourier transformation. The authors also derive an average bit‑budget formula showing that when the target rank k satisfies k < (1 − B_Q/B_L)·(∑d_i₁d_i₂)/(∑(d_i₁ + d_i₂)), the overall average bits per parameter B_avg fall below those of full‑precision storage.
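Plugging numbers into the bit-budget formula illustrates the rank threshold. The layer shapes and bit-widths below are hypothetical values chosen for illustration, not figures from the paper:

```python
# Hypothetical (d_i1, d_i2) layer shapes and bit-widths, purely illustrative.
dims = [(4096, 4096), (4096, 11008), (11008, 4096)]
B_L = 16   # bits per low-rank-factor entry (e.g. fp16)
B_Q = 4    # bits per quantized-residual entry

num = sum(d1 * d2 for d1, d2 in dims)      # total parameters
den = sum(d1 + d2 for d1, d2 in dims)      # low-rank cost per unit of rank
k_max = (1 - B_Q / B_L) * num / den        # rank threshold from the formula

def b_avg(k):
    """Average bits per parameter: low-rank factors at B_L plus residual at B_Q."""
    total_bits = sum(B_L * k * (d1 + d2) + B_Q * d1 * d2 for d1, d2 in dims)
    return total_bits / num
```

Any target rank k below `k_max` yields `b_avg(k) < B_L`, i.e. fewer average bits per parameter than full-precision storage, matching the inequality in the text.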
Empirically, LLaVA‑FA is evaluated on a suite of vision‑language benchmarks, including GQA, VQA, MMB, and hallucination‑oriented tests. Compared with state‑of‑the‑art efficient LMMs (e.g., MiniCPM‑V, LLaVA‑1.5‑7B) and even larger models such as Qwen‑VL‑Chat and DeepSeek‑VL, LLaVA‑FA achieves comparable or superior accuracy while using only 0.25 % of the training data and 0.3 % of the trainable parameters. Training cost analysis shows that a 70‑billion‑parameter model can be fine‑tuned with dramatically reduced GPU hours, and inference memory footprints are cut by nearly 50 % thanks to the half‑spectrum storage.
Importantly, LLaVA‑FA does not require token pruning, architectural modifications, or large calibration corpora, making it a drop‑in replacement for existing LMMs. The method compresses both the vision encoder and the language decoder simultaneously, addressing the “cross‑modal adapter bloat” problem that plagues many multimodal compression attempts.
In summary, LLaVA‑FA demonstrates that Fourier‑domain decorrelation, conjugate symmetry, and energy concentration can be harnessed to unify low‑rank adaptation and quantization into a single, jointly optimized compression objective. The introduction of PolarQuant and the optional diagonal calibration further stabilizes the low‑bit reconstruction of complex weights. This work sets a new benchmark for efficient, high‑performance multimodal model deployment, offering a practical pathway to bring large LMMs into resource‑constrained environments.