Quantized-TinyLLaVA: a new multimodal foundation model enabling efficient split learning
Multimodal foundation models are increasingly trained on sensitive data across domains such as finance, biomedicine, and personal identifiers. However, this distributed setup raises serious privacy concerns due to the need for cross-partition data sharing. Split learning addresses these concerns by enabling collaborative model training without raw data exchange between partitions, yet it introduces a significant challenge: transmitting high-dimensional intermediate feature representations between partitions incurs substantial communication costs. To address this challenge, we propose Quantized-TinyLLaVA, a multimodal foundation model with an integrated communication-efficient split learning framework. Our approach adopts a compression module that quantizes intermediate features into discrete representations before transmission, substantially reducing communication overhead. In addition, we derive a principled quantization strategy grounded in entropy coding theory to determine the optimal number of discrete representation levels. We deploy our framework in a two-partition setting, with one partition operating as the client and the other as the server, to realistically simulate distributed training. Under this setup, Quantized-TinyLLaVA achieves an approximate 87.5% reduction in communication overhead with 2-bit quantization, while maintaining the performance of the original 16-bit model across five benchmark datasets. Furthermore, our compressed representations exhibit enhanced resilience against feature inversion attacks, validating the privacy of the transmitted features. The code is available at https://github.com/anonymous-1742/Quantized-TinyLLaVA.
💡 Research Summary
The paper introduces Quantized‑TinyLLaVA, a multimodal foundation model designed to operate efficiently under split‑learning scenarios where raw data never leaves the client device. The authors identify two major bottlenecks in split learning: (1) the high communication cost of transmitting high‑dimensional intermediate activations between client and server, and (2) the vulnerability of these activations to feature‑inversion attacks that can leak private information. To address both issues, they embed a compression module directly into the TinyLLaVA architecture, which consists of a vision encoder, a modality‑alignment connector, and a large language model for downstream tasks.
The compression module combines two complementary techniques. First, they adapt the block‑wise double‑quantization strategy from QLoRA, originally used for weight compression, to quantize intermediate features. This method, generalized to arbitrary b‑bit precision (e.g., 2‑bit, 4‑bit), learns per‑block scaling factors and offsets, enabling low‑bit transmission while preserving the statistical properties of the features. Second, they propose Robust and Distortion‑aware Finite Scalar Quantization (RD‑FSQ), an improved version of the earlier FSQ method. RD‑FSQ replaces the tanh scaling with a linear scaling that includes outlier clipping (μ ± 3σ), and introduces a commitment loss based on cosine similarity (L_comm) to penalize rounding distortion. The total loss is a weighted sum of the standard cross‑entropy loss and α·L_comm, allowing gradients to flow through the quantization step via a straight‑through estimator.
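The RD‑FSQ forward pass described above can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions: the function names are hypothetical, the straight‑through estimator and the cross‑entropy term used during training are omitted, and only the linear scaling with μ ± 3σ clipping, the rounding to d = 2^b levels, and the cosine‑similarity commitment loss are shown.

```python
import numpy as np

def rd_fsq_quantize(x, bits=2):
    """Sketch of an RD-FSQ-style forward pass (hypothetical names):
    linear scaling with mu +/- 3*sigma outlier clipping, then rounding
    to d = 2**bits discrete levels. Returns the integer codes (the
    payload actually transmitted) and the dequantized reconstruction."""
    d = 2 ** bits
    mu, sigma = x.mean(), x.std()
    lo, hi = mu - 3 * sigma, mu + 3 * sigma
    x_clipped = np.clip(x, lo, hi)               # outlier clipping
    scaled = (x_clipped - lo) / (hi - lo) * (d - 1)
    codes = np.round(scaled).astype(np.int64)    # values lie in {0, ..., d-1}
    x_hat = codes / (d - 1) * (hi - lo) + lo     # receiver-side dequantization
    return codes, x_hat

def commitment_loss(x, x_hat):
    """Cosine-similarity commitment loss, L_comm = 1 - cos(x, x_hat),
    penalizing rounding distortion; added to the task loss as alpha * L_comm."""
    num = float(np.dot(x.ravel(), x_hat.ravel()))
    den = float(np.linalg.norm(x) * np.linalg.norm(x_hat)) + 1e-8
    return 1.0 - num / den

rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 16)).astype(np.float32)
codes, recon = rd_fsq_quantize(feat, bits=2)     # 2-bit codes in {0, 1, 2, 3}
loss = commitment_loss(feat, recon)
```

In training, the rounding step is non-differentiable, so gradients would be passed through it unchanged (the straight-through estimator mentioned in the paper); the sketch above only shows the inference-time numerics.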
A key theoretical contribution is the derivation of an entropy‑coding‑based criterion for selecting the optimal number of discrete representation levels d. By estimating the entropy H of the activation distribution and enforcing H ≤ log₂ d, the method determines the minimal bit‑width required to encode the features without excessive information loss. This eliminates costly trial‑and‑error tuning of quantization levels.
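The criterion H ≤ log₂ d can be sketched as follows. How the paper estimates H is not specified in this summary, so the histogram-based estimate and the bin width below are our assumptions; the selection rule itself (smallest power-of-two d with log₂ d at least the estimated entropy) follows the stated inequality.

```python
import numpy as np

def optimal_levels(x, bin_width=None):
    """Assumed sketch of the entropy-based criterion: estimate the
    entropy H (in bits) of the activation distribution with a histogram,
    then return the smallest power-of-two level count d with H <= log2(d)."""
    x = np.asarray(x).ravel()
    if bin_width is None:
        bin_width = x.std()                      # hypothetical tolerance choice
    edges = np.arange(x.min(), x.max() + bin_width, bin_width)
    hist, _ = np.histogram(x, bins=edges)
    p = hist[hist > 0] / hist.sum()
    H = float(-(p * np.log2(p)).sum())           # entropy estimate in bits
    d = 2 ** int(np.ceil(H))                     # smallest d with H <= log2(d)
    return H, d
```

By construction log₂ d = ⌈H⌉ ≥ H, so the inequality holds, and the required bit-width is simply ⌈H⌉ rather than a value found by trial and error.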
Experiments are conducted on five multimodal benchmark tasks (visual question answering, image captioning, cross‑modal retrieval, etc.) using a two‑partition split‑learning setup (client and server). The authors compare Quantized‑TinyLLaVA against the original 16‑bit model, vanilla FSQ, Randomized Top‑K sparsification, SplitFC, and QLoRA‑only quantization. With 2‑bit RD‑FSQ, they achieve an average 87.5% reduction in communication volume and a comparable reduction in transmission time, while task performance drops by less than 0.3% relative to the full‑precision baseline, a statistically insignificant difference.
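The 87.5% figure follows directly from the bit-widths: communication volume scales with the number of bits per transmitted feature, so moving from 16-bit to 2-bit activations removes 14 of every 16 bits (ignoring any small per-block metadata such as scaling factors from the double quantization).

```python
def comm_reduction(full_bits, quant_bits):
    """Fractional reduction in communication volume when features are
    transmitted at quant_bits instead of full_bits per element."""
    return 1.0 - quant_bits / full_bits

# 16-bit -> 2-bit: 1 - 2/16 = 0.875, i.e. an 87.5% reduction
print(f"{comm_reduction(16, 2):.1%}")  # prints "87.5%"
```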
The security evaluation uses simulated feature‑inversion attacks that attempt to reconstruct the original image from the transmitted compressed activations. RD‑FSQ consistently yields the lowest reconstruction quality (PSNR < 15 dB), indicating strong resistance, whereas vanilla FSQ shows higher PSNR and thus weaker protection.
The paper acknowledges limitations: the current validation is limited to a single client‑server split; extending to multiple participants or asynchronous training may introduce new synchronization and bandwidth challenges. Moreover, while 2‑bit quantization works well, pushing to 1‑bit could destabilize training, especially for high‑resolution visual inputs.
In summary, Quantized‑TinyLLaVA demonstrates that carefully designed quantization and discrete representation learning can dramatically cut communication overhead and enhance privacy in split‑learning environments without sacrificing the accuracy of multimodal foundation models. The work provides both a practical system (RD‑FSQ + QLoRA‑style quantization) and a theoretical framework (entropy‑based level selection), paving the way for privacy‑preserving, bandwidth‑efficient deployment of large multimodal models in sensitive domains such as finance and biomedicine.