PPE: Positional Preservation Embedding for Token Compression in Multimodal Large Language Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Multimodal large language models (MLLMs) have achieved strong performance on vision-language tasks, yet often suffer from inefficiencies due to redundant visual tokens. Existing token merging methods reduce sequence length but frequently disrupt spatial layouts and temporal continuity by disregarding positional relationships. In this work, we propose a novel encoding operator, dubbed **P**ositional **P**reservation **E**mbedding (**PPE**), whose main hallmark is the preservation of spatiotemporal structure during visual token compression. PPE explicitly introduces a disentangled encoding of 3D positions in the token dimension, enabling each compressed token to encapsulate different positions from multiple original tokens. Furthermore, we show that PPE can effectively support cascade clustering, a progressive token compression strategy that leads to better performance retention. PPE is a parameter-free, generic operator that can be seamlessly integrated into existing token merging methods without any adjustments. Applied to a state-of-the-art token merging framework, PPE achieves consistent improvements of 2%–5% across multiple vision-language benchmarks, including MMBench (general vision understanding), TextVQA (layout understanding), and VideoMME (temporal understanding). These results demonstrate that preserving positional cues is critical for efficient and effective MLLM reasoning. Our code is available at https://github.com/MouxiaoHuang/PPE.


💡 Research Summary

The paper addresses a critical inefficiency in multimodal large language models (MLLMs): the redundancy of dense visual token sequences, which leads to high computational and memory costs. Existing token merging or compression techniques reduce the sequence length by clustering similar visual tokens, but they typically discard or overwrite positional information. This loss of spatial layout (for images) and temporal continuity (for videos) harms performance on tasks that require fine‑grained reasoning about object positions, layouts, or temporal order, such as TextVQA, layout‑aware VQA, and video grounding.

To solve this, the authors propose Positional Preservation Embedding (PPE), a parameter‑free operator that can be inserted into any token merging pipeline without changing the underlying model architecture. PPE builds on the principle of rotary position embeddings (RoPE) and its 3‑D extension (MRoPE), which encode positional IDs by rotating the query and key vectors in each attention head. The key insight is that the rotation is applied independently per embedding dimension, allowing different dimensions to carry different position IDs simultaneously.
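A minimal NumPy sketch can make this per-dimension independence concrete. This is illustrative rather than the paper's implementation: `rope_rotate` is a hypothetical helper that rotates each two-element pair of a vector by its own position ID, so different pairs can carry different positions simultaneously, which is exactly the property PPE exploits.

```python
import numpy as np

def rope_rotate(x, pos_ids, base=10000.0):
    """Rotary position embedding on a vector x of even dimension D.

    pos_ids has one entry per dimension pair, so different pairs can
    carry different position IDs (the property PPE builds on). With all
    entries equal this reduces to standard RoPE. Interleaved pairing
    (indices 2i, 2i+1 form pair i) is one common convention.
    """
    D = x.shape[-1]
    assert D % 2 == 0 and len(pos_ids) == D // 2
    freqs = base ** (-np.arange(D // 2) * 2.0 / D)   # theta_i = base^(-2i/D)
    angles = np.asarray(pos_ids) * freqs             # pair i rotated by pos_ids[i] * theta_i
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin                  # 2-D rotation of each pair
    out[1::2] = x1 * sin + x2 * cos
    return out

v = np.ones(8)
std = rope_rotate(v, [5, 5, 5, 5])      # one position for every pair: standard RoPE
mixed = rope_rotate(v, [5, 5, 12, 12])  # two different positions in one token
```

Because each pair is rotated independently, the first two pairs of `std` and `mixed` coincide while the last two differ, and the rotation never changes the vector's norm.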

PPE exploits this by splitting the embedding dimension D into K equal groups (K is chosen as a divisor of D). For each merged token, instead of assigning a single position ID, PPE stores up to K distinct position IDs (one per group) derived from the original tokens that were clustered together. The selection of which IDs to keep follows a simple top‑K strategy: the K tokens whose embeddings are closest to the cluster centroid (i.e., highest similarity) are chosen, and if the cluster contains fewer than K tokens, high‑weight tokens are duplicated to fill the slots. The resulting composite position ID m̂ is then used in the standard RoPE rotation formula, effectively giving a single compressed token a multi‑position representation.
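The top‑K selection just described can be sketched as follows. `composite_position_ids` is a hypothetical helper reconstructed from the summary; in particular, cosine similarity to the cluster centroid is an assumed similarity measure, not a detail the summary pins down.

```python
import numpy as np

def composite_position_ids(token_embs, token_pos_ids, K):
    """Pick K position IDs for one merged token (sketch of PPE's top-K rule).

    token_embs:    (N, D) embeddings of the tokens in one cluster
    token_pos_ids: length-N list of the tokens' original position IDs
    Returns the IDs of the K tokens most similar to the cluster centroid;
    if the cluster holds fewer than K tokens, top IDs are duplicated.
    """
    centroid = token_embs.mean(axis=0)
    # cosine similarity of each token to the centroid (assumed measure)
    sims = token_embs @ centroid / (
        np.linalg.norm(token_embs, axis=1) * np.linalg.norm(centroid) + 1e-8)
    order = np.argsort(-sims)                # most similar first
    idx = order[: min(K, len(order))]
    chosen = [token_pos_ids[i] for i in idx]
    while len(chosen) < K:                   # duplicate high-weight tokens
        chosen.append(chosen[len(chosen) % len(idx)])
    return chosen
```

For example, in a three-token cluster with K = 2 the two tokens nearest the centroid contribute their IDs, while a single-token cluster with K = 3 repeats its one ID three times.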

Because PPE operates only on positional IDs, it introduces zero additional parameters and negligible computational overhead. It can be combined with any existing clustering‑based token compression method (e.g., ChatUniVi, PACT) and works equally well for both image and video inputs, handling 2‑D spatial IDs for images and 3‑D spatiotemporal IDs for video frames.
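As an illustration of the ID convention for the two modalities, the hypothetical helper below enumerates (t, h, w) triples for a video token grid; with T = 1 it reduces to the 2‑D spatial case for a single image.

```python
def spatiotemporal_ids(T, H, W):
    """Enumerate 3-D (t, h, w) position IDs for a T-frame, H x W token grid
    in raster order, one triple per visual token. T = 1 yields the 2-D
    spatial IDs used for still images. Illustrative convention only."""
    return [(t, h, w) for t in range(T) for h in range(H) for w in range(W)]
```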

The authors further integrate PPE into a cascade compression framework. In cascade compression, token merging is performed at multiple transformer layers: early layers retain more tokens to preserve low‑level semantics, while deeper layers aggressively merge redundant high‑level representations. PPE’s multi‑position IDs survive across stages, gradually reducing in number but always preserving the most salient spatial/temporal cues. Experiments show that this staged approach yields higher compression ratios (up to 90% token reduction) with minimal performance loss.
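The staged idea can be sketched as a toy pipeline. Everything here is illustrative: greedy merging of the most similar adjacent pair stands in for the paper's clustering (which the summary does not fully specify), and the keep ratios and K value are made-up parameters.

```python
import numpy as np

def cascade_compress(tokens, pos_ids, keep_ratios, K=4):
    """Toy cascade compression: each stage shrinks the token list to a
    target fraction by merging the most similar adjacent pair, carrying
    up to K position IDs per merged token across stages (PPE-style).
    """
    tokens = list(tokens)
    pos_ids = [[p] for p in pos_ids]         # each token starts with one ID
    for ratio in keep_ratios:
        target = max(1, int(len(tokens) * ratio))
        while len(tokens) > target:
            # find the most similar adjacent pair (cosine similarity)
            sims = [tokens[i] @ tokens[i + 1] /
                    (np.linalg.norm(tokens[i]) * np.linalg.norm(tokens[i + 1]) + 1e-8)
                    for i in range(len(tokens) - 1)]
            i = int(np.argmax(sims))
            merged = (tokens[i] + tokens[i + 1]) / 2
            merged_ids = (pos_ids[i] + pos_ids[i + 1])[:K]   # keep at most K IDs
            tokens = tokens[:i] + [merged] + tokens[i + 2:]
            pos_ids = pos_ids[:i] + [merged_ids] + pos_ids[i + 2:]
    return tokens, pos_ids

# 8 tokens compressed in two stages (8 -> 4 -> 2); merged tokens keep
# multiple original position IDs instead of a single overwritten one.
toks = [np.array([float(i), 1.0]) for i in range(8)]
out_tokens, out_ids = cascade_compress(toks, list(range(8)), [0.5, 0.5], K=4)
```

Note how the surviving tokens each carry a small set of original positions rather than one fused ID, which is the cue-retention behavior the cascade framework relies on.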

Empirical evaluation spans three benchmark suites:

  1. MMBench (general image understanding) – PPE‑augmented models improve accuracy by 2–3% over baseline compression.
  2. TextVQA (layout‑aware visual question answering) – gains of 3–5% are reported, highlighting the benefit of preserved layout cues.
  3. VideoMME (temporal reasoning in videos) – improvements of around 4% demonstrate that temporal continuity is effectively maintained.

Additional analyses quantify ID retention: PPE retains on average >85% of original positional IDs across compression stages, compared to <40% for conventional methods. No extra FLOPs are introduced, and inference speed is comparable or slightly faster due to the reduced token count.

Limitations include the dependence on the choice of K, which must divide the embedding dimension; small D values may limit the number of positions that can be stored. The current top‑K selection is heuristic and does not learn optimal weighting for merging; future work could explore learnable weighting, dynamic K adjustment, or more sophisticated non‑linear merging strategies.

In summary, PPE offers a lightweight, plug‑and‑play solution that preserves spatiotemporal structure during visual token compression, enabling MLLMs to operate efficiently on high‑resolution images and long video sequences without sacrificing the fine‑grained positional information essential for many vision‑language tasks. The code is publicly released, facilitating reproducibility and further research.

