GMSA: Enhancing Context Compression via Group Merging and Layer Semantic Alignment
Large Language Models (LLMs) have achieved remarkable performance across a wide range of Natural Language Processing (NLP) tasks. However, in long-context scenarios, they face two challenges: high computational cost and information redundancy. To address these challenges, we propose GMSA, an encoder-decoder context compression framework that generates a compact sequence of soft tokens for downstream tasks. GMSA introduces Group Merging to achieve more uniform aggregation, mitigating semantic dominance during autoencoder pretraining, and Layer Semantic Alignment (LSA) to bridge the semantic gap between high-level abstract semantics and low-level input semantics. We first pretrain GMSA as an autoencoder and then fine-tune it for downstream tasks. Experiments demonstrate that GMSA improves context reconstruction compared to existing soft prompt compression paradigms and outperforms baselines on multiple long-context question answering and summarization benchmarks across two backbone models, while maintaining low end-to-end latency.
💡 Research Summary
The paper addresses two fundamental challenges that arise when large language models (LLMs) process long contexts: the quadratic computational cost of the Transformer’s attention mechanism and the redundancy of information that can degrade performance. Existing soft‑prompt compression methods mitigate these issues by learning a short sequence of continuous (soft) tokens, but they suffer from two critical limitations. First, during auto‑encoder pre‑training the model tends to focus on a few “anchor” tokens, a phenomenon the authors call semantic dominance, which causes the semantics of many input tokens to be diluted in the learned summary vectors. Second, the high‑level abstract summary vectors are fed directly to the decoder, creating a large semantic gap between the representations of the encoder’s upper layers and the decoder’s lower layers; prior work typically uses a simple MLP to bridge this gap, which is insufficient.
GMSA (Group Merging and Layer Semantic Alignment) proposes two complementary mechanisms to overcome these problems. Group Merging partitions the encoder’s hidden states into equal‑sized groups according to the desired compression rate and averages each group’s token embeddings. This uniform aggregation prevents any single token from dominating the representation, thereby reducing semantic dominance during pre‑training. Layer Semantic Alignment (LSA) then projects the resulting high‑level summary vectors into the low‑level semantic space of the decoder. LSA is implemented as a small stack of Transformer blocks whose weights are copied from the first k decoder layers (the authors find a single layer sufficient). By passing the compressed vectors through LSA, the model aligns the abstract semantics with the decoder’s native representation space, effectively closing the cross‑layer semantic gap.
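The uniform aggregation step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes hidden states are plain vectors and that the sequence length is divisible by the compression rate; the function name and data layout are illustrative.

```python
def group_merge(hidden_states, compression_rate):
    """Partition consecutive token hidden states into groups of size
    `compression_rate` and average each group into one soft token,
    so no single token dominates the merged representation."""
    dim = len(hidden_states[0])
    merged = []
    for start in range(0, len(hidden_states), compression_rate):
        group = hidden_states[start:start + compression_rate]
        # Element-wise mean over the group's vectors
        merged.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return merged

# Toy example: 8 tokens with 2-dim states, 4x compression -> 2 soft tokens
states = [[float(i), float(-i)] for i in range(8)]
print(group_merge(states, 4))  # [[1.5, -1.5], [5.5, -5.5]]
```

In the actual model the merged vectors are then passed through the LSA module (Transformer blocks initialized from the decoder's first layers) before reaching the decoder.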
Training proceeds in two stages. In the auto‑encoder stage, only the encoder (fine‑tuned with LoRA) and the LSA module are trained to reconstruct the original text, using a token‑level cross‑entropy loss. This forces the compressed soft tokens to retain as much information as possible. In the downstream stage, the decoder is fine‑tuned on specific tasks (question answering, summarization) while the encoder and LSA are frozen; the decoder learns to extract knowledge from the compressed tokens.
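The auto-encoder stage's objective is a standard token-level cross-entropy: the decoder must reproduce the original tokens from the compressed soft tokens. A hedged toy sketch of that loss, with made-up per-position probability distributions standing in for the model's softmax outputs over the vocabulary:

```python
import math

def reconstruction_loss(probs, targets):
    """Mean negative log-likelihood of the original tokens under the
    decoder's predicted per-position distributions (token-level CE)."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)

# Two positions over a toy 3-token vocabulary (illustrative values)
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
targets = [0, 1]  # the original token ids to reconstruct
print(round(reconstruction_loss(probs, targets), 4))  # 0.2899
```

Minimizing this loss pressures the soft tokens to preserve enough information for exact reconstruction, which is why the encoder (LoRA-tuned) and LSA are trained here while the decoder stays frozen.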
Experiments were conducted on two backbone LLMs (Qwen‑3‑4B and LLaMA‑3.2‑3B) with compression ratios of 4× and 8×. The authors evaluate context reconstruction on the PwC dataset using BLEU, ROUGE, BERTScore, and Prefix Exact Match, and downstream performance on NaturalQuestions, HotpotQA, 2WikiMQA, NarrativeQA, and MultiNews using EM/F1. GMSA consistently outperforms baselines such as ICAE‑AE, 500xCompressor, and other recent soft‑prompt compressors, achieving 2–5 percentage‑point gains in reconstruction metrics and 3–6 percentage‑point improvements in downstream EM/F1. Moreover, inference latency is reduced by 15–20% compared with prior methods, thanks to the lightweight averaging in Group Merging.
Ablation studies confirm the importance of both components: removing Group Merging (using naïve averaging) leads to a sharp drop in reconstruction quality, while omitting LSA re‑introduces the cross‑layer semantic gap and degrades QA performance. The authors also note that a single LSA layer suffices, keeping parameter overhead minimal.
Limitations include the fixed group size, which may not be optimal for highly variable input lengths, and the reliance on simple averaging rather than a learned aggregation mechanism. The experiments are limited to inputs up to 32 K tokens, leaving scalability to truly massive contexts (hundreds of thousands of tokens) an open question.
In summary, GMSA introduces a principled way to compress long contexts into soft prompts by (1) uniformly merging token representations to avoid semantic dominance and (2) aligning high‑level summary vectors with the decoder’s low‑level representation space. This dual approach yields better semantic fidelity, higher downstream accuracy, and lower latency, offering a practical solution for deploying LLMs in long‑context applications.