Information-Theoretic Design and Performance Prediction of Compressor-Predictor Systems

Reading time: 6 minutes
...

📝 Abstract

Agentic language model (LM) systems power modern applications like “Deep Research” and “Claude Code,” and leverage multi-LM architectures to overcome context limitations. Beneath their apparent diversity lies a recurring pattern: smaller “compressor” LMs (that can even run locally) distill raw context into compact text that is then consumed by larger “predictor” LMs. Despite their popularity, the design of compressor-predictor systems remains largely ad hoc, with little guidance on how compressor and predictor choices shape downstream performance. In practice, attributing gains to compression versus prediction requires costly, task-specific pairwise sweeps. We argue that these agentic system design questions are, at root, information-theoretic. Viewing the compressor LM as a noisy channel, we introduce a simple estimator of mutual information between the context and its compression to quantify compression quality in a task-independent way. We show that mutual information strongly predicts downstream performance, independent of any specific task. Through an information-theoretic framework, we perform a comprehensive empirical analysis across five datasets and three model families. Results reveal that larger compressors not only are more accurate, but also more token-efficient, conveying more bits of information per token. A 7B Qwen-2.5 compressor, for instance, is 1.6× more accurate, 4.6× more concise, and conveys 5.5× more bits of mutual information per token than its 1.5B sibling. Across datasets, scaling compressors is substantially more effective than scaling predictors, enabling larger on-device compressors to pair with smaller cloud predictors. Applied to a Deep Research system, these principles enable local compressors as small as 3B parameters to recover 99% of frontier-LM accuracy at 26% of API costs.

📄 Content

Agentic language model (LM) systems have quickly become the backbone of modern AI workflows. From “Deep Research” systems [24] to Claude Code [3], millions of users now interact with pipelines where one model processes information and another builds on its outputs. Modern workflows commonly involve analyzing and generating more tokens than even the largest frontier models can handle effectively, degrading model performance, a failure mode referred to as context rot [28]. Multi-LM systems coordinate multiple models to manage reasoning and memory beyond a single model’s context window. While these architectures vary widely, a recurring pattern emerges across domains: smaller compressor models distill raw contexts into compact texts, which are then consumed by larger predictor models that output an answer and interact with the user (Figure 1) [24,55].

At present, however, designing compressor-predictor agentic systems remains largely trial-and-error. We lack a basic understanding of how the choice of compressor and predictor affects downstream performance. Specifically, we cannot determine whether credit belongs to the compressor’s distillation or the predictor’s reasoning; we lack task-agnostic methods to evaluate the compressor’s outputs independently from downstream performance. This is because we are unable to measure how much of the original context the compressor actually preserves, which in turn determines how effectively the predictor can reason. This attribution problem has immediate practical consequences: as new models are released and practitioners swap components, they have no principled way to identify which module to improve without sweeping across the compound system from scratch.

To address this gap, we take an information-theoretic perspective, viewing the compressor as a noisy channel between the raw data and the predictor model. This framing allows us to evaluate communication between the two models rather than treat it heuristically. We propose using mutual information (MI) between the raw context and its compression as a task-agnostic proxy of compressor efficacy, analogous to how perplexity serves as a task-agnostic proxy of downstream performance [27,32]. We then conduct a rate-distortion analysis to measure how downstream task performance varies with the degree of compression. While it is intractable to calculate MI exactly between two token sequences linked via a nonlinear model, we develop a simple, unbiased estimator that can be computed via modern inference servers without requiring full vocabulary log probabilities.
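The paper's specific estimator is not reproduced here; as a rough illustration of the general idea, a standard Monte Carlo estimate of I(context; compression) contrasts each compression's conditional log-probability under its matched context against an empirical marginal averaged over contexts. This is a generic sketch under that assumption, not the authors' construction, and all names are hypothetical:

```python
import math

def logsumexp(xs):
    # numerically stable log(sum(exp(x))) over a list of floats
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def estimate_mi_bits(logp):
    """Monte Carlo MI estimate (in bits) from an N x N matrix where
    logp[i][j] = log p(z_i | x_j): the compressor's log-probability of
    compression z_i given context x_j. The empirical marginal log p(z_i)
    is approximated by averaging p(z_i | x_j) over the N contexts."""
    n = len(logp)
    total = 0.0
    for i in range(n):
        log_marginal = logsumexp([logp[i][j] for j in range(n)]) - math.log(n)
        total += logp[i][i] - log_marginal  # pointwise MI in nats
    return total / n / math.log(2)  # average, converted to bits
```

On a toy 2-context matrix where matched pairs dominate, the estimate approaches log2(2) = 1 bit, the maximum resolvable from two samples; when the compression is independent of the context (all entries equal), it returns 0. Dividing the estimate by the compression's token count gives an information rate in bits per token.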

With this new information-theoretic lens, we perform extensive empirical studies on five datasets (LongHealth [1], FinanceBench [30], QASPER [13], WildChat [71], and FineWeb [48]) to answer the following questions:

  1. Should you spend compute on the compressor or predictor? We find that compressor quality overwhelmingly governs performance: scaling a Qwen-2.5 compressor from 1B to 7B improves accuracy by 60% whereas scaling the predictor from 70B to 405B yields only a 12% improvement on LongHealth. This establishes a simple design principle: “front-load” compute into compressors, perhaps running on-device, to reduce dependence on massive cloud-hosted predictors. (Section 3.1)

  2. Which compressors are more token-efficient? We find that larger compressors emit fewer output tokens while maintaining quality: in many model families, scaling compressor size not only improves accuracy but also produces compressions that are up to 4.6× more concise. This token efficiency yields sublinear scaling of FLOPs-per-generation as a function of model size. Strikingly, increasing the Qwen-2.5 compressor from 1.5B to 7B adds only 1.3% more FLOPs-per-generation. (Section 3.1)

  3. Which factors determine compression quality and how do they relate to downstream performance? We find that larger compressors’ outputs carry up to 5.4× more MI about the context (Section 3.2). Rate-distortion analysis reveals that information rate (MI per token) correlates strongly with downstream performance and perplexity (r = -0.84, R² = 0.71), providing a practical proxy for predicting system performance without full end-to-end evaluation. (Section 3.3)

  4. With so many knobs to turn, which factors should you focus on for agentic system design? We perform a meta-analysis across model families, sizes, and datasets, exposing a clear hierarchy of importance: compressor model family > compressor size > predictor size. (Section 3.4)
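To make the token-efficiency claim in item 2 concrete: using the common rule of thumb that a decoder-only forward pass costs roughly 2 × parameters FLOPs per generated token (an assumption for illustration, not a figure from the paper), a 7B compressor that emits about 4.6× fewer tokens costs barely more per generation than a 1.5B one:

```python
def flops_per_generation(params, tokens):
    # rule of thumb: ~2 * params FLOPs per generated token (forward pass only)
    return 2 * params * tokens

small = flops_per_generation(1.5e9, 1000)      # 1.5B model, 1000 output tokens
large = flops_per_generation(7e9, 1000 / 4.6)  # 7B model, ~4.6x fewer tokens
ratio = large / small                          # 7 / (1.5 * 4.6), roughly 1.01
```

The ~1% overhead from this back-of-the-envelope version is within rounding of the stated 1.3% figure; the exact value depends on the measured conciseness ratio.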

As a practical demonstration, we apply our findings to a simplified Deep Research pipeline, where a single predictor aggregates outputs from multiple compressors. This system achieves 99% of frontier-LM accuracy on the DeepResearch Bench benchmark [14] using local compressor models as small as 3B, reducing API costs by 74% (Section 3.5).

Agents and Multi-Agent Systems

We define AI agents as LMs that are embedded in a context and are able to reason, plan, and act through tool use [5,57,63,67]. The a

This content is AI-processed based on ArXiv data.
