ConPress: Learning Efficient Reasoning from Multi-Question Contextual Pressure

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Large reasoning models (LRMs) typically solve reasoning-intensive tasks by generating long chain-of-thought (CoT) traces, leading to substantial inference overhead. We identify a reproducible inference-time phenomenon, termed Self-Compression: when multiple independent and answerable questions are presented within a single prompt, the model spontaneously produces shorter reasoning traces for each question. This phenomenon arises from multi-question contextual pressure during generation and consistently manifests across models and benchmarks. Building on this observation, we propose ConPress (Learning from Contextual Pressure), a lightweight self-supervised fine-tuning approach. ConPress constructs multi-question prompts to induce self-compression, samples the resulting model outputs, and parses and filters per-question traces to obtain concise yet correct reasoning trajectories. These trajectories are directly used for supervised fine-tuning, internalizing compressed reasoning behavior in single-question settings without external teachers, manual pruning, or reinforcement learning. With only 8k fine-tuning examples, ConPress reduces reasoning token usage by 59% on MATH500 and 33% on AIME25, while maintaining competitive accuracy.


💡 Research Summary

The paper “ConPress: Learning Efficient Reasoning from Multi‑Question Contextual Pressure” investigates a previously unnoticed inference‑time behavior of large reasoning models (LRMs). When a prompt contains several independent, answerable questions, the model automatically generates shorter chain‑of‑thought (CoT) traces for each question compared with a single‑question prompt. The authors call this phenomenon Self‑Compression and attribute it to Contextual Pressure: the need to complete multiple reasoning processes within a single generation forces the model to adopt more concise continuations.

Empirical Characterization
The authors conduct systematic experiments on two state‑of‑the‑art models (DeepSeek‑R1‑Distill‑Qwen‑7B and Qwen3‑4B‑Thinking) using the MATH dataset and several olympiad‑style benchmarks. They vary the number of questions N (1 ≤ N ≤ 8) while keeping the target question fixed. Results show a monotonic reduction in per‑question reasoning length as N grows: compression ratios rise from ~48% at N = 2 to over 65% at N = 8. Importantly, the effect is robust across model families and is not reproduced by prompt modifications that add content without a second question (e.g., inserting a declarative statement, an empty placeholder, or a short instruction). Even a trivial auxiliary question such as "1 + 1 = ?" yields substantial compression, and the difficulty of the extra question only modestly influences the compression rate. Accuracy degrades gradually with larger N, though the drop is modest for the more robust Qwen3‑4B‑Thinking model.
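For intuition, the compression ratio quoted here is simply the fraction of reasoning tokens saved relative to the single‑question (N = 1) baseline. A minimal sketch; the average trace lengths below are hypothetical stand‑ins chosen only to mirror the reported ~48% / >65% range, not measurements from the paper:

```python
def compression_ratio(single_len, multi_len):
    """Fraction of reasoning tokens saved for the target question when
    it is packed with others, relative to asking it alone (N = 1)."""
    return 1.0 - multi_len / single_len

# Hypothetical average trace lengths (tokens) for one fixed target
# question as the number of packed questions N grows; illustrative only.
avg_tokens = {1: 4000, 2: 2080, 4: 1600, 8: 1380}

ratios = {
    n: compression_ratio(avg_tokens[1], avg_tokens[n])
    for n in avg_tokens if n > 1
}
for n, r in sorted(ratios.items()):
    print(f"N={n}: ~{r:.1%} of reasoning tokens saved")
```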

ConPress Method
Motivated by the observation that multi‑question contexts naturally elicit concise reasoning, the authors propose ConPress, a lightweight self‑supervised fine‑tuning pipeline:

  1. Multi‑question sampling – Randomly pack N independent questions into a single prompt using a neutral “Question i:” delimiter.
  2. Model generation – Sample the LRM to obtain per‑question reasoning traces r_i and predicted answers ô_i.
  3. Rejection filtering – Keep only those traces where ô_i matches the ground‑truth answer o_i, discarding incorrect or malformed outputs. The retained triples (q_i, r_i, o_i) form a self‑generated dataset D_CP.
  4. Single‑question transfer – Fine‑tune the same model on D_CP using standard supervised learning (token‑level negative log‑likelihood) so that, when presented with a single question, it reproduces the compressed trace r_i.
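Steps 1–3 above can be sketched as a short data‑construction loop. The model call here is a stub (a real pipeline would sample the LRM and parse its multi‑question output), and all function names are illustrative, not the authors' code:

```python
import random

def pack_prompt(questions):
    """Step 1: pack N independent questions into one prompt
    using a neutral "Question i:" delimiter."""
    return "\n\n".join(f"Question {i + 1}: {q}" for i, q in enumerate(questions))

def sample_model(prompt, questions):
    """Step 2 (stub): pretend the LRM answered each packed question.
    We evaluate the arithmetic directly and deliberately botch any
    question containing a "7" so the filter below has work to do."""
    return [
        (f"<concise trace for {q}>", "wrong" if "7" in q else str(eval(q)))
        for q in questions
    ]

def build_conpress_dataset(pool, answers, n_pack, rng):
    """Steps 1-3: sample a multi-question prompt, collect per-question
    traces, and keep only (q_i, r_i, o_i) triples whose predicted
    answer matches the ground truth (rejection filtering)."""
    qs = rng.sample(pool, n_pack)
    prompt = pack_prompt(qs)
    kept = []
    for q, (trace, pred) in zip(qs, sample_model(prompt, qs)):
        if pred == answers[q]:
            kept.append({"question": q, "trace": trace, "answer": pred})
    return kept  # D_CP: concise, verified traces ready for SFT (step 4)

pool = ["2+2", "3+4", "6+7", "9+1"]
answers = {q: str(eval(q)) for q in pool}
d_cp = build_conpress_dataset(pool, answers, n_pack=4, rng=random.Random(0))
```

In the paper, roughly 8k such verified triples suffice for the fine‑tuning stage.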

Thus, ConPress converts a phenomenon that only appears under multi‑question pressure into a training signal that improves single‑question inference without any external teacher, manual pruning, or reinforcement‑learning (RL) reward design.
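The step‑4 objective is the standard token‑level SFT loss; a sketch in the summary's notation, under the assumption that the training target is the concatenated trace and answer y_i = (r_i, o_i):

```latex
\mathcal{L}(\theta) = - \sum_{(q_i, r_i, o_i) \in \mathcal{D}_{\mathrm{CP}}} \sum_{t=1}^{|y_i|} \log p_\theta\left(y_{i,t} \mid q_i, y_{i,<t}\right), \qquad y_i = (r_i, o_i)
```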

Results
Using only ~8 k self‑generated examples, ConPress achieves dramatic token savings while preserving accuracy:

Model               Benchmark   Token reduction   Accuracy change
Qwen3‑4B‑Thinking   MATH500     –48.7 %           –0.6 %
Qwen3‑4B‑Thinking   AIME25      –33.9 %           –0.6 %
R1‑Distill‑Qwen‑7B  MATH500     –45.4 %           –0.0 %
R1‑Distill‑Qwen‑7B  AIME25      –36.3 %           –1.2 %

Compared with recent baselines that rely on RL‑based token penalties (DPO shortest) or teacher‑driven distillation (RFT shortest), ConPress attains comparable or superior efficiency‑accuracy trade‑offs while requiring far less engineering effort. Notably, the method works without any explicit length constraints, reward shaping, or external data curation.

Implications and Limitations
The work reveals that LRM generation dynamics are highly sensitive to prompt structure, suggesting new avenues for controlling model behavior through “contextual pressure” rather than external supervision. However, the compression comes at the cost of a modest accuracy dip as N grows, indicating a trade‑off that must be managed in production settings. The study focuses on mathematical reasoning; extending the approach to code generation, commonsense reasoning, or multimodal tasks remains an open question. Moreover, a quantitative metric for contextual pressure could enable adaptive control of compression strength.

Conclusion
ConPress demonstrates that a simple, self‑supervised fine‑tuning loop can harness an intrinsic model property—self‑compression under multi‑question prompts—to produce substantially more token‑efficient reasoning. By eliminating the need for teacher models, handcrafted pruning, or costly RL pipelines, it offers a practical, scalable route to reduce inference costs for large reasoning models while maintaining high performance. Future work may explore formalizing contextual pressure, applying the technique to broader domains, and integrating dynamic pressure modulation into interactive AI systems.

