Predicting LLM Output Length via Entropy-Guided Representations
The long-tailed distribution of sequence lengths in LLM serving and reinforcement learning (RL) sampling causes significant computational waste due to excessive padding in batched inference. Existing methods rely on auxiliary models for static length prediction, but they incur high overhead, generalize poorly, and fail in stochastic “one-to-many” sampling scenarios. We introduce a lightweight framework that reuses the main model’s internal hidden states for efficient length prediction. Our framework features two core components: 1) Entropy-Guided Token Pooling (EGTP), which uses on-the-fly activations and token entropy for highly accurate static prediction with negligible cost, and 2) Progressive Length Prediction (PLP), which dynamically estimates the remaining length at each decoding step to handle stochastic generation. To validate our approach, we build and release ForeLen, a comprehensive benchmark with long-sequence, Chain-of-Thought, and RL data. On ForeLen, EGTP achieves state-of-the-art accuracy, reducing MAE by 29.16% over the best baseline. Integrating our methods with a length-aware scheduler yields significant end-to-end throughput gains. Our work provides a new technical and evaluation baseline for efficient LLM inference.
💡 Research Summary
The paper addresses a critical inefficiency in large language model (LLM) serving and reinforcement‑learning (RL) sampling: the “barrel effect” caused by padding shorter sequences to match the longest one in a batch. Existing solutions train lightweight auxiliary predictors that estimate output length from the prompt alone. While such predictors can be useful, they suffer from three major drawbacks. First, they are static: in stochastic “one‑to‑many” generation scenarios—common in RL where a single prompt yields many diverse completions—prompt‑only forecasts become unreliable. Second, they are trained on benchmarks like LMSYS that contain few long‑range or chain‑of‑thought examples, limiting generalization to realistic workloads. Third, they introduce extra inference cost and deployment complexity because each request must invoke a separate model.
To overcome these issues, the authors propose a framework that reuses the LLM’s own hidden states, arguing that the model’s internal representations already encode signals about how long the generation will run. The framework rests on the following components:
- Entropy‑Guided Token Pooling (EGTP) – This module extracts a single vector from the sequence of hidden states produced while encoding the prompt. For each token i, the entropy Hᵢ of the next‑token distribution is computed. The entropies are transformed into attention‑style weights wᵢ via a softmax (optionally temperature‑scaled). The pooled representation h = Σ wᵢ hᵢ therefore emphasizes tokens where the model is most uncertain, which the authors empirically show correlate with gradient‑based importance for length prediction (Pearson r = 0.451). This pooling replaces naïve mean or max pooling and incurs virtually no extra computation because it reuses activations already computed for the forward pass.
- Soft‑Label Distribution Regression – The continuous length target y is discretized into K bins. Instead of a hard one‑hot label, a soft label p is generated whose probability decays exponentially with distance from the true bin. The model predicts a distribution p̂ via a softmax head and computes a regression estimate ŷ = Σ p̂ᵢ cᵢ (cᵢ = bin centre). Training minimizes a weighted sum of cross‑entropy loss (between p and p̂) and mean‑squared error loss (between y and ŷ), controlled by a hyperparameter λ. This hybrid loss provides classification‑style stability while preserving distance‑aware regression accuracy, which is crucial for the heavy‑tailed length distributions observed in practice.
- Progressive Length Prediction (PLP) – To handle stochastic “one‑to‑many” generation, PLP produces a new length estimate at every decoding step t. It concatenates the original prompt representation h with the hidden states of the tokens generated so far, forming a dynamic input zₜ. The same regression head then predicts the remaining token count y_rem(t). By updating the forecast after each token, PLP adapts to the actual sampling trajectory, making it suitable for RL pipelines where multiple candidate responses are generated from the same prompt.
The authors also introduce ForeLen, a new benchmark that expands beyond LMSYS by including (a) long‑sequence tasks, (b) chain‑of‑thought reasoning, and (c) RL sampling data. ForeLen provides a more realistic testbed for length prediction methods.
Experimental Findings
- Across several LLMs (e.g., LLaMA‑7B, Falcon‑40B), EGTP reduces mean absolute error (MAE) by an average of 29.16 % compared with the strongest prior baseline, and by 55.09 % relative to the widely‑used SSJF‑Reg method.
- When EGTP and PLP are combined with a length‑aware scheduler that builds more homogeneous batches, overall throughput improves dramatically (reported 1.8×–2.3× speed‑ups), because padding overhead is sharply reduced.
- Ablation studies confirm that entropy‑based weighting is the primary driver of EGTP’s superiority; replacing it with uniform or mean pooling degrades performance substantially.
- PLP demonstrates robust performance in RL experiments, where static predictors fail to capture the high variance of response lengths generated from identical prompts.
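The mechanism behind these throughput gains is easy to see in miniature: grouping requests with similar predicted lengths slashes the padding that a first-come-first-served batch would incur. The sketch below is a toy illustration of length-aware batching, not the paper’s scheduler.

```python
def length_aware_batches(pred_lengths, batch_size):
    """Group request indices into batches of similar predicted length."""
    order = sorted(range(len(pred_lengths)), key=lambda i: pred_lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

def padding_waste(batches, true_lengths):
    """Padded tokens wasted: each request pads up to its batch's max length."""
    waste = 0
    for batch in batches:
        longest = max(true_lengths[i] for i in batch)
        waste += sum(longest - true_lengths[i] for i in batch)
    return waste
```

With true lengths [10, 200, 12, 198] and batch size 2, arrival-order batches pad 376 tokens, while length-sorted batches pad only 4; the better the length predictor, the closer a real scheduler gets to the sorted ideal.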
Significance and Limitations
The work shows that (i) reusing internal activations eliminates the need for separate predictor models, essentially achieving zero‑overhead length estimation; (ii) token entropy is an effective proxy for identifying the most informative parts of a prompt for length forecasting; (iii) progressive, step‑wise prediction enables dynamic adaptation in stochastic generation settings. Limitations include the focus on text‑only LLMs; extending EGTP/PLP to multimodal models or to extremely large models (>100 B parameters) remains an open question. Additionally, real‑world deployment would need integration with existing serving stacks (e.g., vLLM, TensorRT‑LLM) and evaluation of latency‑critical scenarios on edge devices.
Future Directions
- Apply EGTP and PLP to diverse tokenizers, multimodal architectures, and instruction‑tuned models.
- Scale experiments to ultra‑large LLMs to assess whether the entropy‑guided pooling remains effective when hidden states become higher‑dimensional.
- Co‑optimize the length‑aware scheduler with hardware‑aware batch formation policies to further reduce padding‑induced waste.
- Explore on‑device or low‑power implementations, measuring energy savings alongside throughput gains.
In summary, the paper introduces a novel, low‑overhead framework for LLM output‑length prediction that leverages entropy‑guided pooling of internal representations and progressive decoding‑time forecasting. By releasing the ForeLen benchmark and demonstrating substantial accuracy and throughput improvements, the authors set a new technical baseline for efficient LLM inference, especially in stochastic generation contexts such as reinforcement‑learning‑driven alignment.