A note on the impossibility of conditional PAC-efficient reasoning in large language models

Reading time: 5 minutes

📝 Original Info

  • Title: A note on the impossibility of conditional PAC-efficient reasoning in large language models
  • ArXiv ID: 2512.03057
  • Date: 2025-11-25
  • Authors: Hao Zeng

📝 Abstract

We prove an impossibility result for conditional Probably Approximately Correct (PAC)-efficient reasoning in large language models. While recent work has established marginal PAC efficiency guarantees for composite models that switch between expensive expert models and cheaper fast models, we show that conditional (pointwise) guarantees are impossible in the distribution-free setting. Specifically, for non-atomic input spaces, any algorithm achieving conditional PAC efficiency must be trivial in the sense that it defers to the expert model with probability at least $1 - \alpha$ for almost every input.

📄 Full Content

Large language models have achieved remarkable progress in complex problem-solving, but suffer from high computational costs during deployment (Kwon et al., 2023). To address this, various approaches have been proposed, including model routing (Ong et al., 2025; Dekoninck et al., 2025), speculative decoding (Leviathan et al., 2023), and adaptive reasoning strategies (Snell et al., 2024). Zeng et al. (2025) proposed PAC reasoning, which constructs a composite model $\hat{f}$ that selectively switches between an expensive expert model $f$ and a cheaper fast model $\tilde{f}$ while providing statistical guarantees on performance loss. A typical example is the thinking-nonthinking paradigm, where the expert model performs extended chain-of-thought reasoning while the fast model generates direct responses.

The original PAC reasoning provides marginal guarantees, controlling the expected risk over the input distribution. A natural extension is whether we can achieve a stronger, conditional guarantee that controls the risk for each input point individually. This is analogous to the notion of object-conditional validity in conformal prediction (Vovk, 2012; Lei and Wasserman, 2014; Lei et al., 2018). However, Barber et al. (2021) established fundamental limits on distribution-free conditional predictive inference, showing that exact conditional coverage is impossible without distributional assumptions. Similar impossibility results have been explored in the context of conformal risk control (Angelopoulos et al., 2025b; Gibbs et al., 2025).

In this note, we establish an impossibility result for the PAC reasoning setting: conditional PAC efficiency implies triviality. Specifically, any algorithm achieving conditional PAC efficiency must defer to the expert model with probability at least $1 - \alpha$ for almost every input, providing no efficiency improvement.

Let $\mathcal{X}$ denote the input space and $\mathcal{Y}$ the output space. We assume data are generated from a joint distribution $P$ over $\mathcal{X} \times \mathcal{Y}$, with $P_X$ denoting the marginal distribution on $\mathcal{X}$. Given input $x \in \mathcal{X}$, the expert model $f: \mathcal{X} \to \mathcal{Y}$ produces $y = f(x)$, while the fast model $\tilde{f}: \mathcal{X} \to \mathcal{Y}$ produces $\tilde{y} = \tilde{f}(x)$. In the calibration dataset $D_{\mathrm{cal}} = \{(x_i, y_i)\}_{i=1}^{n} \sim P^n$ (i.e., $n$ independent samples from $P$), we set $y_i = f(x_i)$ to be the expert model's output. A router system constructs a composite model $\hat{f}$ via a routing function $g: \mathcal{X} \to \{0, 1\}$:

$$\hat{f}(x) = \begin{cases} f(x), & g(x) = 1, \\ \tilde{f}(x), & g(x) = 0. \end{cases}$$

The pointwise risk at input $x$ is $R(\hat{f}; x) = \ell(\hat{f}(x), f(x))$, where $\ell: \mathcal{Y} \times \mathcal{Y} \to [0, \infty)$ is a loss function. For example, in the thinking-nonthinking paradigm, we might use the 0-1 loss $\ell(\tilde{y}, y) = \mathbb{1}\{\tilde{y} \neq y\}$ to measure whether the fast model's direct response matches the expert model's reasoning-based output. A simple implementation of the routing function is the single-threshold system introduced by Zeng et al. (2025). Given a score function $s: \mathcal{X} \to \mathbb{R}$ that measures the difficulty or uncertainty of an input, the routing function takes the form

$$g(x) = \mathbb{1}\{s(x) > \tau\},$$

where the threshold $\tau$ is chosen based on the calibration dataset to control the marginal risk. Inputs with scores above the threshold are routed to the expert model, while those below are handled by the fast model.
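To make the routing concrete, here is a minimal Python sketch of a single-threshold router. The function name and the placeholder callables are illustrative, not from the paper:

```python
from typing import Any, Callable

def make_composite_model(
    expert: Callable[[Any], Any],   # expensive model f (e.g., long chain-of-thought)
    fast: Callable[[Any], Any],     # cheap model f~ (e.g., direct response)
    score: Callable[[Any], float],  # s(x): difficulty/uncertainty score
    tau: float,                     # threshold calibrated on D_cal
) -> Callable[[Any], Any]:
    """Composite model f^: route to the expert iff s(x) > tau, i.e. g(x) = 1{s(x) > tau}."""
    def composite(x):
        if score(x) > tau:  # g(x) = 1: defer to the expert model
            return expert(x)
        return fast(x)      # g(x) = 0: answer with the fast model
    return composite
```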

Definition 1 (Marginal PAC efficiency (Zeng et al., 2025)). An algorithm $\mathcal{A}$ is $(\epsilon, \alpha)$-marginally PAC efficient if for all distributions $P$,

$$\mathbb{P}_{D_{\mathrm{cal}} \sim P^n}\left( \mathbb{E}_{x \sim P_X}\left[ R(\hat{f}; x) \right] \le \epsilon \right) \ge 1 - \alpha.$$

The PAC reasoning algorithm proposed by Zeng et al. (2025) is one approach that achieves marginal PAC efficiency. It constructs a simple router based on the learn-then-test framework (Angelopoulos et al., 2025a) applied to the calibration dataset.
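As a rough illustration of how $\tau$ might be chosen, the sketch below walks candidate thresholds from safest to most permissive and keeps the largest one whose Hoeffding upper confidence bound on the empirical marginal risk stays below $\epsilon$. This is a simplified stand-in for the learn-then-test calibration, not the authors' exact procedure; it assumes losses bounded in $[0, 1]$ and uses a fixed-sequence scan:

```python
import numpy as np

def calibrate_threshold(scores, losses, eps, alpha):
    """Fixed-sequence scan for a routing threshold tau (illustrative only).

    scores[i] = s(x_i) on the calibration set; losses[i] is the fast
    model's loss against the expert output, l(f~(x_i), y_i), assumed to
    lie in [0, 1]. The composite model's marginal risk is
    E[ l(f~(x), f(x)) * 1{s(x) <= tau} ]; each candidate tau is tested
    at level alpha using a Hoeffding upper confidence bound, walking
    from the safest (smallest) tau upward and stopping at the first
    failure.
    """
    scores = np.asarray(scores, dtype=float)
    losses = np.asarray(losses, dtype=float)
    n = len(scores)
    slack = np.sqrt(np.log(1.0 / alpha) / (2.0 * n))  # Hoeffding term

    tau = -np.inf  # fallback: route every input to the expert
    for cand in np.sort(scores):  # candidates at observed score values
        emp_risk = np.mean(losses * (scores <= cand))
        if emp_risk + slack > eps:
            break          # first candidate that fails the test
        tau = cand         # largest threshold certified so far
    return tau
```

Because the empirical risk is monotone in the threshold, stopping at the first failure is equivalent to taking the largest certified candidate.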

Definition 2 (Conditional PAC efficiency). An algorithm $\mathcal{A}$ is $(\epsilon, \alpha)$-conditionally PAC efficient if for all distributions $P$ and $P_X$-almost every $x \in \mathcal{X}$,

$$\mathbb{P}_{D_{\mathrm{cal}} \sim P^n}\left( R(\hat{f}; x) \le \epsilon \right) \ge 1 - \alpha.$$

A trivial approach to achieving conditional PAC efficiency is to always use the expert model, i.e., set $g(x) = 1$ for all $x$. This guarantees $R(\hat{f}; x) = 0$ for every input, trivially satisfying the conditional PAC efficiency requirement. However, such an algorithm provides no computational savings, as it never uses the fast model. The key question is whether non-trivial algorithms, i.e., those that use the fast model with probability greater than $\alpha$ for some inputs, can achieve conditional PAC efficiency.
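The obstruction can be checked numerically: on an input where the fast model's loss exceeds $\epsilon$, the pointwise risk exceeds $\epsilon$ exactly on the calibration draws where $g(x) = 0$, so the conditional guarantee directly caps the routing probability. A toy Monte Carlo illustration (all numbers are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = 0.1
p_fast = 0.3  # hypothetical P(g(x) = 0): router picks the fast model 30% of the time
# On an input x where the fast model's loss exceeds eps, the pointwise
# risk R(f^; x) exceeds eps precisely on the draws where g(x) = 0.
g_is_fast = rng.random(100_000) < p_fast   # randomness from D_cal ~ P^n
coverage = 1.0 - g_is_fast.mean()          # P(R(f^; x) <= eps) = P(g(x) = 1)
print(f"P(R <= eps) ~ {coverage:.3f}  vs  required >= {1 - alpha}")
# ~0.700 < 0.900: any router with P(g(x) = 0) > alpha violates the
# conditional guarantee on such an input, matching Theorem 3.
```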

Theorem 3 (Impossibility). Let $\mathcal{X}$ be a non-atomic complete separable metric space. Assume the fast model $\tilde{f}$ has non-trivial loss, i.e., there exists $E \subset \mathcal{X}$ with $P_X(E) > 0$ such that $\ell(\tilde{f}(x), f(x)) > \epsilon$ for all $x \in E$. Then an algorithm $\mathcal{A}$ is $(\epsilon, \alpha)$-conditionally PAC efficient if and only if for all distributions $P$ and $P_X$-almost every $x \in \mathcal{X}$,

$$\mathbb{P}_{D_{\mathrm{cal}} \sim P^n}\left( g(x) = 0 \right) \le \alpha.$$

Remark 4. This result implies that any algorithm achieving meaningful efficiency gains cannot satisfy conditional PAC efficiency: an algorithm satisfying the condition in the theorem uses the fast model with probability at most $\alpha$ for almost every input, which means it essentially always defers to the expensive expert model and provides no efficiency improvement.

Proof of Theorem 3. (⇐) Suppose for all distributions $P$ and $P_X$-almost every $x \in \mathcal{X}$, we have $\mathbb{P}_{D_{\mathrm{cal}} \sim P^n}(g(x) = 0) \le \alpha$. Then for $P_X$-almost every $x$, since $R(\hat{f}; x) = 0$ whenever $g(x) = 1$, we get $\mathbb{P}_{D_{\mathrm{cal}} \sim P^n}(R(\hat{f}; x) > \epsilon) \le \mathbb{P}_{D_{\mathrm{cal}} \sim P^n}(g(x) = 0) \le \alpha$, so $\mathcal{A}$ is $(\epsilon, \alpha)$-conditionally PAC efficient.
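The converse direction hinges on a pointwise identity on the set $E$; here is a sketch of that computation under the standard assumption $\ell(y, y) = 0$ (our reconstruction; the full argument also uses non-atomicity of $\mathcal{X}$ to extend the conclusion from $E$ to $P_X$-almost every $x$). For $x \in E$,

$$R(\hat{f}; x) = \ell(\tilde{f}(x), f(x)) \cdot \mathbb{1}\{g(x) = 0\},$$

which exceeds $\epsilon$ exactly when $g(x) = 0$. Therefore $\mathbb{P}_{D_{\mathrm{cal}} \sim P^n}(R(\hat{f}; x) \le \epsilon) \ge 1 - \alpha$ holds at such $x$ if and only if $\mathbb{P}_{D_{\mathrm{cal}} \sim P^n}(g(x) = 0) \le \alpha$.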


Reference

This content is AI-processed based on open access ArXiv data.
