Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration


Authors: Charafeddine Mouzouni

Charafeddine Mouzouni
OPIT – Open Institute of Technology, and Cohorte AI, Paris, France. charafeddine@cohorte.co

Abstract. Given a black-box AI system and a task, at what confidence level can a practitioner trust the system's output? We answer with a reliability level—a single number per system–task pair, derived from self-consistency sampling and conformal calibration, that serves as a black-box deployment gate with exact, finite-sample, distribution-free guarantees. Self-consistency sampling reduces uncertainty exponentially; conformal calibration guarantees correctness within 1/(n + 1) of the target level, regardless of the system's errors—made transparently visible through larger answer sets for harder questions. Weaker models earn lower reliability levels (not accuracy—see Definition 2.4): GPT-4.1 earns 94.6% on GSM8K and 96.8% on TruthfulQA, while GPT-4.1-nano earns 89.8% on GSM8K and 66.5% on MMLU. We validate across five benchmarks, five models from three families, and both synthetic and real data. Conditional coverage on solvable items exceeds 0.93 across all configurations; sequential stopping reduces API costs by ∼50%.

1. Introduction

You have an AI system and a task—say, answering math questions or triaging support tickets. Before you deploy it, you need to know: how much can I trust this system? Not a vague intuition, but a concrete number with a guarantee attached. That is what this paper provides.

The idea is simple (Figure 1). For each test question, ask the AI system the same question K times (say, K = 10). Group the answers that say the same thing and rank them by frequency: the most popular answer might appear 8 out of 10 times, the runner-up 2 out of 10. This frequency ranking is the raw signal—it captures how "sure" the system is without ever looking inside its weights.
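The grouping-and-ranking step is just a frequency count over the K responses. A minimal Python sketch of the idea (the function name is illustrative, not taken from the paper's code):

```python
from collections import Counter

def ranked_consensus(samples):
    """Group identical answers and rank them by decreasing frequency."""
    counts = Counter(samples)            # e.g. {"42": 8, "37": 2}
    ranked = counts.most_common()        # sorted by decreasing count
    K = len(samples)
    return [(answer, count / K) for answer, count in ranked]

# The running example: 8 of 10 calls return "42", 2 return "37".
samples = ["42"] * 8 + ["37"] * 2
print(ranked_consensus(samples))   # [('42', 0.8), ('37', 0.2)]
```

In practice the samples would first be canonicalized (discussed later in the paper) so that "42", "42.0", and "The answer is 42" count as one answer.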
Then, a human spot-checks a small random batch—just 50–100 items. Each check takes seconds: the human sees the question and the system's top-ranked answer, and marks it right or wrong. From these quick judgments, the framework computes a single number: the reliability level. For instance, GPT-4.1 earns 94.6% reliability on grade-school math. The reliability level comes with a formal guarantee: it is a valid coverage bound that holds regardless of the AI system's systematic errors, and it requires no assumptions about the data distribution.

Crucially, the human is verifying, not labeling: the AI generates and ranks the answers; the human just confirms or rejects the top pick. There is no need to build a gold-standard dataset—fifty quick judgments are the entire human effort. A weaker system earns a lower number, not a misleading score; the framework makes trustworthiness transparent.

Date: February 26, 2026.
Key words and phrases: reliability certification, conformal prediction, self-consistency, LLM evaluation, deployment gating, uncertainty quantification.

[Figure 1 appears here.]
Figure 1. Pipeline overview. Step 1: ask the AI system the same question K times and collect its answers. Step 2: group identical answers and rank them by frequency. Step 3: a human checks a small calibration batch; the framework outputs a single reliability level (e.g. 94.6%) with a formal coverage guarantee. No model internals are needed—only API access.

More precisely, current evaluation methods each fail in a different way.
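Given the n binary spot-check outcomes, the reliability level (Definition 2.4 later in the paper) is the count of items whose top-ranked answer was confirmed, divided by n + 1. A minimal sketch, with illustrative helper name and sample data:

```python
def reliability_level(top1_correct):
    """Reliability level from n human spot-checks: the number of calibration
    items whose top-ranked answer was judged correct, divided by n + 1
    (the conformal denominator from Definition 2.4)."""
    n = len(top1_correct)
    return sum(top1_correct) / (n + 1)

# 50 spot-checks, 48 of which confirmed the top-ranked answer.
checks = [True] * 48 + [False] * 2
print(f"{reliability_level(checks):.1%}")   # 94.1%
```

Note the denominator n + 1 rather than n: with only 50 checks the reliability level is deliberately slightly below the raw spot-check accuracy, which is what makes it a valid conformal bound.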
Single-sample evaluation is unbiased but high-variance—a noisy snapshot of the agent's true capability. Naive self-consistency (mode selection) reduces variance but can amplify bias: when the agent's most frequent answer is wrong, more samples make the wrong answer look more certain. LLM-as-judge approaches [8] layer poorly characterized biases—position bias, verbosity preference, self-preference—on top of the agent's own errors, and provide no formal reliability guarantees. None of these methods simultaneously controls both variance and bias.

We introduce the reliability level (Definition 2.4)—a single number per agent–task pair that answers this question with provable guarantees. The reliability level is a black-box deployment gate: it requires only API access, no model internals, and its validity is distribution-free with finite-sample coverage guarantees. Concretely, GPT-4.1 achieves 94.6% reliability on GSM8K and 96.8% on TruthfulQA; GPT-4.1-nano achieves 89.8% on GSM8K and 66.5% on MMLU. Open-weight models (Llama 4 Maverick, Mistral Small 24B) range from 66.7% (MMLU) to 95.4% (GSM8K) across three benchmarks (Table 10). A practitioner can read these numbers directly: "we need X% reliability—which models qualify?"

The mechanism behind the reliability level combines two ingredients. Self-consistency sampling reduces variance exponentially via aggregation (Theorem 4.4). Conformal calibration provides coverage guarantees whose validity is independent of agent bias: the evaluation's coverage statement is correct regardless of the agent's systematic error profile (Theorem 7.3). Agent bias is not hidden—it is made transparently visible through larger prediction sets (Theorem 7.5). A weaker agent earns a lower reliability level, not a misleading score.
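The gating question "we need X% reliability—which models qualify?" reduces to filtering the certified levels against a required threshold. A toy sketch using the GSM8K reliability levels reported in this section (the function name is illustrative):

```python
# Certified reliability levels on GSM8K reported in the paper.
reliability = {
    "GPT-4.1": 0.946,
    "GPT-4.1-nano": 0.898,
}

def qualifying_models(reliability, required):
    """Return the models whose certified reliability level meets the bar."""
    return sorted(m for m, r in reliability.items() if r >= required)

print(qualifying_models(reliability, 0.93))   # ['GPT-4.1']
```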
A key consequence—and a diagnostic that distinguishes calibration failure from agent limitation—is that marginal coverage can fall below 1 − α only when the agent cannot solve certain items at all. This under-coverage is predicted by the theory, not a calibration artifact: conditional coverage on items the model can solve remains near-perfect (⩾ 0.93 across all models and benchmarks in our experiments). When observed coverage falls short, the framework identifies why: the gap equals the fraction of unsolvable items.

Contributions. Self-consistency sampling [1] and conformal prediction [2, 3] are individually known. Recent work [10, 17] applies conformal methods to language models using internal logits or softmax probabilities. Our contribution is a specific synthesis: a conformal score built entirely from external sample frequencies, producing a single reliability number for deployment gating—with no access to model internals. Specifically:

(1) Reliability certification for deployment gating (Definition 2.4, Table 10): the primary practical output—a single reliability level per agent–task pair that answers "at what confidence can I trust this agent?" Validated across five benchmarks, five models from three families (GPT-4.1 ladder, Llama 4 Maverick, Mistral Small), with reliability levels ranging from 66.5% to 96.8%.

(2) A black-box nonconformity score from ranked canonical consensus (Sections 4–6): the technical construction enabling the reliability level—conformal scores from the rank of the acceptable answer in the self-consistency ordering, leveraging the variance reduction of aggregation while inheriting the distribution-free validity of conformal prediction. This specific construction and its analysis (variance reduction theorems, canonicalization-induced amplification) are new.
(3) Reliability theorems (Section 7): the evaluation's coverage error is bounded by 1/(n + 1) regardless of the agent's bias profile (Theorem 7.3); prediction set size is a monotone, transparent diagnostic of agent quality (Theorem 7.5); and the method achieves lower coverage error than LLM-as-judge evaluation once the calibration set exceeds ⌈1/|b_J|⌉ − 1 (e.g., n ⩾ 19 for typical judge bias; Corollary 7.10), though the two approaches answer complementary questions (Remark 7.11).

(4) Bias–variance anatomy of LLM evaluation (Section 3): the motivating analysis—a formal decomposition showing that single-sample evaluation suffers from variance, LLM-as-judge from irreducible bias, and naive self-consistency from bias amplification.

(5) Sequential sampling with certified early stopping (Section 9): a Hoeffding-based stopping rule that reduces API cost by ∼50% with no loss in coverage.

Related Work.

Self-consistency decoding. Wang et al. [1] introduced self-consistency as an inference-time strategy: sample multiple reasoning traces and select the most frequent final answer. We repurpose self-consistency for evaluation: the ranked consensus provides raw material for conformal calibration. Cordero-Encinar and Duncan [9] certify when the majority-vote answer has stabilized; our stopping criterion (Theorem 9.1) certifies the ranking quality of the top-M candidates for conformal set construction. The two are complementary.

Conformal prediction and language models. Conformal prediction [2, 3] provides prediction sets that are distribution-free with finite-sample coverage guarantees. Recent work applies conformal methods to language models: Quach et al. [10] construct conformal sets over token sequences for language generation; Kumar et al. [17] apply conformal prediction to NLP classification tasks.
These methods define nonconformity scores from model logits or softmax probabilities—internal quantities unavailable for black-box API-based agents. Our score is constructed entirely from external ranked consensus over multiple samples, requiring no access to model internals. This black-box property is essential for evaluating proprietary LLM APIs.

Method comparison. Quach et al. and Kumar et al. achieve tighter prediction sets when softmax probabilities are available, because internal scores carry richer information than sample frequencies. Our contribution is generality: the framework applies to any black-box agent accessible only through sampling—commercial APIs, tool-using agents, multi-step pipelines—and to any task where canonicalization is feasible, including open-ended generation where token-level conformal methods do not apply.

LLM-as-judge and evaluation bias. Zheng et al. [8] established the LLM-as-judge paradigm. Subsequent studies documented systematic biases: position bias, verbosity bias, self-preference, and anchoring effects [8]. Our bias–variance analysis (Section 3) formalizes these observations: judge bias is an irreducible MSE component that does not decay with sample size (Proposition 3.4). Our framework avoids judge bias entirely by using frequency-based ranking rather than quality scoring.

Uncertainty quantification for LLMs. SelfCheckGPT [15] detects hallucinations via inter-sample agreement. Semantic entropy [18] clusters generations and computes entropy over semantic classes. ConU [16] applies conformal prediction using token-level probabilities. Although some APIs expose logprobs, ConU's conformal guarantee requires calibrated token-level probabilities—a condition that is unverifiable for proprietary systems and unreliable even when logprobs are nominally available.
Our rank-based nonconformity score is the first to provide formal conformal guarantees without any model internals, enabling calibration for any system accessible only through sampling. All three prior methods produce uncertainty estimates without formal calibration guarantees. Our framework produces calibrated prediction sets with finite-sample coverage guarantees, and uniquely pairs this black-box score with a deployment-gating output—the reliability level—a single actionable number that no prior conformal-LLM method provides. Semantic entropy could serve as a complementary pre-filter within conformal calibration.

Calibration of probabilistic predictions. Classical calibration methods (Platt scaling, temperature scaling, isotonic regression) adjust model confidence scores to match empirical frequencies. These require access to model probabilities and assume a fixed model; they cannot handle open-ended generation where the output space is combinatorial. Conformal prediction, by contrast, is model-agnostic and makes no distributional assumptions, which is why we adopt it as the calibration backbone.

2. Problem Setting

We formalize the setting: an AI system receives a query, produces a stochastic answer, and a task-specific predicate decides whether that answer is acceptable. The goal is to certify, from a small calibration set, how reliably the system's top-ranked answer is acceptable. Our experiments validate the framework on single-turn query-answering systems; extensions to multi-turn interactions are discussed in Section 12.

2.1. Queries, answers, and acceptability. Let X denote the space of queries and A the space of possible answers. An AI system with parameters θ defines a stochastic mapping

    f_θ : X → A,    f_θ(x) ∼ P_θ(· | x),

where P_θ(· | x) is the system's output distribution conditional on query x, accessible via repeated sampling.

Definition 2.1 (Acceptability).
For a query x ∈ X, the set of acceptable answers is

    A⋆(x) := { a ∈ A : Accept(x, a) = 1 },

where Accept : X × A → {0, 1} is a task-dependent predicate. We allow |A⋆(x)| ⩾ 1. For example, on a math problem with answer 42, Accept returns 1 for "42", "42.0", and "forty-two", and 0 for "43".

Assumption 2.2 (Non-triviality). For each query x in the target distribution, A⋆(x) ≠ ∅.

2.2. Per-query acceptability rate and agent quality. The central quantity is the probability that a single random sample from the agent is acceptable—intuitively, how often the agent "gets it right" on a given question.

Definition 2.3 (Per-query acceptability rate). For a fixed query x, the agent's acceptability rate is

(2.1)    p⋆(x) := P_θ( f_θ(x) ∈ A⋆(x) ) = Σ_{a ∈ A⋆(x)} P_θ(a | x).

The aggregate accuracy of the agent over a query distribution µ on X is

(2.2)    p̄ := E_{X∼µ}[ p⋆(X) ].

The quantities p⋆(x) and p̄ are the ground truth that any evaluation method seeks to estimate or certify. Our framework does not estimate p̄ directly; instead, it provides a coverage guarantee that is a stronger, more actionable statement about reliability.

2.3. The evaluation goal. We seek a procedure that, for each query x, returns a prediction set S(x) ⊂ C of canonical answer classes such that

(2.3)    P( Y(x) ∈ S(x) ) ⩾ 1 − α,

where Y(x) denotes an acceptable answer for x and α ∈ (0, 1) is a user-specified miscoverage level. The set S(x) should be as small as possible: a smaller set indicates a more reliable agent on that query. We now define the target quantity that summarizes this coverage guarantee into a single deployment metric: the highest confidence level at which the agent's most frequent answer is trustworthy.

Definition 2.4 (Reliability level). For a given agent and task, the reliability level is

(2.4)    1 − α⋆ := |{ i ∈ {1, . . . , n} : s_i ⩽ 1 }| / (n + 1),

where s_1, . . . , s_n are calibration nonconformity scores (formally defined in Section 6; intuitively, s_i is the rank of the correct answer among the model's self-consistency candidates for item i). Equivalently, 1 − α⋆ is the maximum confidence at which the self-consistency mode alone provides conformal coverage.

Remark 2.5 (Interpreting the reliability level). The numerator of (2.4) counts calibration items where the self-consistency mode is correct, closely related to mode accuracy but with the conformal correction n + 1 in the denominator that ensures a valid conformal quantile. Rather than asking "does this model achieve coverage at a fixed α?", we invert: "at what confidence level does this model qualify?" By Theorem 7.1, the reliability level is a lower bound on test-time mode-voting coverage: deploying with α = α⋆ guarantees P(Y ∈ S(x)) ⩾ 1 − α⋆. The gap between the reliability level and empirical test coverage is at most 1/(n + 1) (e.g., ⩽ 0.002 for n = 500).

3. Bias and Variance in LLM Evaluation

Before presenting our solution, we develop a formal framework for understanding why LLM evaluation is unreliable, and what "reliable evaluation" requires mathematically. This analysis motivates every design choice in our framework.

3.1. What does reliable evaluation require? There are three ways to assess an answer: declare it correct or not (binary), assign a quality score (continuous), or output a set of candidates guaranteed to contain the truth (set-valued). Each has a different error profile.

Definition 3.1 (Evaluation method). An evaluation method E is a procedure that, given a query x and access to the agent f_θ, produces an assessment. We consider three types:

(1) Point assessment: E(x) ∈ {0, 1} (correct/incorrect).
(2) Score assessment: E(x) ∈ [0, 1] (quality score).
(3) Set assessment: E(x) = S(x) ⊂ C (prediction set with coverage guarantee).

For point and score assessments, reliability is measured by the mean squared error between the assessment and the true acceptability:

(3.1)    MSE(E) = E[ (E(X) − p⋆(X))² ].

The classical bias–variance decomposition applies:

(3.2)    MSE(E) = E[ (E[E(X) | X] − p⋆(X))² ]  [Bias²]  +  E[ Var(E(X) | X) ]  [Variance].

For set assessments, reliability is instead captured by the coverage validity gap:

(3.3)    Gap(E) := | P(Y ∈ S(X)) − (1 − α) |.

A method with Gap(E) = 0 achieves exact calibration.

3.2. Error anatomy of current evaluation methods. We now analyze the bias and variance of three standard approaches, establishing the precise deficiencies that our framework addresses.

3.2.1. Single-sample evaluation. The simplest evaluation draws one sample a ∼ P_θ(· | x) and checks acceptability:

    E₁(x) := 1{ a ∈ A⋆(x) },    a ∼ P_θ(· | x).

Proposition 3.2 (Bias–variance of single-sample evaluation). The single-sample evaluator satisfies:

(3.4)    Bias(E₁(x)) = 0,
(3.5)    Var(E₁(x) | x) = p⋆(x)(1 − p⋆(x)).

The variance is maximized at p⋆(x) = 1/2 (the hardest queries) and equals 1/4.

Proof. E[E₁(x) | x] = P_θ(a ∈ A⋆(x)) = p⋆(x), so the bias is zero. The variance follows from Var(Bernoulli(p)) = p(1 − p). □

Remark 3.3 (Unbiased but unreliable). Single-sample evaluation is unbiased but has maximum variance exactly where it matters most: on queries where the agent is uncertain (p⋆(x) ≈ 1/2). For a single query, the evaluation is essentially a coin flip when the agent is mediocre. This variance does not decrease without additional samples.

3.2.2. LLM-as-judge evaluation.
An LLM judge J scores the agent's output:

    E_J(x) := J(x, a),    a ∼ P_θ(· | x),

where J : X × A → [0, 1] is stochastic (the judge itself has sampling variance).

Proposition 3.4 (Bias–variance of LLM-as-judge evaluation). Define the judge's systematic bias on query x with answer a as

(3.6)    b_J(x, a) := E[J(x, a)] − Accept(x, a).

Then:

(3.7)    Bias²(E_J(x)) = ( E_a[b_J(x, a)] )²,
(3.8)    Var(E_J(x) | x) = Var_a(Accept(x, a))  [agent variance]  +  E_a[Var(J(x, a) | x, a)]  [judge variance]  +  Var_a(b_J(x, a))  [bias variance].

Proof. Write E_J(x) = Accept(x, a) + b_J(x, a) + η_J(x, a), where η_J is the zero-mean judge noise. By the law of total variance:

    E[E_J(x) | x] = E_a[Accept(x, a) + b_J(x, a)] = p⋆(x) + E_a[b_J(x, a)].

Hence Bias(E_J(x)) = E_a[b_J(x, a)], giving (3.7). The variance decomposes by conditioning on a and using independence of η_J from the other terms. □

Remark 3.5 (Judge bias is irreducible). The critical flaw of LLM-as-judge is that the bias term E_a[b_J(x, a)] does not decrease with more judge calls or more agent samples. If the judge systematically overrates verbose answers or underrates unconventional but correct solutions, this error persists regardless of sample size. Furthermore, b_J is unknown and difficult to estimate without ground-truth labels—the very thing evaluation is meant to replace.

3.2.3. Naive self-consistency (mode selection). Draw K samples, canonicalize, and check whether the most frequent answer is acceptable:

    E_SC(x) := 1{ c_(1) ∈ Canon(x, A⋆(x)) }.

Proposition 3.6 (Bias–variance of mode selection). Let p := p⋆(x) be the acceptability rate under canonicalization. If p > 1/2:

(3.9)     Bias(E_SC(x)) = 0,
(3.10)    Var(E_SC(x) | x) ⩽ exp( −2K(p − 1/2)² ).

If p ⩽ 1/2 (systematic bias):

(3.11)    Bias(E_SC(x)) → −p  as K → ∞,
(3.12)    Var(E_SC(x) | x) → 0  as K → ∞.

Proof. When p > 1/2, the mode is the correct class with probability ⩾ 1 − exp(−2K(p − 1/2)²) by Hoeffding's inequality [13], giving zero asymptotic bias and exponentially decaying variance. When p ⩽ 1/2, the total mass on incorrect canonical classes is 1 − p ⩾ 1/2. By the law of large numbers, the empirical frequency of each class converges to its true probability as K → ∞. Let q_max := max_{c ∉ Canon(x, A⋆(x))} P_θ(c | x) be the mass of the most probable incorrect class. If q_max > p, the mode is this incorrect class a.s. If q_max ⩽ p, then p ⩽ 1/2 implies the correct class shares the maximum with at least one incorrect class; by our tie-breaking convention (uniform random), the mode is incorrect with positive probability, and by symmetry as K → ∞ it is incorrect a.s. whenever the correct class is not the unique mode. In either case, E_SC(x) → 0 a.s. while p⋆(x) = p > 0, yielding Bias → −p. □

Table 1. Bias–variance profile of evaluation methods (per-query). K = number of samples, n = calibration set size.

| Method                  | Bias                                  | Variance                      | Error → 0?            | Guarantee          |
|-------------------------|---------------------------------------|-------------------------------|-----------------------|--------------------|
| Single sample           | 0                                     | p⋆(1 − p⋆)                    | No                    | —                  |
| LLM-as-judge            | E_a[b_J] ≠ 0 (irreducible)            | Prop. 3.4 †                   | No (bias persists)    | —                  |
| Self-consistency (mode) | 0 if p⋆ > 1/2; → −p⋆ if p⋆ ⩽ 1/2      | ⩽ exp(−2K(p⋆ − 1/2)²)         | No (bias if p⋆ ⩽ 1/2) | —                  |
| Ours (conformal)        | Coverage gap ⩽ 1/(n + 1)              |                               | Yes (→ 0 as n → ∞)    | P(Y ∈ S) ⩾ 1 − α   |

† Three-term decomposition: agent variance + judge variance + bias variance; does not vanish with more samples.

Remark 3.7 (The self-consistency double-edged sword).
Proposition 3.6 reveals a fundamental asymmetry: self-consistency is excellent for variance reduction when the agent is mostly correct (p > 1/2), but it amplifies confidence in errors when the agent is systematically wrong (p ⩽ 1/2). More samples make the wrong answer look more certain. This is the "stable hallucination" phenomenon. Any framework that uses self-consistency must account for this failure mode.

Remark 3.8 (The i.i.d. sampling assumption). Proposition 3.6 and Theorem 4.4 assume a_1, . . . , a_K i.i.d. ∼ P_θ(· | x). This is well-justified when each sample is an independent, stateless API call to a fixed model at temperature T > 0 with deterministic canonicalization (e.g., code execution, string normalization). In practice, positive correlation ρ > 0 between samples can arise from model-version drift during data collection, stochastic canonicalization (e.g., an LLM judge), or infrastructure-level batching effects. Under ρ-correlated samples the effective sample size drops to K_eff ≈ K/(1 + (K − 1)ρ), and the Hoeffding bound degrades to exp(−2 K_eff (p − 1/2)²). The qualitative conclusion—more samples reduce variance—survives, but at a slower rate. Crucially, the conformal coverage guarantee (Theorem 6.7) depends only on exchangeability of the calibration and test data, not on i.i.d. agent samples, and is therefore unaffected.

3.3. The bias–variance landscape: a summary. The key observation from Table 1 is that no existing method simultaneously controls bias and variance while providing a formal guarantee. Our framework achieves this by (i) using self-consistency for variance reduction, (ii) outputting a set rather than a point to absorb residual bias, and (iii) calibrating the set via conformal prediction to obtain an exact, distribution-free coverage guarantee.

4. Self-Consistency Sampling and Canonicalization

4.1. Self-consistency sampling.
For a fixed query x ∈ X, we draw K independent samples from the agent:

(4.1)    a_1, a_2, . . . , a_K i.i.d. ∼ P_θ(· | x).

The empirical distribution over raw answers is

    P̂_K(a | x) = (1/K) Σ_{i=1}^{K} 1{ a_i = a }.

4.2. Canonicalization. Different surface forms can express the same answer—"42," "42.0," and "The answer is 42" all mean the same thing. Canonicalization maps these to a single representative so that frequency counts reflect semantic agreement, not superficial variation.

Definition 4.1 (Canonicalization). A canonicalization function is a mapping Canon : X × A → C, where C is a space of canonical representations. We require that Canon respects semantic equivalence: if a ≃ a′ semantically, then Canon(x, a) = Canon(x, a′).

Applying canonicalization yields the empirical canonical distribution:

(4.2)    P̂_K(c | x) = (1/K) Σ_{i=1}^{K} 1{ Canon(x, a_i) = c }.

4.3. Canonicalization: implementation and stability. Since the quality of the consensus vote depends directly on the quality of Canon, we treat canonicalization as a first-class algorithmic component. Three regimes arise in practice:

(1) Deterministic canonicalization for structured answers—parse "42.0" and "42" to the same integer, match option letters in MCQ tasks, or execute code and record pass/fail. This eliminates surface-form variation entirely.
(2) Embedding-based clustering for open-ended tasks, using cosine similarity thresholds on dense embeddings to group semantically equivalent responses.
(3) LLM-assisted canonicalization, where a lightweight model (e.g., GPT-4.1 at temperature 0) classifies each response as correct or incorrect before clustering.

Each approach involves specific failure modes (over-merging vs. under-merging). Implementation details, stability diagnostics, and empirical requirements are in Appendix A.

Assumption 4.2 (Canonicalization quality).
Throughout the theoretical analysis (Sections 4.4–7), we assume that Canon correctly maps semantically equivalent answers to the same canonical class. When this assumption is violated, the variance reduction guarantees (Theorem 4.4) degrade gracefully: under-merging reduces p_canon, increasing the required K, while the conformal coverage guarantee (Theorem 6.7) remains valid regardless (set sizes simply grow to compensate).

Remark 4.3 (LLM-based canonicalization and circularity). When canonicalization uses an LLM judge (regime (iii) above), a potential circularity arises: if the same model family serves as both agent and canonicalizer, systematic biases could propagate. For instance, the judge might systematically group incorrect answers into a "correct" cluster due to self-preference or verbosity bias, corrupting the consensus vote. Three considerations mitigate this concern:

(a) Functional separation. In our experiments, the canonicalizer (GPT-4.1 at temperature 0, producing deterministic binary labels) operates in a fundamentally different regime from the agent (GPT-4.1 at temperature 0.7, generating stochastic free-form responses). At T = 0, the judge is a fixed deterministic function—analogous to code execution or regex matching—not a stochastic evaluator. The bias profiles of deterministic classification and stochastic generation are distinct.

(b) Coverage guarantee is robust to canonicalization errors. Assumption 4.2 states the key safeguard: if the canonicalizer makes errors (merging incorrect answers with correct ones, or splitting correct answers), the conformal coverage guarantee (Theorem 6.7) remains valid. Canonicalization errors manifest as inflated prediction set sizes, not as invalid coverage—the framework honestly reflects that canonicalization is noisy by returning larger sets.

(c) Cross-family validation breaks the self-preference loop.
Our open-weight experiments provide a direct test: Llama 4 Maverick and Mistral Small 24B are canonicalized by GPT-4.1—a different model family with different training data and biases. If within-family self-preference were corrupting canonicalization, cross-family results would diverge. They do not: coverage (⩾ 0.960) and conditional coverage (⩾ 0.949) are consistent across all three families (Table 3).

For maximal rigor, we recommend deterministic canonicalization (code execution, numeric extraction, option matching) wherever feasible, reserving LLM-based canonicalization for tasks where no alternative exists. Three of our five benchmarks use deterministic canonicalization.

4.4. Variance reduction via consensus aggregation. We now prove that self-consistency sampling achieves exponential variance reduction for identifying the correct answer, and that canonicalization further amplifies this effect.

Theorem 4.4 (Exponential variance reduction via consensus). Let c⋆ ∈ C be the unique acceptable canonical class for query x, and let p := P_θ( Canon(x, f_θ(x)) = c⋆ ) > 1/2. Then after K i.i.d. samples:

(1) Mode correctness:

(4.3)    P( c_(1) ≠ c⋆ ) ⩽ exp( −2K(p − 1/2)² ).

(2) Rank concentration: for any r ⩾ 2,

(4.4)    P( rank(c⋆; x) ⩾ r ) ⩽ C(|C|, r − 1) · exp( −K( p − (1 − p)/(r − 1) )² / 2 ),

where |C| is the number of distinct canonical classes, C(|C|, r − 1) denotes the binomial coefficient, and the bound is nontrivial when p > (1 − p)/(r − 1).

(3) Variance decay:

(4.5)    Var( 1{ c_(1) = c⋆ } ) ⩽ exp( −2K(p − 1/2)² ).

Proof. Part 1. The correct class c⋆ has count N⋆ = Σ_{i=1}^{K} 1{ c_i = c⋆ } ∼ Bin(K, p). The mode is incorrect only if some other class has count ⩾ N⋆. Since the total count of all incorrect classes is K − N⋆, a necessary condition is K − N⋆ ⩾ N⋆, i.e., N⋆ ⩽ K/2. By Hoeffding's inequality:

    P( N⋆ ⩽ K/2 ) = P( N⋆/K ⩽ 1/2 ) ⩽ exp( −2K(p − 1/2)² ).

Part 2.
rank(c⋆) ⩾ r requires at least r − 1 classes to beat c⋆'s count. Fix any set T of r − 1 incorrect classes. Their combined mass is p_T ⩽ 1 − p. For each c′ ∈ T to beat c⋆, we need n(c′) > n(c⋆), which in particular requires the average count of T to exceed Kp/(r − 1). By Hoeffding applied to the sum of counts in T:

    P( all c′ ∈ T beat c⋆ ) ⩽ P( (1/K) Σ_{c′ ∈ T} n(c′) > p ) ⩽ exp( −K(p − p_T)² / 2 ).

A union bound over the C(|C|, r − 1) choices of T gives (4.4).

Part 3. Let q_K := P( c_(1) = c⋆ ) ⩾ 1 − exp(−2K(p − 1/2)²). Then

    Var( 1{ c_(1) = c⋆ } ) = q_K(1 − q_K) ⩽ 1 − q_K ⩽ exp( −2K(p − 1/2)² ). □

Corollary 4.5 (Sample complexity for δ-reliable mode identification). To ensure P( c_(1) = c⋆ ) ⩾ 1 − δ, it suffices to take

(4.6)    K ⩾ ln(1/δ) / ( 2(p − 1/2)² ).

For example, with p = 0.7 and δ = 0.01: K ⩾ ⌈ln(100)/(2 · 0.04)⌉ = 58.

4.5. Canonicalization as variance reduction. Canonicalization does more than enable aggregation—it provably reduces the variance of the consensus by consolidating fragmented probability mass.

Proposition 4.6 (Canonicalization amplifies consensus). Let p_raw := max_{a ∈ A⋆(x)} P_θ(a | x) be the probability of the most likely acceptable raw answer, and p_canon := P_θ( Canon(x, f_θ(x)) ∈ Canon(x, A⋆(x)) ) the probability of the acceptable canonical class. Then:

(4.7)    p_canon ⩾ p_raw,

with equality only when canonicalization is the identity. If L raw answers map to the same acceptable canonical class, each with probability ⩾ p_min, then p_canon ⩾ L · p_min. Consequently, the variance reduction exponent improves from 2K(p_raw − 1/2)² to 2K(p_canon − 1/2)², and the sample complexity (4.6) decreases by a factor of ( (p_raw − 1/2)/(p_canon − 1/2) )².

Proof.
By definition, p_canon = Σ_{a : Canon(x,a) = c⋆} P_θ(a | x) ⩾ max_a P_θ(a | x) · 1{a ∈ A⋆(x)} = p_raw. The improvement in the Hoeffding exponent follows directly from substituting p_canon for p_raw in (4.3). □

Remark 4.7 (Canonicalization can change the p ⩽ 1/2 regime). A crucial practical consequence: an agent may have p_raw < 1/2 (no single raw answer dominates) but p_canon > 1/2 (the correct canonical class dominates after consolidation). Canonicalization can transform a regime where self-consistency amplifies bias into one where it reduces variance. This provides a formal justification for investing in high-quality canonicalization.

4.6. Ranked consensus. We rank distinct canonical answers by decreasing empirical frequency:
(4.8)  rank(c; x) := |{c′ ∈ C(x) : P̂_K(c′ | x) > P̂_K(c | x)}| + 1,
with ties broken uniformly at random, producing an ordering c_(1), c_(2), . . . .

Definition 4.8 (Consensus strength and margin). The consensus strength is φ(x) := P̂_K(c_(1) | x) and the consensus margin is ∆(x) := P̂_K(c_(1) | x) − P̂_K(c_(2) | x).

Consensus strength measures the model's overall confidence (does one answer dominate, or do many answers tie?). Consensus margin measures the gap between the top two—a large margin means the model strongly favors one answer. Both quantities feed into the sequential stopping rule (Section 9). With the ranked consensus in hand, we now construct prediction sets that adapt their size to each query's difficulty.

5. SET-VALUED PREDICTIONS

Instead of committing to a single answer, we output the top M most frequent candidates. For easy questions, M = 1 suffices; for hard ones, a larger M is needed. The key question is how to choose M with a guarantee—that is the role of conformal calibration in the next section. Given ranked canonical answers c_(1), c_(2), . . .
for query x, define the top-M prediction set:
(5.1)  S_M(x) := {c_(1), c_(2), . . . , c_(M)}.
The family {S_M(x)}_{M⩾1} is nested: S_1(x) ⊆ S_2(x) ⊆ · · · . A fixed M ignores the varying difficulty of queries. We seek a data-driven procedure that selects M = M(x) adaptively with a formal coverage guarantee—the setting of conformal prediction.

6. CONFORMAL CALIBRATION

6.1. Background: split conformal prediction. Conformal prediction [2, 3] is a distribution-free framework for constructing prediction sets with guaranteed marginal coverage. Its single requirement is that calibration and test data are interchangeable—their joint distribution does not depend on ordering.

Definition 6.1 (Exchangeability). Random variables Z_1, . . . , Z_{n+1} are exchangeable if their joint distribution is invariant under all permutations σ: (Z_1, . . . , Z_{n+1}) =_d (Z_{σ(1)}, . . . , Z_{σ(n+1)}).

6.2. Nonconformity scores from ranked consensus. The nonconformity score measures how "surprising" the correct answer is in the ranked list. If the correct answer is the most frequent (rank = 1), the model is confident and the score is low. If the correct answer is buried at rank 5, the score is high—the model's consensus disagrees with the truth.

Definition 6.2 (Rank-based nonconformity score). Let x be a query, {c_(1), c_(2), . . .} the ranked canonical answers from K self-consistency samples, and y ∈ A⋆(x) an acceptable answer with canonical form c_y := Canon(x, y). The rank-based nonconformity score is
(6.1)  s(x, y) := rank(c_y; x) = min{r ∈ ℕ : c_y ∈ S_r(x)}.
If c_y ∉ C(x), set s(x, y) := +∞.

Definition 6.3 (Cumulative-probability nonconformity score). An alternative score incorporating frequency information:
(6.2)  s_cp(x, y) := Σ_{r=1}^{rank(c_y; x)} P̂_K(c_(r) | x) ∈ [0, 1].

6.3. The calibration procedure.

Assumption 6.4 (Calibration data).
We have a calibration dataset D_cal = {(x_i, y_i)}_{i=1}^n where each y_i ∈ A⋆(x_i), drawn exchangeably with the test example (x_{n+1}, y_{n+1}).

Remark 6.5 (When exchangeability holds and when it breaks). Exchangeability is satisfied when calibration and test examples are drawn i.i.d. from the same distribution—the standard setting of benchmark evaluation with random train/test splits. It also holds under random subsampling from any fixed dataset, regardless of how the dataset was originally constructed. Exchangeability can be violated in several practically relevant scenarios:
(1) Curated benchmark sets: if items were hand-selected to emphasize difficult cases or specific capabilities, the calibration and test distributions may differ systematically.
(2) Temporal drift: if the agent is updated between calibration and deployment, the score distribution shifts. Periodic recalibration (re-running the calibration procedure on fresh data from the updated agent) is the standard mitigation.
(3) Adversarial construction: if test queries are chosen adversarially after observing the calibration set, exchangeability fails by design. This is outside our threat model.
When exchangeability is only approximately satisfied (e.g., mild covariate shift between calibration and test), weighted conformal prediction (Section 10.3) provides a principled correction by reweighting calibration scores according to the likelihood ratio p_test(x)/p_cal(x), preserving coverage under the test distribution.

For each calibration example (x_i, y_i):
(1) Draw K self-consistency samples for query x_i; canonicalize and rank.
(2) Compute s_i := s(x_i, y_i).

Definition 6.6 (Conformal threshold). Define k := ⌈(n + 1)(1 − α)⌉ and
(6.3)  M⋆ := s_(k),
the k-th smallest calibration score. For a test query x_{n+1}, return:
(6.4)  S(x_{n+1}) := S_{M⋆}(x_{n+1}) = {c_(1), . . . , c_(M⋆)}.
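The threshold of Definition 6.6 reduces to a sort and an order statistic. The sketch below is our own minimal illustration (the function name `conformal_threshold` is not from the paper's code); it returns +∞ when the calibration set is too small to support the requested α:

```python
import math

def conformal_threshold(scores, alpha):
    """Split-conformal threshold M* of Definition 6.6: the k-th smallest
    calibration score, with k = ceil((n + 1) * (1 - alpha))."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points: no finite score is conservative enough.
        return math.inf
    return sorted(scores)[k - 1]

# With 90 calibration items scored 1 and 10 scored 2 at alpha = 0.10,
# k = ceil(101 * 0.9) = 91, so M* is the 91st smallest score, i.e. 2.
```

With rank-based scores (Definition 6.2), the returned M⋆ is directly the prediction-set size used at test time.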
6.4. Finite-sample coverage guarantee.

Theorem 6.7 (Marginal coverage guarantee). Let (x_1, y_1), . . . , (x_n, y_n), (x_{n+1}, y_{n+1}) be exchangeable, and let s_i = s(x_i, y_i). Define M⋆ as in (6.3). Then:
(6.5)  P(y_{n+1} ∈ S_{M⋆}(x_{n+1})) ⩾ 1 − α.
If the scores have no ties a.s.:
(6.6)  1 − α ⩽ P(s_{n+1} ⩽ M⋆) ⩽ 1 − α + 1/(n + 1).

Proof. By exchangeability, the scores s_1, . . . , s_{n+1} are exchangeable. The rank of s_{n+1} among {s_1, . . . , s_{n+1}} is uniformly distributed over {1, . . . , n + 1}. Define k := ⌈(n + 1)(1 − α)⌉. The event {s_{n+1} ⩽ M⋆} occurs when s_{n+1}'s rank is at most k. In the no-ties case:
P(s_{n+1} ⩽ M⋆) = k/(n + 1) = ⌈(n + 1)(1 − α)⌉/(n + 1).
Since ⌈(n + 1)(1 − α)⌉ ⩾ (n + 1)(1 − α), we get P ⩾ 1 − α. Since ⌈(n + 1)(1 − α)⌉ ⩽ (n + 1)(1 − α) + 1, we get P ⩽ 1 − α + 1/(n + 1). With ties, coverage can only increase. □

6.5. Adaptive prediction sets. Using the cumulative-probability score s_cp and its conformal quantile q̂_{1−α}^{cp}:
(6.7)  S_cp(x) := {c_(r) : r = 1, . . . , R(x)},  R(x) := min{r : Σ_{j=1}^{r} P̂_K(c_(j) | x) ⩾ q̂_{1−α}^{cp}}.
For high-consensus queries, R(x) is small; for diffuse queries, it is larger. Coverage is preserved by the same conformal argument.

7. RELIABILITY GUARANTEES

Conformal set-valued evaluation is exactly calibrated: the evaluation's coverage statement has bounded error. It is also transparently diagnostic of agent quality—agent bias is surfaced through prediction set size rather than hidden. These two properties distinguish the agent's systematic errors (diagnosed but not fixed) from the evaluation's systematic errors (provably controlled).

7.1. Conservative coverage guarantee. The evaluation's coverage error shrinks to zero as the calibration set grows—unlike point estimators, where bias persists indefinitely.
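This behavior is easy to probe numerically: draw n + 1 exchangeable scores, compute M⋆ from the first n, and check how often the held-out score falls at or below it. A Monte Carlo sketch under a continuous score distribution (our own illustration, not the paper's experiment code):

```python
import math
import random

def coverage_experiment(n=200, trials=2000, alpha=0.10, seed=0):
    """Estimate P(s_{n+1} <= M*) over repeated exchangeable draws.
    The marginal guarantee says this probability is at least 1 - alpha;
    with continuous scores it also cannot exceed 1 - alpha + 1/(n + 1)."""
    rng = random.Random(seed)
    k = math.ceil((n + 1) * (1 - alpha))
    hits = 0
    for _ in range(trials):
        scores = [rng.random() for _ in range(n + 1)]  # i.i.d. => exchangeable
        cal, test = scores[:n], scores[n]
        m_star = sorted(cal)[k - 1] if k <= n else math.inf
        hits += test <= m_star
    return hits / trials
```

Running this yields empirical coverage near the nominal 0.90, up to Monte Carlo noise, as the guarantee predicts.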
Theorem 7.1 (Coverage error control). Under the conditions of Theorem 6.7, the coverage gap satisfies
(7.1)  0 ⩽ P(Y_{n+1} ∈ S(X_{n+1})) − (1 − α) ⩽ 1/(n + 1).
That is, the evaluation is conservative (never under-covers) and the over-coverage vanishes as n → ∞.

Proof. This is immediate from (6.6): the lower bound gives conservativeness, and the upper bound gives the 1/(n + 1) slack. □

Remark 7.2 (Contrast with score-based evaluation). Conformal calibration is the only method whose evaluation error vanishes with sample size. For single-sample evaluation, the error analog is Var(p⋆(X)), which depends on the query distribution and does not shrink. For LLM-as-judge, the error includes the irreducible judge bias E[b_J] (Remark 3.5). Neither can be eliminated by collecting more data.

7.2. Bias immunity: coverage holds regardless of agent quality.

Theorem 7.3 (Bias immunity of conformal coverage). The coverage guarantee (6.5) holds for any agent f_θ, regardless of:
(1) the agent's per-query acceptability rate p⋆(x) (including p⋆(x) = 0),
(2) the presence of systematic bias or stable hallucinations,
(3) the output distribution P_θ(· | x),
(4) the number of acceptable answers |A⋆(x)|.
Formally: let Θ denote the space of all possible agent parameters. Then
(7.2)  inf_{θ∈Θ} P_θ(Y_{n+1} ∈ S(X_{n+1})) ⩾ 1 − α,
where P_θ denotes the joint distribution over calibration data, agent samples, and the test example under agent θ.

Proof. The proof of Theorem 6.7 uses only exchangeability of (x_i, y_i), which is a property of the data-generating process, not of the agent. The agent enters only through the nonconformity scores s_i, and the conformal argument is valid for any score distribution. Specifically: for any fixed θ, the scores s_1, . . . , s_{n+1} are determined by the exchangeable data (x_1, y_1), . . .
, (x_{n+1}, y_{n+1}) and the independent agent samples {a_j^(i)}. Since agent samples for different queries are independent, and the (x_i, y_i) are exchangeable by assumption, the scores remain exchangeable. The uniform rank argument then yields P_θ(s_{n+1} ⩽ M⋆) ⩾ 1 − α for every θ. □

Remark 7.4 (What "bias immunity" means and does not mean). Theorem 7.3 is about the evaluation's accuracy, not the agent's quality. It does not fix or remove agent bias. It guarantees that the coverage statement—"an acceptable answer lies in S(x) with probability ⩾ 1 − α"—is true regardless of how biased the agent is. Concretely: a completely broken agent earns M⋆ = +∞ (honestly reflecting failure), while a strong agent earns M⋆ = 1. In both cases the coverage guarantee holds. A single calibration procedure is valid for any agent, including ones with unknown or adversarial bias profiles.

7.3. Bias transparency: set size as honest quality diagnostic. The prediction set size |S(x)| is a diagnostic signal that faithfully reflects the agent's quality. The following theorem makes this precise: (1) better agents get smaller sets; (2) perfect agents get singletons; (3) agents that cannot solve a task get infinitely large sets, honestly reflecting failure; (4) expected set size scales with average answer rank.

Theorem 7.5 (Bias transparency—set size reflects agent quality). Let M⋆(θ, α) denote the conformal threshold for agent θ at level α. Then:
(1) Monotonicity in agent quality: if agent θ_1 stochastically dominates agent θ_2 in the sense that s_i^(θ_1) ⩽_st s_i^(θ_2) for each calibration example i (i.e., θ_1 consistently produces acceptable answers at higher ranks), then
(7.3)  M⋆(θ_1, α) ⩽ M⋆(θ_2, α) almost surely.
(2) Perfect agent: if p⋆(x) = 1 for all x in the calibration distribution (the agent always produces acceptable answers), then s_i = 1 for all i, and M⋆ = 1.
(3) Biased agent: let β := P_X(p⋆(X) < 1/K) be the fraction of queries where the agent is unlikely to produce an acceptable answer. If β > α, then M⋆ = +∞ (no finite prediction set suffices at level α).
(4) Set size–bias correspondence: the expected set size satisfies
(7.4)  E[|S(X)|] ⩾ E[rank(Canon(X, Y(X)); X)] · (1 − α) − O(1/n).

Proof. Part 1. If s_i^(θ_1) ⩽_st s_i^(θ_2) for each i, then the k-th order statistic satisfies s_(k)^(θ_1) ⩽ s_(k)^(θ_2) stochastically.

Part 2. If p⋆(x) = 1, then with probability 1 all K samples are acceptable, so the acceptable canonical class has rank 1: s_i = 1 for all i, and M⋆ = s_(k) = 1.

Part 3. If p⋆(x_i) < 1/K, then P(s_i = +∞) ⩾ (1 − 1/K)^K ⩾ e^{−1} − o(1) > 0. When β > α, the number of infinite scores exceeds nβ > nα, so the ⌈(n + 1)(1 − α)⌉-th score is +∞.

Part 4. By definition, |S(X)| = M⋆ for the global threshold. For the adaptive version, |S(X)| ⩾ rank(c_{Y(X)}; X) whenever Y(X) ∈ S(X). Taking expectations and using P(Y ∈ S(X)) ⩾ 1 − α:
E[|S(X)|] ⩾ E[rank(c_{Y(X)}; X) · 1{Y ∈ S(X)}] ⩾ E[rank(c_{Y(X)}; X)] · (1 − α) − E[rank · 1{Y ∉ S}].
The second term is bounded and yields the O(1/n) correction. □

Remark 7.6 (Finite candidate spaces and under-coverage). Part 3 predicts M⋆ = +∞ when β > α, but in practice M⋆ is bounded by the number of distinct canonical classes |C(x)| observed in K samples. When |C(x)| is small (e.g., |C| = 4 for MCQ tasks), the conformal quantile remains finite even though some items are unsolvable. Marginal coverage still falls below 1 − α because, for items where p⋆(x) = 0, the correct class never enters the candidate pool—no prediction set over observed candidates can cover it.
The under-coverage thus equals the unsolvable fraction β, not a calibration error: conditional coverage on solvable items remains near-perfect (Table 2).

Remark 7.7 (Bias is visible, not hidden). The fundamental distinction from LLM-as-judge evaluation is that bias in the agent is surfaced through larger prediction sets, not concealed in an opaque score. A practitioner who observes M⋆ = 8 knows immediately that the agent frequently fails to rank acceptable answers highly. An LLM-judge score of 0.75 carries no such interpretable diagnostic—the same score could arise from high-quality answers with a harsh judge, or poor-quality answers with a lenient judge.

7.4. Formal comparison with alternative evaluation methods. We now compare the conformal method's coverage guarantee against two standard baselines—single-sample evaluation and LLM-as-judge—showing that conformal calibration achieves lower evaluation error once the calibration set is large enough to undercut the judge's bias.

Theorem 7.8 (Advantage over single-sample evaluation). For any agent and any target reliability level 1 − δ, define the evaluation reliability as the probability that the evaluation's assessment is within ε of the truth. For single-sample evaluation applied to a test set of N queries:
(7.5)  P(|p̂_single − p̄| > ε) ⩽ 2 exp(−2Nε²)  (Hoeffding),
where p̂_single = N^{−1} Σ_j 1{a_j ∈ A⋆(x_j)}. For conformal set evaluation:
(7.6)  P(|Ĉov(α) − (1 − α)| > ε) ⩽ 2 exp(−2Nε²) + 1/(n + 1),
where Ĉov is the empirical coverage on the test set. Both converge at rate O(1/√N), but the conformal method provides:
(1) A per-query guarantee (Y(x) ∈ S(x)), not just an aggregate estimate.
(2) An explicit reliability level 1 − α chosen by the user, vs. an unknown p̄ that must be estimated.
(3) A diagnostic set size per query, revealing per-query difficulty.

Proof. Equation (7.5) is Hoeffding's inequality for i.i.d.
Bernoulli random variables. For (7.6): by Theorem 6.7, P(Y_j ∈ S(x_j)) ∈ [1 − α, 1 − α + 1/(n + 1)] marginally. The empirical coverage Ĉov averages N such indicators (approximately independent across test queries). Hoeffding's inequality applied to these indicators, centered at P(Y ∈ S(X)), gives the exponential tail. The 1/(n + 1) term accounts for the gap between P(Y ∈ S(X)) and 1 − α. □

Theorem 7.9 (Advantage over LLM-as-judge). Let the LLM judge have systematic bias b_J = E_{X,a}[b_J(X, a)] ≠ 0 and judge variance σ_J² = E_{X,a}[Var(J(X, a) | X, a)]. Then for any sample size N:
LLM-as-judge evaluation error:
(7.7)  MSE(p̂_J) = b_J² + (σ_J² + p̄(1 − p̄))/N + O(N^{−2}).
The bias term b_J² does not decay with N.
Conformal evaluation error (coverage gap):
(7.8)  |P(Y ∈ S(X)) − (1 − α)| ⩽ 1/(n + 1).
The error is controlled solely by the calibration set size n and is independent of the agent's bias, the judge's bias, or any systematic error.

Proof. For the judge: p̂_J = N^{−1} Σ_j J(x_j, a_j). Then E[p̂_J] = p̄ + b_J, so Bias(p̂_J) = b_J. The variance is Var(p̂_J) = N^{−1} Var(J(X, a)), decomposed via the law of total variance into agent variance and judge variance. The MSE follows. For conformal: this is a restatement of Theorem 7.1. □

Corollary 7.10 (When does conformal evaluation achieve lower error?). Under the MSE comparison criterion (where coverage gap and judge bias are both expressed as deviations from the true acceptability rate), conformal set evaluation achieves lower coverage error than LLM-as-judge whenever
(7.9)  |b_J| > 1/(n + 1).
For reference, if the judge's bias reaches |b_J| ⩾ 0.05—a level documented in the literature [8] for subjective evaluation tasks—then n ⩾ 19 suffices.
On structured tasks with unambiguous answers (e.g., GSM8K, MMLU), judge accuracy can exceed conformal coverage (Table 2), implying smaller effective bias and a correspondingly larger crossover point. The comparison is most informative on ambiguous tasks where judge bias is hardest to bound. Note that this comparison is meaningful when both methods are assessed by their distance to the true acceptability rate; conformal sets and judge scores are complementary evaluation outputs (Remark 7.11).

Remark 7.11 (Complementary, not universally dominant). The "advantage" in Corollary 7.10 compares coverage gap (our method) against MSE (point estimators). These metrics answer different questions: coverage gap measures whether the evaluation's reliability claim is valid, while MSE measures how accurately the evaluation estimates the agent's true accuracy p̄. For reliability certification—"can I trust this agent at level 1 − α?"—conformal calibration is strictly superior once n ⩾ ⌈1/|b_J|⌉ − 1 (e.g., n ⩾ 19 when |b_J| = 0.05). For accuracy estimation—"what fraction of queries does the agent answer correctly?"—simple point estimators with confidence intervals remain useful. The methods are complementary: practitioners should use conformal sets for per-query reliability assessment and point estimates for aggregate performance reporting.

7.5. Variance of set-size as a quality estimator. Since set size serves as a quality diagnostic, we need it to be a stable estimator—the average set size should concentrate around its true mean as the test set grows.

Proposition 7.12 (Concentration of the set-size estimator). Let M̄ := N^{−1} Σ_{j=1}^N |S(x_j)| be the average set size on a test set of N queries. If |S(x)| ⩽ B almost surely (bounded set size), then
(7.10)  P(|M̄ − E[|S(X)|]| > t) ⩽ 2 exp(−2Nt²/B²).
The average set size concentrates around its expectation at rate O(B/√N) and provides a reliable, low-variance estimator of agent quality.

Proof. Each |S(x_j)| ∈ [1, B] is bounded. Apply Hoeffding's inequality. □

These variance bounds on set size motivate the next question: how does the coverage–efficiency trade-off depend on agent competence?

8. COVERAGE–EFFICIENCY TRADE-OFF

A better agent produces smaller prediction sets—but how much smaller? This section quantifies the relationship between agent quality and set size, showing that the prediction set is an efficient diagnostic: strong agents need only singleton sets, while weak agents unavoidably require larger ones.

8.1. Set size as a function of agent competence.

Proposition 8.1 (Expected rank under single-acceptable-class). Suppose there is a unique acceptable canonical class c⋆ with p = P_θ(c_i = c⋆) > 1/2. Then:
(8.1)  E[rank(c⋆; x)] ⩽ 1 + ((1 − p)/p) · ((K − 1)/K), which tends to 1 + (1 − p)/p = 1/p as K → ∞.
If p > 1/2, then E[rank(c⋆; x)] < 2 for all K.

Proof. rank(c⋆) = 1 + |{c′ ≠ c⋆ : n(c′) ⩾ n(c⋆)}|. For any competing class c′ with probability p_{c′} < p:
P(n(c′) ⩾ n(c⋆)) ⩽ P(n(c′) ⩾ Kp/2) + P(n(c⋆) ⩽ Kp/2).
Summing over all competing classes (total mass 1 − p) and applying a stochastic dominance argument: E[rank(c⋆)] ⩽ 1 + ((1 − p)/p) · ((K − 1)/K). □

8.2. Set size inflation under ambiguity.

Proposition 8.2 (Ambiguity inflates calibration scores). If a fraction β of calibration queries have p⋆(x_i) < 1/K and β > α, then M⋆ = +∞.

Proof. At least βn scores are +∞ with high probability. The k-th order statistic with k = ⌈(n + 1)(1 − α)⌉ is infinite when k > n − βn, which holds when β > α + 1/(n + 1). □

Remark 8.3. This is a feature, not a bug: M⋆ = +∞ tells the practitioner that the agent cannot reliably serve this query population at the desired confidence level.
No other evaluation method provides such a clear signal.

9. SEQUENTIAL SAMPLING WITH CERTIFIED EARLY STOPPING

Drawing K samples per query can be expensive. We develop sequential procedures that stop early when consensus is clear, reducing cost without sacrificing coverage.

9.1. The sequential consensus problem. After k samples, let p̂_1^(k) and p̂_2^(k) be the frequencies of the top two candidates, with margin ∆_k := p̂_1^(k) − p̂_2^(k).

9.2. Hoeffding-based stopping criterion.

Theorem 9.1 (Certified mode identification). Define the stopping time
(9.1)  τ_δ := min{k ⩾ k_0 : ∆_k > √(2 ln(2|C(x)|k²/δ)/k)}.
If the true mode c⋆ has p_1 > p_2 := max_{c≠c⋆} P_θ(c | x), then P(c_(1)^(τ_δ) = c⋆) ⩾ 1 − δ.

Proof. At step k, by Hoeffding and a union bound over |C(x)| classes:
P(∃c : |P̂_k(c | x) − P_θ(c | x)| > ε) ⩽ 2|C(x)| exp(−2kε²).
Setting ε = ∆_k/2 and requiring the bound to be ⩽ δ/k² (enabling a sum over k via Σ 1/k² < 2) yields (9.1). When triggered, the true frequencies are within ∆_k/2 of empirical values with probability ⩾ 1 − δ, so the empirical mode equals the true mode. □

9.3. Variance reduction from sequential stopping.

Proposition 9.2 (Variance–cost trade-off of sequential stopping). Let K_max be the maximum sample budget and τ the stopping time from (9.1). Then:
(1) Easy queries (large true margin p_1 − p_2 = ∆ > 0): E[τ] = O(∆^{−2} ln(|C|/δ)), matching the classical sample complexity of fixed-confidence best-arm identification [12].
(2) Hard queries (small ∆): E[τ] ≈ K_max (no savings).
The cost reduction is concentrated on queries where variance is already low (high consensus), preserving the framework's reliability exactly where it is most needed.

9.4. Validity of conformal guarantee under adaptive stopping.
Proposition 9.3 (Conformal validity with adaptive K). If the stopping rule K_i = K_i(x_i, a_1^(i), . . .) depends only on x_i and agent samples (not on y_i), then s_1, . . . , s_{n+1} remain exchangeable and Theorem 6.7 holds.

Proof. The score s_i is a function of (x_i, y_i) and the agent samples {a_j^(i)}_{j=1}^{K_i}. Since K_i depends only on (x_i, {a_j^(i)}) and not on y_i, and (x_i, y_i) are exchangeable, the scores inherit exchangeability. □

Note that the stopping rule (Theorem 9.1) certifies mode identity, not the full rank ordering. For items where s_i = 1 (the correct answer is the mode), mode stability implies rank stability. For items with s_i > 1, rank fluctuations after stopping may affect prediction set size but not coverage validity, since Proposition 9.3 guarantees exchangeability regardless of when sampling stops.

10. EXTENSIONS

10.1. Multiple acceptable canonical classes. When |Canon(x, A⋆(x))| > 1, use:
(10.1)  s_i = min_{l=1,...,L_i} rank(Canon(x_i, y_i^(l)); x_i).
Coverage is preserved since exchangeability is maintained.

10.2. Dependence-aware extensions.

Remark 10.1 (Separation of concerns). A key structural insight: the conformal coverage guarantee depends on exchangeability of calibration–test pairs (x_i, y_i), not on independence of the K within-query samples. Dependence among the K samples affects ranking quality (and hence set size) but not coverage validity. If dependence degrades the ranking, M⋆ grows—the framework self-corrects by widening the set—without breaking the guarantee. This separation is what makes the method robust to the poorly characterized dependence structure of LLM API calls.

10.3. Weighted conformal prediction. Under covariate shift between calibration and test distributions, use likelihood-ratio-weighted conformal prediction [6]:
(10.2)  M⋆_w := weighted quantile of s_1, . . .
, s_n with weights w_i ∝ p_test(x_i)/p_cal(x_i).

11. EMPIRICAL STUDY

We first validated all theoretical results on controlled synthetic agents with known parameters (Appendix C); all predictions—coverage calibration, exponential variance decay, set-size monotonicity, and canonicalization amplification—are confirmed. We then evaluate on five real benchmarks spanning code generation, mathematical reasoning, open-ended question answering, and multiple-choice selection, using five models from three families (GPT-4.1 ladder, Llama 4 Maverick, Mistral Small 24B). Four questions structure the evaluation:
(Q1) Does conformal calibration achieve coverage ⩾ 1 − α empirically (Theorem 6.7)?
(Q2) Does self-consistency reduce variance as K increases (Theorem 4.4)?
(Q3) Does set size adapt to model uncertainty (Theorem 7.5)?
(Q4) Does the framework achieve lower coverage error than single-sample and LLM-judge baselines (Theorems 7.8, 7.9)?

Hypotheses. We formalize six empirical hypotheses, each derived from a specific theoretical result:
(H1) Coverage validity. Empirical coverage meets the 1 − α target on tasks where the model has sufficient capability; any shortfall is attributable to model capability gaps, not calibration failure (Theorem 6.7).
(H2) Variance reduction. Mode identification error decreases with K, with the largest reductions on high-accuracy tasks where p⋆ is far from 0.5 (Theorem 4.4).
(H3) Adaptive set size. Prediction set size correlates positively with per-item answer entropy (Theorem 7.5).
(H4) Canonicalization benefit. Canonicalization reduces prediction set size on tasks with surface-form variation (Proposition 4.6).
(H5) Baseline comparison. Conformal coverage meets or exceeds LLM-judge accuracy on ambiguous tasks where judge bias is non-negligible (Corollary 7.10).
(H6) Sequential efficiency.
The Hoeffding-based stopping rule reduces the average number of samples with no loss in coverage (Theorem 9.1).

11.1. Experimental setup.

Datasets. We select five benchmarks representing distinct task families and canonicalization strategies:
(1) HumanEval (code generation, 164 items). Each response is executed in a sandboxed environment against unit tests; canonicalization is deterministic binary: pass or fail. This represents the lowest-ambiguity setting where correctness is objectively verifiable.
(2) TruthfulQA (open-ended QA, 817 items). Free-form text responses where surface-form variation is extreme—even correct answers differ substantially in phrasing. Canonicalization uses an LLM judge (GPT-4.1 at temperature 0) to classify each sample as correct or incorrect, yielding a binary answer space analogous to code execution.
(3) BigBench MovieRec (multiple-choice, 500 items; hereafter "BigBench"). Responses are canonicalized via option matching and text normalization to the selected movie title.
(4) GSM8K (mathematical reasoning, 1319 items) [11]. Grade-school math word problems requiring multi-step arithmetic. Canonicalization is deterministic numeric extraction: we parse the final answer after the #### delimiter and normalize to a standard integer form. This tests the framework on a task with a unique correct answer but diverse reasoning paths.
(5) MMLU (multiple-choice knowledge, 1000 items sampled from the full test set). Four-option questions spanning 57 academic subjects. Canonicalization reuses the MCQ pipeline (option matching and text normalization).

Agent and sampling configuration. We use the GPT-4.1 model family as a capability ladder with unambiguous ordering—GPT-4.1 (strong), GPT-4.1-mini (mid-tier), GPT-4.1-nano (weak)—providing a clean test of the set-size monotonicity prediction.
All models are evaluated at temperature T = 0.7 with K_max = 20 independent samples per item (K_fixed = 10 for both calibration and test-time prediction in primary results):
• GPT-4.1 (strong): evaluated on all five benchmarks. Expected to yield the smallest prediction sets.
• GPT-4.1-mini (mid-tier): evaluated on GSM8K and MMLU, the two benchmarks with the largest calibration sets (n_cal = 500).
• GPT-4.1-nano (weak): evaluated on GSM8K and MMLU. Expected to yield the largest prediction sets.

As a supplementary comparison, we evaluate GPT-5 mini on all five benchmarks. Under our T = 0.7 i.i.d. sampling protocol (without extended thinking), GPT-5 mini exhibits lower per-sample accuracy than GPT-4.1. Throughout, M⋆ and the reliability level are properties of a specific (model, inference configuration) pair—not of the model architecture alone. The same model can yield different prediction sets depending on temperature, decoding strategy, and whether capabilities like extended thinking are activated.

To test cross-family generalizability, we evaluate two open-weight models via Together AI on GSM8K, MMLU, and TruthfulQA:
• Llama 4 Maverick 17B (meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8): a mid-tier mixture-of-experts model from Meta.
• Mistral Small 24B (mistralai/Mistral-Small-24B-Instruct-2501): a smaller instruction-tuned model from Mistral AI.

The calibration/test split uses up to n_cal = n_test = 500 items per dataset (or half the dataset if smaller; e.g., HumanEval uses 82/82). The LLM judge is fixed at GPT-4.1 (temperature 0) across all experiments to avoid confounding evaluation quality with agent capability.

TABLE 2. Main results (T = 0.7, K = 10, α = 0.10). Coverage target is 1 − α = 0.90. 95% Wilson CIs in parentheses. Results shown for GPT-4.1; see Table 3 for cross-model comparison.
Dataset      n_cal/n_test   M⋆   Coverage           Mode Acc           Judge Acc          Avg. |S|   Cov|solv
HumanEval    82/82          2    0.707 (0.60–0.79)  0.646 (0.54–0.74)  0.915†             1.32       0.967
TruthfulQA   408/409        1    0.976 (0.96–0.99)  0.976 (0.96–0.99)  0.966 (0.94–0.98)  1.00       0.980
BigBench     125/125        2    0.792 (0.71–0.85)  0.768 (0.69–0.83)  0.736 (0.65–0.81)  1.10       1.000
GSM8K        500/500        1    0.956 (0.93–0.97)  0.956 (0.93–0.97)  0.986 (0.97–0.99)  1.00       0.980
MMLU         500/500        2    0.838 (0.80–0.87)  0.798 (0.76–0.83)  0.962 (0.94–0.98)  1.11       0.988

† The LLM judge cannot execute code; it evaluates syntactic plausibility rather than functional correctness, making the judge–conformal comparison structurally invalid on HumanEval (the methods measure different things). HumanEval is excluded from all judge comparisons in the text. "Cov|solv" = conditional coverage among items where the model produced at least one correct answer across all K_max samples—a post-hoc diagnostic that validates the theory but is not available at deployment time (see Section 11.9). Capability gaps: HumanEval 26.8%, BigBench 20.8%, MMLU 15.2%, GSM8K 2.4%, TruthfulQA 0.5%.

All API responses are SHA-256 cached so that re-runs are deterministic and cost-free.

Evaluation protocol. All experiments use the rank-based nonconformity score (Definition 6.2): for each calibration item, the score is the rank of the first acceptable answer in the self-consistency ordering. The adaptive score variant (Section 6.5) is deferred to future work. For each dataset and model we: (1) compute nonconformity scores {s_i}_{i=1}^{n_cal} on the calibration set and determine M⋆ for α ∈ {0.01, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30}; (2) evaluate coverage, mode accuracy, single-sample accuracy, LLM-judge accuracy, and prediction set size on the test set; (3) sweep K ∈ {1, 2, 5, 10, 20} for variance reduction; (4) run the Hoeffding-based sequential stopping rule (δ = 0.05).
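The Wilson intervals reported for Table 2's proportions follow the standard score-interval formula; for example, GSM8K's coverage of 0.956 on n_test = 500 gives roughly (0.93, 0.97). A minimal sketch (our helper, not the paper's released code):

```python
import math

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# wilson_ci(478, 500) -> approximately (0.934, 0.971),
# matching the GSM8K row: 0.956 (0.93-0.97).
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves well for proportions near 1, which is why it suits coverage rates close to the 0.90 target.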
All reported proportions include 95% Wilson confidence intervals [14]; continuous metrics (set size, entropy, average K) include 95% bootstrap percentile confidence intervals (B = 10,000 resamples).

Baselines. We compare four evaluation strategies: (1) Single-sample: one draw from the agent, binary correctness. (2) Self-consistency mode (K = 10): majority-vote answer, binary correctness. (3) LLM-as-judge: GPT-4.1 scores the first sample; score ⩾ 0.5 counts as correct. (4) Conformal prediction set: our method with the calibrated threshold M⋆.

11.2. Main results. Table 2 reports the primary metrics across all five benchmarks.

Coverage validity (Q1). Table 2 reports coverage across all five benchmarks for GPT-4.1. Two benchmarks (GSM8K and TruthfulQA) exceed the 90% target with M⋆ = 1; three fall below: MMLU (0.838), BigBench (0.792), and HumanEval (0.707). On these three benchmarks, the method provides no marginal coverage guarantee at α = 0.10. This is a direct consequence of Theorem 7.5(3): when the fraction of unsolvable items exceeds α, no evaluation method can achieve the target. Conditional coverage among solvable items exceeds 0.96 on all five benchmarks, confirming that under-coverage is driven by model capability gaps (ranging from 2.4% on GSM8K to 26.8% on HumanEval; see the Table 2 footnote) rather than by calibration failure.

TABLE 3. Multi-model comparison at α = 0.10: M⋆ and average set size. Top: GPT-4.1 capability ladder on GSM8K and MMLU. Bottom: open-weight cross-family validation on GSM8K, MMLU, and TruthfulQA. M⋆ and average |S| are non-decreasing with decreasing capability across all model families.
             GPT-4.1         GPT-4.1-mini    GPT-4.1-nano
Dataset      M⋆   Avg. |S|   M⋆   Avg. |S|   M⋆   Avg. |S|
GSM8K        1    1.00       1    1.00       2    1.35
MMLU         2    1.11       2    1.12       2    1.17

             GPT-4.1         Llama 4 Maverick   Mistral Small
Dataset      M⋆   Avg. |S|   M⋆   Avg. |S|      M⋆   Avg. |S|
GSM8K        1    1.00       1    1.00          1    1.00
MMLU         2    1.11       2    1.12          2    1.32
TruthfulQA   1    1.00       2    1.25          2    1.31

GPT-4.1 ladder. On both benchmarks, average set size increases monotonically across the capability ladder; M⋆ increases on GSM8K (1 → 1 → 2) and remains constant at 2 on MMLU. On MMLU, the capability gap is 15.2% (GPT-4.1), 14.6% (GPT-4.1-mini), and 26.2% (GPT-4.1-nano), with conditional coverage remaining high: 0.988, 0.979, and 0.965 respectively. On GSM8K, all three models achieve coverage ⩾ 0.948 with conditional coverage ⩾ 0.973.

Open-weight models. Llama 4 Maverick and Mistral Small 24B confirm cross-family generalizability. On GSM8K, all three families achieve M⋆ = 1 with coverage ⩾ 0.956. On TruthfulQA, the open-weight models require M⋆ = 2 (vs. 1 for GPT-4.1), with coverage ⩾ 0.960 and conditional coverage ⩾ 0.992. On MMLU, the capability gap (25.4% for Maverick, 13.0% for Mistral) drives marginal coverage below 90%, but conditional coverage remains high (0.995 and 0.949).

Supplementary: GPT-5 mini. As a deployment-configuration comparison, GPT-5 mini, a reasoning model evaluated without extended thinking, exhibits lower per-sample accuracy than GPT-4.1 on all five benchmarks: HumanEval (M⋆: 2 → 4, avg. |S|: 1.32 → 2.43), BigBench (2 → 3, 1.10 → 1.83), MMLU (2 → 3, 1.11 → 1.59), GSM8K (1 → 2, 1.00 → 1.42), TruthfulQA (1 → 2, 1.00 → 1.49). This illustrates that the framework assesses deployment-specific performance: the same model architecture can yield different prediction sets depending on inference configuration.

With n_cal ⩾ 82 (the smallest dataset, HumanEval), the theoretical over-coverage bound is 1/(n + 1) ⩽ 0.012, ensuring tight calibration.
For the larger datasets (n_cal = 500), the bound tightens to 0.002.

Comparison with baselines (Q4). On non-code tasks, conformal coverage exceeds judge accuracy where the judge faces ambiguity: BigBench (0.792 vs. 0.736) and TruthfulQA (0.976 vs. 0.966). On MMLU and GSM8K the judge outperforms, because these tasks have unambiguous correct answers (0.962 and 0.986 respectively).

Set-size monotonicity across capability levels. The GPT-4.1 capability ladder (Table 3) directly confirms Theorem 7.5: average set size increases monotonically across GPT-4.1 → GPT-4.1-mini → GPT-4.1-nano on both benchmarks. On GSM8K, GPT-4.1-nano's M⋆ rises to 2 (vs. 1 for the other two); on MMLU, all three share M⋆ = 2 but average set size increases from 1.11 → 1.12 → 1.17. Conditional coverage remains high across all three (⩾ 0.965). The open-weight models reinforce this pattern: on GSM8K, all three families achieve M⋆ = 1; on TruthfulQA, both open-weight models require M⋆ = 2 (vs. 1 for GPT-4.1) with coverage ⩾ 0.960. As a supplementary comparison, GPT-5 mini, evaluated without extended thinking, exhibits higher M⋆ and average set size than GPT-4.1 on all benchmarks (Table 3 footnote), illustrating that the framework assesses deployment-specific performance.

Multi-sample judge comparison. Our primary LLM-judge baseline uses a single call per test item (K_judge = 1), matching the standard setup in most benchmark implementations [8]. To test whether aggregating multiple judge calls closes the gap, we evaluate a majority-vote judge with K_judge ∈ {1, 3, 5, 10} on the two benchmarks where conformal coverage exceeds the single-call judge. On TruthfulQA, the majority-vote judge improves from 0.966 (K = 1) to 0.976 (K = 10), converging to the conformal coverage level, but only at 10× the cost.
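The majority-vote judge baseline can be sketched as follows. This is an illustrative reconstruction (names ours), assuming each judge call returns a scalar score and, per baseline (3), a score ⩾ 0.5 counts a call as correct; the tie convention at even K is our assumption, not the paper's.

```python
def judge_majority(scores, threshold=0.5):
    """Majority vote over K_judge LLM-judge calls: the item counts as
    correct when a strict majority of calls score at or above threshold.
    Ties (possible at even K) fall to 'incorrect' here."""
    passing = sum(s >= threshold for s in scores)
    return 2 * passing > len(scores)
```

With K_judge = 3 and scores (0.9, 0.2, 0.8) the item counts as correct. Aggregation of this kind cannot fix a systematically biased per-call judgment, which is the saturation effect discussed next for BigBench.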
On BigBench, the multi-sample judge shows no improvement (0.736 at K = 1 to 0.728 at K = 10), while conformal coverage achieves 0.792, a 6 percentage point advantage that persists regardless of judge budget. These results confirm that the conformal method's advantage over judging is not an artifact of an underpowered baseline.

Cost-matched comparison. To ensure a fair comparison, we match the total API cost between conformal evaluation and the majority-vote judge (per-call costs in the Table 7 footnote). For MCQ tasks, the cost-matched judge budget is ∼14 calls; for TruthfulQA (which adds judge-based canonicalization), ∼24 calls. We extend the majority-vote judge to K_judge ∈ {15, 20} to evaluate at these budgets. On BigBench, the judge saturates early: accuracy is 0.728 at both K = 15 and K = 20, while conformal coverage reaches 0.792, a 6.4 percentage point gap, though the 95% CIs overlap at this sample size (n_test = 125). The bottleneck is the judge's per-call accuracy, not sampling noise: additional judge calls cannot improve a systematically incorrect judgment. On TruthfulQA, the cost-matched judge (K = 20, accuracy 0.973) essentially matches conformal coverage (0.976), confirming convergence when the judge is accurate. However, on non-text tasks (MCQ, math, code), conformal evaluation requires no judge model at all, eliminating a potential source of evaluation bias while achieving comparable or superior coverage.

Figure 2 shows the coverage validation plots across all α values.

Split stability (bootstrap analysis). To assess whether M⋆ is an artifact of the particular calibration/test split, we re-split the data with 100 independent random seeds and recompute M⋆ and empirical coverage for each split. Table 4 summarizes the results. For five of six model–dataset combinations, M⋆ is identical across all 100 splits (std = 0).
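The re-splitting procedure can be sketched as below. This is a simplified illustration (names ours) that reuses pooled per-item nonconformity scores and recomputes only M⋆; the full analysis also recomputes empirical coverage on each held-out half.

```python
import random
import statistics
from math import ceil, inf

def conformal_threshold(scores, alpha=0.10):
    """ceil((n+1)(1-alpha))-th order statistic of the calibration scores."""
    n = len(scores)
    k = ceil((n + 1) * (1 - alpha))
    return sorted(scores)[k - 1] if k <= n else inf

def split_stability(item_scores, n_splits=100, alpha=0.10, seed=0):
    """Shuffle the pooled scores with n_splits independent seeds, take the
    first half as calibration, and summarize the resulting M* values."""
    thresholds = []
    for split in range(n_splits):
        rng = random.Random(seed + split)  # one independent seed per split
        pool = list(item_scores)
        rng.shuffle(pool)
        cal = pool[: len(pool) // 2]
        thresholds.append(conformal_threshold(cal, alpha))
    return statistics.mean(thresholds), statistics.pstdev(thresholds)
```

A model far from the capability boundary yields the same M⋆ on every split (std = 0); only score distributions with mass near the quantile boundary, like GPT-4.1-nano on GSM8K, produce a fluctuating threshold.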
Only GPT-4.1-nano on GSM8K shows variation (M⋆ = 1 in 82% of splits, M⋆ = 2 in 18%; coverage std = 0.023, the highest in the table). This instability is informative: GPT-4.1-nano sits at the exact capability boundary where the fraction of unsolvable GSM8K items is close to α = 0.10, so the ⌈(n + 1)(1 − α)⌉-th order statistic fluctuates between score values of 1 and 2 depending on which items fall in the calibration set. This is precisely the regime where the reliability level (Definition 2.4: 89.8% for GPT-4.1-nano on GSM8K) provides a more stable signal than the discrete M⋆, since the reliability level depends on the count of s_i ⩽ 1 scores rather than on a single order statistic. Coverage standard deviations for the remaining five combinations range from 0.004 to 0.007, confirming that the conformal guarantee is robust to the choice of calibration partition when the model is well above or below the capability boundary.

FIGURE 2. Coverage validation: empirical coverage vs. target 1 − α for α ∈ {0.01, ..., 0.30} across all five benchmarks. Points above the diagonal are consistent with the marginal coverage guarantee (Theorem 6.7). Points below the diagonal (HumanEval, BigBench, MMLU) arise when the unsolvable fraction β exceeds α: Theorem 7.5(3) predicts M⋆ = +∞ in this regime, meaning no finite prediction set can cover unsolvable items. Empirically, M⋆ remains finite (capped by |C|), but the resulting sets still cannot include an acceptable answer for queries that the model fundamentally cannot solve. The under-coverage is thus a diagnosed capability gap, not a calibration failure: conditional coverage on solvable items exceeds 0.96 across all five benchmarks.

TABLE 4. Bootstrap split stability: M⋆ and coverage across 100 random calibration/test partitions (n_cal = n_test = 500, α = 0.10).

Model          Dataset   M⋆ (mean ± std)   Coverage (mean ± std)
GPT-4.1        GSM8K     1.00 ± 0.00       0.951 ± 0.007
GPT-4.1        MMLU      2.00 ± 0.00       0.979 ± 0.005
GPT-4.1-mini   GSM8K     1.00 ± 0.00       0.957 ± 0.007
GPT-4.1-mini   MMLU      2.00 ± 0.00       0.976 ± 0.004
GPT-4.1-nano   GSM8K     1.18 ± 0.38       0.917 ± 0.023
GPT-4.1-nano   MMLU      2.00 ± 0.00       0.945 ± 0.007

11.3. Variance reduction. Table 5 reports mode error as a function of the number of self-consistency samples K.

Q2: Variance reduction. The synthetic validation (Appendix C, Figure 7) confirms the predicted exponential decay under controlled conditions with known p⋆. The real-data results in Table 5 are consistent with this pattern. On high-accuracy benchmarks, mode error decreases with K: TruthfulQA drops from 0.034 to 0.022 (a 35% reduction), and GSM8K drops from 0.064 to 0.046 (a 28% reduction). On moderate-accuracy benchmarks (HumanEval, BigBench, MMLU), mode error remains relatively flat across K values, consistent with the theoretical prediction that variance reduction is most pronounced

TABLE 5. Mode error ModeErr(K) as a function of K for GPT-4.1. Lower is better. 95% Wilson CIs in parentheses. GSM8K and MMLU are tasks with deterministic canonicalization (numeric extraction and MCQ matching, respectively).
Dataset      K = 1            K = 2            K = 5            K = 10           K = 20
HumanEval    .354 (.26–.46)   .378 (.28–.49)   .366 (.27–.47)   .341 (.25–.45)   .341 (.25–.45)
TruthfulQA   .034 (.02–.06)   .044 (.03–.07)   .027 (.02–.05)   .024 (.01–.04)   .022 (.01–.04)
BigBench     .248 (.18–.33)   .240 (.17–.32)   .240 (.17–.32)   .240 (.17–.32)   .240 (.17–.32)
GSM8K        .064 (.05–.09)   .058 (.04–.08)   .048 (.03–.07)   .046 (.03–.07)   .046 (.03–.07)
MMLU         .212 (.18–.25)   .206 (.17–.24)   .210 (.18–.25)   .204 (.17–.24)   .200 (.17–.24)

when p⋆ is well above 0.5. We note that adjacent K-to-K differences are not individually significant (the Wilson CIs in Table 5 overlap); the evidence is in the monotone trend across all K values rather than in any single pairwise comparison.

GSM8K provides a particularly clean test of variance reduction because its canonicalization is deterministic (numeric extraction): there is no canonicalization noise, so any reduction in mode error is purely attributable to consensus aggregation. HumanEval may exhibit a counterintuitive increase in mode error at larger K for items with pass rate p < 0.5: more samples expose the dominance of the "fail" class, causing the mode to switch from a lucky single pass to the more frequent failure. This is not a failure of the method; it is a faithful reflection of the agent's true distribution.

11.4. Set size as uncertainty diagnostic.

Q3: Set size reflects model uncertainty. Figure 4 shows prediction set size vs. consensus entropy H(x) = −Σ_c p̂_c log p̂_c. Items with H(x) = 0 (all K = 10 samples agree) receive singleton prediction sets (|S| = 1), while items with H(x) > 0 (the model disagrees across samples) receive larger sets. The conformal prediction set naturally adapts to per-item uncertainty without requiring any calibration of an uncertainty score: set size itself is the uncertainty diagnostic. The synthetic validation (Appendix C, Figure 11) shows the same pattern under controlled conditions.
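The entropy diagnostic and the resulting prediction set can be sketched as follows (illustrative names; we assume, consistent with the rank score, that the prediction set consists of the M⋆ most frequent canonical answers):

```python
from collections import Counter
from math import log

def consensus_entropy(answers):
    """H(x) = -sum_c p_hat(c) * log p_hat(c), with p_hat(c) the empirical
    frequency of canonical answer c among the K samples."""
    k = len(answers)
    return -sum((n / k) * log(n / k) for n in Counter(answers).values())

def prediction_set(answers, m_star):
    """The M* most frequent canonical answers (at most M* classes)."""
    return [c for c, _ in Counter(answers).most_common(m_star)]
```

Ten agreeing samples give H = 0 and a singleton set; a 7/3 split gives H ≈ 0.61 and, at M⋆ = 2, a two-element set.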
GSM8K is expected to show a particularly clean entropy–set-size correlation due to its deterministic canonicalization: any disagreement in the model's answers directly reflects uncertainty about the numeric answer, not canonicalization noise.

11.5. Sequential stopping. Table 6 reports results for the Hoeffding-based sequential stopping rule (Theorem 9.1, δ = 0.05). Across all five benchmarks, sequential stopping reduces the average number of samples from K_max = 20 to approximately 10 (45–52% savings), with coverage and prediction set sizes preserved exactly (Table 6). This confirms hypothesis (H6): the Hoeffding-based stopping criterion correctly identifies when the mode has stabilized, terminating early on high-confidence items while using more samples on ambiguous ones. Items that consistently require the full K_max are those near the decision boundary (p⋆ ≈ 0.5), where the Hoeffding bound cannot certify mode stability; these "unstopped" items identify a systematic ambiguity zone for the model on that task. The cost implications are quantified in Table 7.

FIGURE 3. Variance reduction: mode error vs. K across all five benchmarks. TruthfulQA and GSM8K show a clear monotone decrease consistent with Theorem 4.4. HumanEval exhibits a counterintuitive increase at low K: for items with pass rate p < 0.5, more samples expose the dominance of the "fail" class, switching the mode from a lucky pass to the more frequent failure, a theoretically predicted effect (Section 11.3), not a method failure. BigBench and MMLU show relatively flat error, consistent with theory: variance reduction is most pronounced when p⋆ is far from the decision boundary.

TABLE 6. Sequential stopping results for GPT-4.1 (δ = 0.05, K_max = 20). Coverage and set size are preserved while using fewer samples. 95% CIs in parentheses.

Dataset      Avg. K used         Savings   Seq. Coverage       Seq. Avg. |S|
HumanEval    11.05 (10.3–11.9)   44.8%     0.720 (0.61–0.81)   1.28
TruthfulQA   9.67 (9.4–9.9)      51.7%     0.980 (0.96–0.99)   1.00
BigBench     9.92 (9.4–10.4)     50.4%     0.792 (0.71–0.85)   1.10
GSM8K        9.85 (9.6–10.1)     50.8%     0.952 (0.93–0.97)   1.00
MMLU         9.98 (9.7–10.3)     50.1%     0.834 (0.80–0.86)   1.11

11.6. Cost analysis. Table 7 reports estimated API costs per method and model. Costs are computed from the number of API calls required (items × samples per item) using published pricing for each model. The cost overhead of self-consistency sampling is K× compared to single-sample evaluation, but sequential stopping recovers approximately half of this cost (Table 7). The key insight is that self-consistency is an evaluation cost, not a deployment cost: it is paid once to obtain calibrated reliability estimates, not at every inference call. Smaller models reduce costs dramatically (GPT-4.1-nano: ∼$2 full, ∼$1 sequential for 2 benchmarks).

FIGURE 4. Mean prediction set size vs. mean consensus entropy across all five benchmarks (benchmark-level aggregation). Benchmarks with higher average entropy (greater model disagreement) produce larger prediction sets. The per-item correlation underlying this aggregate pattern is confirmed in the synthetic validation (Figure 11), where individual items are plotted. Error bars show 95% bootstrap CIs.

11.7. Canonicalization ablation. Table 8 compares the full canonicalized method against raw-string ranking on the two non-binary benchmarks. The effect of canonicalization is task-dependent (Table 8, Figure 5).
BigBench shows the largest reduction (39.2%), reflecting substantial surface-form variation in raw responses that option-matching canonicalization resolves. MMLU shows a 21.3% reduction. TruthfulQA shows no difference because its judge-based canonicalization already produces binary labels, leaving no surface-form variation to consolidate.

TABLE 7. Estimated API cost (USD) per evaluation method and model. GPT-4.1 and GPT-5 mini costs are for all 5 benchmarks; other models are for the benchmarks indicated. "Full" = K_max = 20 samples per item; "Sequential" = adaptive stopping; "Single" = 1 sample; "Judge" = 1 judge call per test item.

Model               Full budget   Sequential   Single sample   Judge baseline
GPT-4.1             $168.01       $125.13      $4.20           $3.88
GPT-4.1-mini†       $20.80        $10.40       $1.04           $2.40
GPT-4.1-nano†       $1.60         $0.80        $0.08           $2.40
GPT-5 mini          $33.60        $25.03       $0.84           $3.88
Llama 4 Maverick‡   $7.84         $3.92        $0.39           $2.88
Mistral Small‡      $2.40         $1.20        $0.12           $2.88

† 2 benchmarks only (GSM8K + MMLU). ‡ 3 benchmarks (GSM8K + MMLU + TruthfulQA) via Together AI. Token counts validated on representative API calls: ∼85 input and ∼66 output tokens per sample call (cross-benchmark average; GSM8K uses ∼300 total due to chain-of-thought), and ∼215 input / ∼7 output tokens per judge call. Dollar amounts above use conservative upper-bound token estimates (500/200 per sample call, 800/100 per judge call) rather than the observed averages (85/66 and 215/7); actual costs are approximately 3–4× lower. We report the upper bounds for reproducibility, as token counts vary across benchmarks and prompts. The judge always uses GPT-4.1; the judge cost column is therefore constant regardless of the agent model, creating an asymmetry where the judge appears relatively expensive for cheap models (e.g., GPT-4.1-nano) and cheap for expensive ones. Sequential costs use the GPT-4.1 stopping profile (average K ≈ 10) for all models. Open-weight model costs reflect Together AI serverless pricing ($0.27/$0.85 per 1M tokens for Maverick, $0.10/$0.30 for Mistral Small); self-hosted inference would further reduce sampling costs to zero.

TABLE 8. Canonicalization ablation: average prediction set size with and without canonicalization (α = 0.10, GPT-4.1). Applicable to task types with non-trivial canonicalization (text and MCQ).

Dataset      Canonical |S|   Raw |S|   Reduction
TruthfulQA   1.00            1.00      0.0%
BigBench     1.10            1.81      39.2%
MMLU         1.11            1.41      21.3%

GSM8K and HumanEval are excluded because their canonicalization is already deterministic (numeric extraction and code execution, respectively).

11.8. Comparison with alternative nonconformity scores. We compare our rank-based score (Definition 6.2) against two standard scores from the conformal classification literature [7, 17], all computed from the same cached K = 10 samples at zero additional cost:

(1) Least Ambiguous set-valued Classifier (LAC): s_i^LAC = 1 − P̂(c⋆), where P̂(c) = count(c)/K. Prediction sets include all classes c with 1 − P̂(c) ⩽ τ.
(2) Adaptive Prediction Sets (APS): s_i^APS = Σ_{j=1}^{r_i − 1} P̂(c_(j)) + U · P̂(c_(r_i)), where the c_(j) are the classes sorted by decreasing P̂, r_i is the rank at which the correct class appears, and U ∼ Unif(0, 1) randomizes the inclusion boundary.

Both LAC and APS are built on probabilities of resolution 1/K, whereas our rank score is a discrete integer. Quach et al. [10] target token-level sequence generation, a fundamentally different setting from our class-level conformal framework, so no direct comparison is applicable. Table 9 reports coverage, average set size, and conditional coverage for all three scores across the GPT-4.1 capability ladder on GSM8K and MMLU (the two benchmarks with local canonicalization).
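The two alternative scores can be sketched from the same cached samples (illustrative names; the APS randomization here adds U times the correct class's empirical mass to the mass ranked strictly above it, a standard convention that matches the formula above up to how the boundary term is written):

```python
import random
from collections import Counter

def empirical_probs(answers):
    """p_hat(c) = count(c) / K over the K cached samples."""
    k = len(answers)
    return {c: n / k for c, n in Counter(answers).items()}

def lac_score(answers, correct):
    """LAC nonconformity: 1 - p_hat(c*)."""
    return 1.0 - empirical_probs(answers).get(correct, 0.0)

def aps_score(answers, correct, u=None):
    """APS nonconformity: cumulative mass of the classes ranked above c*,
    plus U * p_hat(c*) to randomize the inclusion boundary."""
    if u is None:
        u = random.random()
    ranked = sorted(empirical_probs(answers).items(), key=lambda kv: -kv[1])
    mass_above = 0.0
    for c, p in ranked:
        if c == correct:
            return mass_above + u * p
        mass_above += p
    return 1.0  # the correct class never appeared among the K samples
```

With samples ('a', 'a', 'b') and correct class 'a', LAC gives 1/3, and APS with u = 0.5 also gives 1/3 (zero mass above plus 0.5 · 2/3). Both inherit the 1/K coarseness of the empirical probabilities, which is the regime where the discrete rank score proves more robust.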
FIGURE 5. Canonicalization effect on average prediction set size. BigBench shows a 39.2% reduction (the largest effect), MMLU shows 21.3%, and TruthfulQA shows no difference (binary judge labels leave no surface-form variation). GSM8K and HumanEval are excluded (deterministic canonicalization).

TABLE 9. Comparison of nonconformity scores (K = 10, α = 0.10). All three scores use identical cached samples.

                          Coverage               Avg. |S|
Model          Dataset    Rank    LAC     APS    Rank   LAC    APS
GPT-4.1        GSM8K      0.956   0.914   0.908  1.00   1.00   1.15
GPT-4.1        MMLU       0.980   1.000   1.000  1.11   1.13   1.13
GPT-4.1-mini   GSM8K      0.954   0.900   0.902  1.00   1.00   1.12
GPT-4.1-mini   MMLU       0.976   1.000   1.000  1.12   1.13   1.13
GPT-4.1-nano   GSM8K      0.960   0.924   0.894  1.35   1.02   1.61
GPT-4.1-nano   MMLU       0.940   1.000   1.000  1.17   1.20   1.20

Conditional coverage: on GSM8K, rank achieves ⩾ 0.977, LAC ⩾ 0.922, and APS ⩾ 0.923 across all models. On MMLU, rank achieves ⩾ 0.988; LAC and APS both achieve 1.000. HumanEval and TruthfulQA are excluded because their binary canonicalization (pass/fail, correct/incorrect) produces only two canonical classes, making all three scores equivalent. BigBench uses the same MCQ canonicalization as MMLU; results are qualitatively identical and omitted for space.

Two complementary patterns emerge. On GSM8K (M⋆ = 1 for GPT-4.1 and GPT-4.1-mini), the rank score achieves the highest coverage (0.954–0.960), outperforming LAC (0.900–0.924) and APS (0.894–0.908) by 3–7 percentage points. With K = 10, the empirical probabilities P̂(c) have resolution 0.1, so the calibrated quantile threshold for the continuous scores can sit exactly at these coarse values, leaving little safety margin.
The rank score's discrete nature acts as a natural regularizer, rounding the conformal threshold conservatively. On MMLU (M⋆ = 2), LAC and APS both achieve perfect coverage (1.000), but with slight over-coverage relative to the 90% target, while rank stays closer to the nominal level (0.940–0.980). Set sizes are comparable across all three scores (avg. |S| ≈ 1.1–1.2). The most striking case is GPT-4.1-nano on GSM8K: APS produces the largest sets (avg. |S| = 1.61) yet the lowest coverage (0.894); the randomization in APS adds noise rather than precision when K is small. In the controlled synthetic setting (Appendix C, Figures 8–9), APS and LAC achieve lower MSE than Rank due to their finer-grained continuous thresholds; however, Rank provides the widest coverage safety margin, and the distribution-specific failures observed here (under-coverage, threshold saturation) do not arise in that idealized setting. Overall, in the K ⩽ 20 regime typical of API-based evaluation, the rank score's robustness to coarse probability estimates and real-world distribution skew makes it a pragmatic default.

APS's coverage of 0.894, falling below the 0.90 nominal level, does not contradict conformal guarantees. Conformal coverage holds marginally, in expectation over the random calibration/test split, so any single split may fall slightly below the nominal level. At n_cal = 500, the 1/(n + 1) slack is only 0.002, but APS's randomization (U ∼ Unif(0, 1) tie-breaking) introduces additional sampling variance that amplifies finite-sample deviations, particularly when K is small and probability estimates are coarse.

11.9. Discussion. Four themes emerge from these results: the calibration is accurate, the framework is honest about limitations, the patterns generalize across model families, and the reliability level provides a practical deployment metric.
Calibration quality is confirmed by conditional coverage. How do we know the calibration is working? We compute coverage restricted to "solvable" items: those where the model produced at least one correct answer across all K_max samples. Across all five benchmarks and all models, this conditional coverage exceeds 0.93 (Table 2), confirming that conformal calibration is nearly perfectly calibrated whenever the model has non-zero capability. Any shortfall in marginal coverage comes from the model's fundamental inability to solve certain items, not from calibration error.

We emphasize that conditional coverage is a post-hoc diagnostic that validates the theory's prediction (Theorem 7.5(3): when the unsolvable fraction β exceeds α, M⋆ = +∞), not a guarantee available at deployment time. The "solvable" designation requires observing all K_max samples, which is unavailable during calibration. For deployment decisions, the reliability level (Definition 2.4, Table 10) provides the actionable metric: it quantifies the confidence at which mode voting suffices, directly from calibration scores, without requiring knowledge of which items are solvable.

Concretely, a practitioner does not need to distinguish a capability gap from a calibration failure before deployment; the reliability level answers the deployment question directly. HumanEval's reliability level is 69.9% (Table 10). If the deployment requirement is 90% reliability, this model–task pair fails the gate. If 70% suffices, it passes. The framework provides the actionable number; diagnosing why coverage is low (capability gap vs. calibration error) is a secondary, offline investigation using conditional coverage.

The framework diagnoses capability honestly. HumanEval illustrates the boundary condition most clearly: coverage falls below 90% (0.707) because 26.8% of problems are unsolvable; none of the K_max = 20 samples produce passing code.
No evaluation method can "conjure" correct answers from nothing; conformal prediction faithfully reports this limitation. The LLM judge, by contrast, achieves an artificially high 0.915 by evaluating syntactic plausibility rather than functional correctness, approving well-formed but incorrect solutions. Self-consistency with execution-based canonicalization grounds evaluation in actual correctness rather than surface-level assessment.

The pattern generalizes across model families. The capability-ladder comparison (Table 3) confirms that M⋆ and average set size increase monotonically with decreasing capability, as predicted by Theorem 7.5. This monotonicity holds within the GPT-4.1 family (three models on two benchmarks) and extends across model families: Llama 4 Maverick and Mistral Small 24B exhibit the same qualitative patterns on GSM8K, MMLU, and TruthfulQA. On GSM8K, all three families achieve M⋆ = 1; on TruthfulQA, the open-weight models require M⋆ = 2 (vs. 1 for GPT-4.1), consistent with their lower per-sample accuracy.

TABLE 10. Reliability certification: reliability level 1 − α⋆ (%) for each model–task combination. Higher is better. "—" indicates the model was not evaluated on that benchmark. An asterisk marks combinations where M⋆ = 1 at α = 0.10 (mode voting alone suffices).

Model              GSM8K   MMLU   TruthfulQA   BigBench   HumanEval
GPT-4.1            94.6*   83.4   96.8*        75.4       69.9
GPT-4.1-mini       96.0*   81.6   —            —          —
GPT-4.1-nano       89.8    66.5   —            —          —
Llama 4 Maverick   95.4*   66.7   85.3         —          —
Mistral Small      93.8*   77.8   84.8         —          —

Reliability decreases monotonically with capability within each family. On MMLU: GPT-4.1 (83.4%) > GPT-4.1-mini (81.6%) > Mistral Small (77.8%) > Llama 4 Maverick (66.7%) ≈ GPT-4.1-nano (66.5%). Cross-family models show comparable reliability on matched tasks (GSM8K: four of five models within 93.8–96.0%; GPT-4.1-nano at 89.8%).
Conditional coverage on solvable items remains high across all families (⩾ 0.949). The bootstrap split analysis (Table 4) confirms that M⋆ values are stable across random data partitions. This cross-family consistency establishes that the conformal guarantees are properties of the method, not artifacts of a particular provider's output distribution.

Reliability certification provides the practical payoff. The reliability level (Definition 2.4) yields a single-number deployment summary for each model–task combination; Table 10 reports it across all configurations. This enables direct deployment gating: "we need X% reliability: which models qualify?" The reliability level is computed directly from calibration scores with no additional API cost.

Hypothesis validation. All six hypotheses are supported (Tables 2–6): coverage holds on solvable items with conditional coverage ⩾ 0.93 across all models and benchmarks (H1); mode error decays with K on high-accuracy tasks, with TruthfulQA dropping 35% and GSM8K dropping 28%, though adjacent-K Wilson CIs overlap on moderate-accuracy benchmarks, so the evidence for H2 rests on the monotone trend rather than on pairwise significance; prediction set size correlates positively with consensus entropy (H3) and decreases with canonicalization, BigBench by 39.2% and MMLU by 21.3% (H4); conformal coverage exceeds judge accuracy where ambiguity is high, with a 6 percentage point advantage on BigBench at matched cost (H5), though we note that the BigBench comparison (n_test = 125) has wide confidence intervals; and sequential stopping saves 45–52% of samples with zero quality loss (H6).

12. LIMITATIONS

(1) Marginal coverage only. Theorem 6.7 provides marginal, not conditional, coverage. For individual difficult queries, the set may under-cover. Conditional coverage is impossible without assumptions [4, 5], though group-conditional methods [7] can partially address this.
(2) Calibration set cost. The framework requires n human-labeled calibration examples. However, the calibration set is reusable across agents and is typically much smaller than full test sets (Corollary 7.10: n ⩾ 19 suffices to achieve lower coverage error than typical judge bias).

(3) Canonicalization quality and circularity. Poor canonicalization inflates set sizes (Assumption 4.2). When an LLM judge is used for canonicalization on the same model family being evaluated (e.g., GPT-4.1 judging GPT-4.1 on TruthfulQA), systematic self-preference biases could propagate into the consensus vote. The conformal coverage guarantee remains valid regardless of canonicalization errors (set sizes grow to compensate), but variance reduction degrades. Our cross-family results (GPT-4.1 judging Llama/Mistral) provide a clean test of this concern; see Remark 4.3. We recommend deterministic canonicalization wherever feasible.

(4) Infinite scores. When the agent never produces acceptable answers (p⋆(x) ≈ 0), M⋆ = +∞. This is honest but limits practical utility for very weak agents. When multiple models all yield M⋆ = +∞, the framework cannot distinguish them via M⋆ alone. The reliability level (Definition 2.4) partially addresses this: a model with reliability 55% is distinguishable from one at 40%, even though both have M⋆ > 1 at α = 0.10. For very weak models where even the reliability level is uninformative, the minimum α at which M⋆ is finite provides an additional comparative signal.

(5) Conformal calibration diagnoses, but does not repair. A biased agent receives large prediction sets, faithfully reflecting its limitations. The framework provides a reliable measurement of performance, not a method for improving it.

(6) Model family scope.
The capability-ladder validation uses the GPT-4.1 family (three models) on two benchmarks, supplemented by GPT-5 mini on all five benchmarks and two open-weight models (Llama 4 Maverick, Mistral Small 24B) on three benchmarks. While cross-family patterns are consistent, extending to additional architectures (e.g., Gemini, Claude) and larger model scales would further strengthen generalizability.

(7) Deployment exchangeability and benchmark contamination. The conformal guarantee requires exchangeability between calibration and test data (Remark 6.5). In deployment, model updates between calibration and inference violate this assumption—a limitation shared by all conformal methods. Periodic recalibration is the standard mitigation; in our setting, this is inexpensive because the evaluation pipeline is fully cached and automated, and new calibration requires only running the updated model on the fixed calibration set. Additionally, all five benchmarks are public and may appear in the training data of the models tested, which could inflate observed accuracy without violating the conformal guarantee (exchangeability of calibration and test items is preserved if both are drawn from the same contaminated distribution). However, the resulting reliability levels would not transfer to out-of-distribution deployment queries. This is not specific to our method—it applies to any benchmark-based evaluation, whether conformal, LLM-as-judge, or human-graded. Practitioners deploying on a novel domain should calibrate on queries drawn from that domain, not from public benchmarks. The calibration cost is modest (Section 1: 50–100 spot-checks), making domain-specific recalibration practical.

13.
CONCLUSION

The central contribution of this work is the reliability level—a single number that answers: "given any black-box AI system and any task with a small calibration set, can I trust this system at X% confidence?" The reliability level is computed directly from conformal calibration scores, requires no additional API cost beyond the evaluation itself, and generalizes across model families (Table 10).

Theoretically, self-consistency sampling achieves exponential variance decay (Theorem 4.4), while conformal calibration ensures validity within 1/(n + 1) of the target level, independent of agent bias (Theorem 7.1). Remaining bias surfaces as wider prediction sets rather than hidden error (Theorem 7.5). Once the calibration set is large enough that the conformal slack 1/(n + 1) falls below the judge's bias—as few as n = 19 for subjective tasks with |b_J| ⩾ 0.05—the evaluation's coverage error is smaller than the judge's (Corollary 7.10), though the two approaches remain complementary (Remark 7.11).

Empirically, the guarantees hold across all fifteen model–task configurations tested, spanning three model families and both synthetic and real data. Conditional coverage on solvable items exceeds 0.93 in every configuration, confirming that all marginal under-coverage reflects model capability gaps, not calibration failure. Sequential stopping reduces API costs by ∼50% without sacrificing coverage or set-size quality.

Several directions merit investigation. First, conditional coverage guarantees—per-query rather than marginal validity—would strengthen deployment confidence, though known impossibility results [4] impose fundamental limits. Second, active-learning strategies for calibration set construction could reduce the labeling budget below the current n ⩾ 19 threshold.
Finally, extending the framework to multi-turn agent interactions and hybrid scores that combine model-internal probabilities with self-consistency ranking would broaden applicability.

Reproducibility. All code, configuration files, and result summaries are available at https://github.com/Cohorte-ai/trustgate-paper. The repository includes the full empirical pipeline, synthetic validation experiments, and all figure-generation scripts. All API responses are SHA-256 cached on disk, enabling deterministic re-runs without additional API cost. Cached responses for all models and benchmarks reported in this paper are included in the release. The repository is private during review; access is available upon request to the corresponding author.

REFERENCES

[1] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou, Self-Consistency Improves Chain of Thought Reasoning in Language Models, arXiv preprint (2022).
[2] Anastasios N. Angelopoulos and Stephen Bates, A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification, arXiv preprint (2021).
[3] Vladimir Vovk, Alex Gammerman, and Glenn Shafer, Algorithmic Learning in a Random World, Springer, 2005.
[4] Vladimir Vovk, Conditional Validity of Inductive Conformal Predictors, Machine Learning 92 (2013), 349–376.
[5] Rina Foygel Barber, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani, The Limits of Distribution-Free Conditional Predictive Inference, Information and Inference 10 (2021), no. 2, 455–482.
[6] Ryan J. Tibshirani, Rina Foygel Barber, Emmanuel J. Candès, and Aaditya Ramdas, Conformal Prediction Under Covariate Shift, Advances in Neural Information Processing Systems (2019).
[7] Yaniv Romano, Matteo Sesia, and Emmanuel J. Candès, Classification with Valid and Adaptive Coverage, Advances in Neural Information Processing Systems (2020).
[8] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, and Ion Stoica, Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, arXiv preprint (2023).
[9] Paula Cordero-Encinar and Andrew B. Duncan, Certified Self-Consistency: Statistical Guarantees and Test-Time Training for Reliable Reasoning in LLMs, arXiv preprint (2025).
[10] Victor Quach, Adam Fisch, Tal Schuster, Adam Yala, Jae Ho Sohn, Tommi S. Jaakkola, and Regina Barzilay, Conformal Language Modeling, arXiv preprint (2023).
[11] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Prafulla Dhariwal, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman, Training Verifiers to Solve Math Word Problems, arXiv preprint (2021).
[12] Eyal Even-Dar, Shie Mannor, and Yishay Mansour, Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems, Journal of Machine Learning Research 7 (2006), 1079–1105.
[13] Wassily Hoeffding, Probability Inequalities for Sums of Bounded Random Variables, Journal of the American Statistical Association 58 (1963), no. 301, 13–30.
[14] E. B. Wilson, Probable Inference, the Law of Succession, and Statistical Inference, Journal of the American Statistical Association 22 (1927), 209–212.
[15] Potsawee Manakul, Adian Liusie, and Mark J. F. Gales, SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models, arXiv preprint (2023).
[16] Zhiyuan Wang, Jinhao Duan, Lu Cheng, Yue Zhang, Qingni Wang, Xiaoshuang Shi, Kaidi Xu, Heng Tao Shen, and Xiaofeng Zhu, ConU: Conformal Uncertainty in Large Language Models with Correctness Coverage Guarantees, Findings of EMNLP (2024), 6886–6898.
[17] Bhawesh Kumar, Charlie Lu, Gauri Gupta, Anil Palepu, David Bellamy, Ramesh Raskar, and Andrew Beam, Conformal Prediction with Large Language Models for Multi-Choice Question Answering, arXiv preprint (2023).
[18] Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar, Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation, arXiv preprint (2023).
[19] Nils Reimers and Iryna Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, Proceedings of EMNLP-IJCNLP (2019).
[20] Leland McInnes, John Healy, and Steve Astels, hdbscan: Hierarchical density based clustering, Journal of Open Source Software 2 (2017), no. 11, 205.

APPENDIX A. CANONICALIZATION IMPLEMENTATION DETAILS

This appendix provides full implementation details for the three canonicalization regimes summarized in Section 4.3.

A.1. Deterministic canonicalization (closed-form tasks). For tasks with structured answers—numbers, dates, entity IDs, labels, code outputs—the canonicalization is deterministic:
(1) Parse the raw answer into a typed representation (numeric, date, identifier).
(2) Normalize units, locale formatting, and whitespace.
(3) Serialize to a stable canonical string.
(4) Map parse failures to a dedicated INVALID class.
This eliminates surface-form fragmentation entirely and enables exact class counts. For mathematical reasoning (e.g., GSM8K), extracting the final numeric answer and normalizing to a standard decimal form is sufficient.

A.2. Embedding-based clustering (open-ended tasks). For free-form answers where deterministic parsing is impossible, we cluster semantically equivalent answers:
(1) Compute dense embeddings e(a_i) for each sample (e.g., via Sentence-BERT [19]).
(2) Build a similarity graph with edge threshold τ_cos on cosine similarity.
(3) Cluster via connected components or density-based methods (e.g., HDBSCAN [20]).
(4) Assign each cluster a canonical representative (medoid or LLM-generated summary).
The threshold τ_cos is a hyperparameter that trades off between over-merging (collapsing distinct answers) and under-merging (fragmenting equivalent answers). Both failure modes affect the consensus: over-merging inflates the mode with incorrect answers; under-merging fragments the correct class, reducing its count.

A.3. LLM-assisted canonicalization. A lightweight LLM can map each raw answer to a concise canonical form before clustering:
(1) Prompt a cheap, fast model to extract the "core answer" from each raw response.
(2) Optionally normalize to a closed ontology or structured schema.
(3) Apply deterministic or embedding-based canonicalization on the extracted forms.
This is useful when answers contain extraneous reasoning, caveats, or formatting. Because LLM-assisted canonicalization can itself introduce bias (the canonicalizer may misinterpret or truncate), we recommend auditing canonicalization errors on a held-out sample.

A.4. Stability requirements. To ensure the consensus vote is meaningful, canonicalization must satisfy empirically verifiable stability conditions:
(1) Bootstrap stability: resample the K answers, re-canonicalize, and measure cluster agreement via the adjusted Rand index (ARI). Require ARI ⩾ 0.8 for reporting.
(2) Threshold sensitivity: vary τ_cos over a range [τ_cos − ϵ, τ_cos + ϵ] and verify the winning canonical class is unchanged.
(3) Merge/split audit: on a random sample of clusters, manually verify that merged answers are semantically equivalent and split answers are semantically distinct.
These diagnostics should be reported alongside experimental results. If canonicalization is unstable, the prediction sets will be inflated—which is conservative (coverage is preserved) but reduces the framework's practical efficiency.

APPENDIX B.
NOTATION SUMMARY

Table 11 summarizes the key symbols used throughout the paper.

TABLE 11. Summary of notation.

  X, A : Query space, answer space
  f_θ : Agent with parameters θ; stochastic map X → A
  A⋆(x) : Set of acceptable answers for query x
  Canon(x, a) : Canonicalization function mapping raw answers to canonical classes
  K : Number of i.i.d. samples per query
  K_max : Maximum samples per query (budget)
  p⋆(x) : Per-query probability of producing an acceptable answer
  p̄ : Average acceptability rate E_x[p⋆(x)]
  p_canon : Probability mass on the correct canonical class after canonicalization
  s_i : Nonconformity score for calibration item i (rank of acceptable answer)
  M⋆ : Conformal threshold, the ⌈(1 − α)(n + 1)⌉-th order statistic of {s_i}
  S(x) : Prediction set, the top-M⋆ canonical answers for query x
  |S(x)| : Prediction set size
  α : Miscoverage level; target coverage is 1 − α
  n : Calibration set size
  b_J : LLM-judge bias
  H(x) : Consensus entropy for query x
  ϕ(x) : Consensus strength P̂_K(c_(1) | x)
  Δ(x) : Consensus margin P̂_K(c_(1) | x) − P̂_K(c_(2) | x)
  1 − α⋆ : Reliability level (Definition 2.4)
  τ_δ : Sequential stopping time at confidence 1 − δ

APPENDIX C. SYNTHETIC VALIDATION

Before evaluating on real benchmarks, we validate the theoretical results using controlled synthetic agents with known parameters. This allows direct comparison between theoretical predictions and empirical observations without confounding from model-specific behavior or canonicalization noise.

C.1. Setup. We construct synthetic agents as multinomial distributions over a finite set of canonical classes C = {c_1, . . . , c_L}. For each experiment, we specify the agent's true distribution P_θ(· | x) and run the full pipeline: sampling, ranking, conformal calibration, and prediction set construction.
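To make this pipeline concrete, the stdlib-only sketch below builds such a multinomial agent, draws K samples, and computes the frequency rank of the acceptable class (the Rank nonconformity score). The class count, probability masses, and seed are illustrative choices, not the paper's configuration.

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical synthetic agent over L canonical classes: class 0 is the
# acceptable answer with mass p_star; the rest share the remaining mass.
def make_agent(p_star, L=10):
    return [p_star] + [(1.0 - p_star) / (L - 1)] * (L - 1)

def sample_and_rank(probs, K):
    """Draw K i.i.d. answers and rank canonical classes by frequency."""
    classes = range(len(probs))
    answers = random.choices(classes, weights=probs, k=K)
    counts = Counter(answers)
    # Order classes most-frequent first; rank is the 1-based position.
    ordered = sorted(classes, key=lambda c: -counts[c])
    return counts, {c: r for r, c in enumerate(ordered, start=1)}

counts, rank = sample_and_rank(make_agent(0.7), K=20)
s = rank[0]  # nonconformity score: frequency rank of the acceptable class
```

Running this per calibration item yields the score list {s_i} from which the conformal threshold is later computed.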
All experiments use 200 calibration items and 500 test items unless stated otherwise.

C.2. Coverage validation (Theorem 6.7). We verify that empirical coverage matches the target 1 − α across a sweep of α ∈ {0.01, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30} for agents with varying quality levels. Figure 6 confirms that coverage is consistently at or above the target for all alpha levels, with the slight conservatism predicted by the 1/(n + 1) upper bound in (6.6).

C.3. Variance reduction (Theorem 4.4). We measure mode error as a function of K for agents with known p⋆ ∈ {0.60, 0.70, 0.80} and overlay the Hoeffding upper bound exp(−2K(p⋆ − 1/2)²). Figure 7 shows that empirical mode error decays exponentially and stays below the Hoeffding bound at all K values, validating the exponential decay predicted by Theorem 4.4. The real-data results in Section 11.3 are consistent with this pattern.

C.4. Bias–variance decomposition (Section 3). We compare six evaluation methods—single sample, LLM-as-judge (simulated with known bias), self-consistency mode, and three conformal scores (Rank, LAC, APS)—on a mixture of easy (p⋆ = 0.75) and hard (p⋆ = 0.35, mode is wrong) queries. Figures 8 and 9 show the decomposition at K = 20 and K = 10 respectively. The primary finding is that all three conformal methods reduce MSE by 50–200× compared to non-conformal baselines, confirming the decomposition in Table 1. This gap dwarfs differences between conformal scores.

FIGURE 6. Synthetic coverage validation: empirical coverage vs. target 1 − α for agents with p⋆ ∈ {0.6, 0.7, 0.8}. All points lie on or above the diagonal, confirming Theorem 6.7.

FIGURE 7. Synthetic variance reduction: mode error vs. K for three agents with known p⋆. Dashed lines show Hoeffding upper bounds. Empirical error decays exponentially and stays below the theoretical bound.

Interpreting conformal bias. The Bias² bars for conformal methods appear larger than for single-sample evaluation, but the underlying bias has a fundamentally different character. For the judge and self-consistency mode, bias reflects over-estimation of accuracy—the method reports higher correctness than the true p⋆, giving false confidence. For conformal methods, bias reflects over-coverage—the prediction set contains the correct answer more often than the 1 − α target requires (coverage annotations shown below conformal bars). Over-coverage is conservative: it errs on the side of safety. Rank's coverage of 0.946 at K = 10 (vs. the 0.90 target) represents a desirable safety margin, not a deficiency.

Score comparison. Among the conformal scores, APS and LAC achieve lower MSE than Rank at both K values, because their continuous thresholds produce tighter calibration (less over-coverage). This is expected in a controlled synthetic setting where class probabilities are well-behaved. However, the Rank score provides the most conservative coverage—0.946 at K = 10 vs. APS at 0.901 (barely above the 0.90 target)—giving the widest safety margin.
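As a minimal illustration of the Rank-score calibration step, the sketch below computes M⋆ as the ⌈(1 − α)(n + 1)⌉-th order statistic of the calibration scores and checks empirical coverage on test scores. The score counts are made up for illustration, not measured data.

```python
import math

def conformal_threshold(scores, alpha):
    """Rank-score threshold: the ceil((1 - alpha) * (n + 1))-th smallest
    calibration score, where each score is the frequency rank of the
    acceptable answer for that calibration item."""
    n = len(scores)
    k = math.ceil((1 - alpha) * (n + 1))
    if k > n:                       # calibration set too small for this alpha
        return float("inf")
    return sorted(scores)[k - 1]    # k-th smallest score

# Illustrative scores: 180 items at rank 1, 15 at rank 2, 5 at rank 3.
cal_scores = [1] * 180 + [2] * 15 + [3] * 5
M_star = conformal_threshold(cal_scores, alpha=0.10)   # -> 2 here

# A test item is covered iff its acceptable answer sits within the
# top-M* canonical classes, i.e. its score is <= M*.
test_scores = [1] * 460 + [2] * 30 + [4] * 10
coverage = sum(s <= M_star for s in test_scores) / len(test_scores)
```

With these illustrative counts (n = 200, α = 0.10), k = ⌈0.9 · 201⌉ = 181, so M⋆ = 2 and the test coverage of 0.98 sits above the 0.90 target, mirroring the mild over-coverage discussed above.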
The empirical results (Section 11.8) reveal that on real LLM distributions, the continuous scores' advantage reverses: with K = 10, APS under-covers on GSM8K (0.894 < 0.90) and both LAC and APS degenerate to 100% coverage on MMLU (the threshold saturates at τ = 1). These distribution-specific failures—caused by extreme frequency skew and discrete answer spaces—do not appear in the controlled synthetic setting but are precisely the conditions that arise in API-based LLM evaluation.

FIGURE 8. Bias–variance decomposition at K = 20 (probability resolution 1/K = 0.05). All conformal methods achieve 50–200× lower MSE than non-conformal baselines. Conformal Bias² reflects over-coverage (conservative; coverage values shown below bars), not estimation error. Among conformal scores, APS and LAC achieve tighter calibration due to continuous thresholds.

FIGURE 9. Bias–variance decomposition at K = 10 (probability resolution 1/K = 0.10), the regime typical of API-based evaluation. The conformal advantage persists. Rank provides the widest coverage margin (0.946 vs. APS at 0.901); on real LLM distributions (Section 11.8), this conservatism prevents the coverage violations that affect APS.

C.5. Set size vs. agent quality (Theorem 7.5). We vary p⋆ from 0.3 to 1.0 and measure M⋆. Figure 10 confirms Theorem 7.5: M⋆ is monotonically decreasing in agent quality, reaching M⋆ = 1 for perfect agents (p⋆ = 1) and growing without bound as p⋆ → 0.

FIGURE 10. Set size M⋆ vs. agent quality p⋆. Better agents require smaller prediction sets. The step-function shape reflects the discrete nature of M⋆ (the ⌈(n + 1)(1 − α)⌉-th order statistic of integer-valued scores).

C.6. Set size vs. entropy (Hypothesis H3). We construct agents with varying entropy profiles and measure the correlation between consensus entropy H(x) and prediction set size |S(x)|. Figure 11 shows strong positive correlation (r > 0.5), confirming that prediction set size adapts to per-item uncertainty.

C.7. Canonicalization effect (Proposition 4.6). We simulate fragmented distributions where 6 raw answer variants (total mass p = 0.60) map to a single canonical class. Figure 12 demonstrates that canonicalization consolidates the fragmented mass, dramatically reducing mode error relative to the raw case. This validates the amplification result of Proposition 4.6.
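This fragmentation effect can be reproduced in a few lines. The sketch below simulates a distribution in which the correct answer is split across 6 surface variants and compares mode error with and without merging them; all masses, sample sizes, and trial counts are illustrative, not the paper's exact C.7 configuration.

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical fragmented distribution mirroring C.7: the correct answer
# appears as 6 surface variants (0.60 total mass) and there are 4 distinct
# wrong answers, each variant carrying 0.10 mass.
variants = [f"correct_v{i}" for i in range(6)] + [f"wrong_{i}" for i in range(4)]

def mode_is_correct(K, canonicalize):
    answers = random.choices(variants, k=K)  # uniform 0.10 mass per variant
    if canonicalize:
        # Merge all surface variants of the correct answer into one class.
        answers = ["correct" if a.startswith("correct") else a for a in answers]
    mode, _ = Counter(answers).most_common(1)[0]
    return mode.startswith("correct")

trials = 2000
mode_error = {
    canon: sum(not mode_is_correct(20, canon) for _ in range(trials)) / trials
    for canon in (False, True)
}
```

Without merging, each variant holds only 0.10 mass, so the raw mode is a correct variant only about 60% of the time; with merging, the correct class carries 0.60 mass and the mode error collapses, as Proposition 4.6 predicts.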
FIGURE 11. Prediction set size vs. consensus entropy for synthetic agents. The strong positive correlation confirms that set size adapts to per-item uncertainty.

FIGURE 12. Canonicalization effect: mode error with and without canonicalization for fragmented distributions. Canonicalization consolidates probability mass, reducing mode error exponentially as predicted by Proposition 4.6.

OPIT – OPEN INSTITUTE OF TECHNOLOGY, AND COHORTE AI, PARIS, FRANCE.
Email address: charafeddine@cohorte.co
