Is Conformal Factuality for RAG-based LLMs Robust? Novel Metrics and Systematic Insights

A PREPRINT

Yi Chen, University of Wisconsin-Madison, yi.chen@wisc.edu
Daiwei Chen, University of Wisconsin-Madison, daiwei.chen@wisc.edu
Sukrut Madhav Chikodikar, University of Wisconsin-Madison, chikodikar@wisc.edu
Caitlyn Heqi Yin, University of Wisconsin-Madison, hyin66@wisc.edu
Ramya Korlakai Vinayak, University of Wisconsin-Madison, ramya@ece.wisc.edu

March 17, 2026

ABSTRACT

Large language models (LLMs) frequently hallucinate, limiting their reliability in knowledge-intensive applications. Retrieval-augmented generation (RAG) and conformal factuality have emerged as potential ways to address this limitation. While RAG aims to ground responses in retrieved evidence, it provides no statistical guarantee that the final output is correct. Conformal factuality filtering offers distribution-free statistical reliability by scoring and filtering atomic claims using a threshold calibrated on held-out data; however, the informativeness of the final output is not guaranteed. We systematically analyze the reliability and usefulness of conformal factuality for RAG-based LLMs across generation, scoring, calibration, robustness, and efficiency. We propose novel informativeness-aware metrics that better reflect task utility under conformal filtering. Across three benchmarks and multiple model families, we find that (i) conformal filtering suffers from low usefulness at high factuality levels due to vacuous outputs, (ii) conformal factuality guarantee is not robust to distribution shifts and distractors, highlighting the limitation that requires calibration data to closely match deployment conditions, and (iii) lightweight entailment-based verifiers match or outperform LLM-based model confidence scorers while requiring over 100× fewer FLOPs.
Overall, our results expose factuality-informativeness trade-offs and the fragility of the conformal filtering framework under distribution shifts and distractors, highlighting the need for new approaches to reliability with robustness and usefulness as key metrics, and provide actionable guidance for building RAG pipelines that are both reliable and computationally efficient.

Keywords: RAG · LLM · hallucination mitigation · conformal prediction · factuality guarantee · calibration

1 Introduction

Large language models (LLMs) have demonstrated remarkable capabilities across open-domain question answering, reasoning, and scientific discovery [Brown et al., 2020, Guo et al., 2025, Zhang et al., 2025]. Yet, a persistent barrier to their reliable deployment is the phenomenon of hallucinations: outputs that are fluent and confident but factually incorrect [Ji et al., 2023, Nadeau et al., 2024, Huang et al., 2025]. Such errors are not merely cosmetic. In safety-critical settings, such as medicine, law, or finance, a single fabricated claim can erode trust, propagate misinformation, and incur high societal or financial costs. This makes hallucination mitigation one of the central challenges in advancing trustworthy LLMs.

A rich body of work has emerged to address this challenge along two main directions: (1) retrieval-augmented generation (RAG) and (2) conformal methods. RAG aims to reduce hallucinations by grounding responses in trusted external knowledge sources, typically by conditioning generation on retrieved passages [Lewis et al., 2020, Gao et al., 2023, Siriwardhana et al., 2023]. While RAG reduces the likelihood of unsupported claims, it does not offer statistical guarantees on the factuality of the final response. Even with references, LLMs can produce hallucinations in the generated response [Huang et al., 2025]. The conformal prediction (CP) framework, on the other hand, aims to provide a statistical guarantee on the final output, often via post-processing of the initial LLM response. CP frameworks usually first decompose the initial LLM output into atomic claims, score each claim with a factuality scoring function, and filter out those falling below a threshold determined using a calibration dataset [Mohri and Hashimoto, 2024, Cherian et al., 2024]. This procedure provides formal coverage guarantees but often at the expense of informativeness, since aggressive filtering may yield empty or vacuous outputs. Furthermore, CP filtering cannot improve the usefulness or accuracy of the LLM response; it can only remove hallucinations.

[Figure 1 diagram: Query $x \in X$ and Reference $R(x)$ (input to LLM) → Response Generator $G$ → Output $y$ → Conformal Factuality Framework (conformal filtering, using Calibration Data $\tilde{X}$) → Filtered Output $y'$.]
Figure 1: Overview of our framework. Given a query $x$ and retrieved references $R(x)$, the Response Generator $G$ produces an output $y$. The conformal factuality framework uses separate calibration data to determine a threshold that is used to filter out information from the output $y$ and yield $y'$. (See Figure 2 for the details of the different stages involved in conformal filtering.)

Despite their complementary strengths, it is unclear whether conformal prediction (CP) can improve the reliability of RAG-based LLMs. While several recent works integrate RAG with CP methods [Li et al., 2023, Rouzrokh et al., 2024, Feng et al., 2025], they fall short of a systematic analysis that disentangles where gains come from and when guarantees break down. A comprehensive understanding of the strengths and limitations of this combination requires not only integrating the two frameworks, but also developing principled evaluation. In particular, standard metrics such as empirical factuality can exhibit multiple failure modes and obscure real utility: for example, an empty answer is trivially "factually correct" under such measures.
Additionally, it remains unclear whether improved factuality necessarily requires larger, more computationally expensive verifiers, or whether lightweight alternatives can achieve comparable or superior performance. This raises a scaling-law-style question for factuality filtering: how do reliability and utility scale with verifier capacity and inference cost, and where are the diminishing returns? Understanding the relationship between factuality, model scale, and computational cost is critical for deploying reliable LLM systems in practice, where both latency and compute budgets constrain end-to-end RAG pipelines. We address these issues in this work.

Our Contributions: We systematically investigate the conformal factuality filtering framework for RAG-based LLMs (Figure 1), making the following contributions:

• We propose novel metrics to capture the informativeness component that is often missed by traditional metrics: non-empty rate and non-vacuous empirical factuality, which jointly capture both correctness and information retention, and sufficient correctness and conditional sufficient correctness, which measure whether an output contains enough correct information to infer the final answer to the query. These novel metrics capture the trade-off between factual correctness and informativeness, providing practical insights and tools for future work on hallucination mitigation.

• We conduct a comprehensive evaluation that spans diverse datasets (questions with free-form answers, math, and natural question answering), various open-source model families and sizes (with and without reasoning), and scoring functions (entailment-based and LLM-based scorers). Furthermore, we evaluate robustness against distribution shifts and distractors, shedding light on both the capabilities and limitations of this approach.
Our evaluation highlights the key issues: limited usefulness at high factuality levels and non-robustness to distribution shifts and distractors. This raises the need for new approaches for guaranteeing factuality with usefulness and robustness as important metrics.

• In addition, we analyze the trade-offs between verification accuracy and computational efficiency, demonstrating that lightweight verifiers can outperform larger LLM-based scorers while requiring orders-of-magnitude fewer FLOPs.

2 Problem Setting, Datasets, Models, and Metrics

In this section, we describe the conformal filtering framework for RAG-based LLMs (Figure 1) and introduce the scoring functions, datasets, and evaluation metrics used in this work.

Let $x \in X$ denote an input query. Let $R(x)$ denote the reference material sufficient to answer $x$. We assume the existence of an oracle retriever that can retrieve $R(x)$ such that the true answer $y^\star$ is either in or can be deduced from $R(x)$. This enables us to focus on evaluating the effectiveness of conformal filtering of answers generated by RAG-based LLMs by decoupling the effects of specific methods used for retrieval. A response generator $G$, instantiated as a large language model (LLM), is prompted with both $x$ and $R(x)$ to produce an output $y = G(x, R(x))$. The goal of conformal filtering is to create a final output $y'$ such that each statement in $y'$ is factually correct at a user's expected level, e.g., 85%, while being useful in answering the query. The conformal filtering method as applied to this system is described below.

Conformal Filtering. Let $\tilde{X}$ denote a calibration set that is exchangeable with $X$. For each query $\tilde{x} \in \tilde{X}$, a set of claims $\{c_i\}$ is obtained by parsing the corresponding $y$ using a parser $P$. These claims are then scored by a factuality scoring function $f$.
For each calibration query $\tilde{x}$, the candidate threshold is defined as the smallest value of $\tau$ such that all claims with score above $\tau$ are factual: $\inf\{\tau \mid \forall c \in F(\tau),\ c \text{ is factual}\}$, where $F(\tau) = \{c \mid f(c) > \tau\}$ denotes the set of claims scoring above $\tau$. Given a specified error level $\alpha \in (0, 1)$, the conformal filtering framework determines a threshold $\tau_\alpha$ from the calibration set. Specifically, $\tau_\alpha$ is chosen as the $\frac{\lceil (n+1)(1-\alpha) \rceil}{n}$ quantile of the set of candidate thresholds, where $n = |\tilde{X}|$ is the size of the calibration data. The resulting guarantee is that $\mathbb{P}\left(\forall c \in F(\tau_\alpha),\ c \text{ is factual}\right) \ge 1 - \alpha$. At inference time, for each query $x$, the generated output $y$ is decomposed into atomic claims by a parser $P$, yielding $C(y) = \{c_i\}_{i=1}^{k}$. Each claim $c_i$ is scored with $f$, and those exceeding the threshold are retained: $C'(y) = \{c_i \mid f(c_i) > \tau_\alpha,\ i = 1, \ldots, k\}$. Finally, the retained claims $C'(y)$ are merged by a merger $M$ into a single filtered response: $y' = M(C'(y))$. Figure 2 illustrates the entire conformal filtering pipeline. We use gpt-5-nano [gpt, 2025] in our experiments for tasks such as claim parsing, claim merging, and factuality labeling.¹

[Figure 2 diagram: Input $x$ and Reference → Response Generator → Output $y$ → Parser → Claim 1, ..., Claim $k$ → Scoring function $f$ → Score 1, ..., Score $k$ → Conformal Threshold (from Calibration Data $\tilde{X}$) → Filtered Claim 1, ..., Filtered Claim $k'$ → Merger → Filtered Output $y'$.]
Figure 2: Given an input $x$ and a reference text related to $x$, the Response Generator produces an output $y$, which is then parsed by the Parser into a list of claims. Each individual claim is subsequently scored by the Scorer, conditioned on the input $x$ and, optionally, the reference text. These scores are passed to the conformal prediction algorithm, which filters out claims whose scores fall below a learned threshold. Finally, the remaining claims are merged into a single paragraph and returned to the user.

¹ We validate the factuality labeling quality of using gpt-5-nano as a judge by comparing with human labeling (see Appendix A.3 for details of the human evaluation).

2.1 Scoring Functions

The scoring function $f$ is a core component of the conformal factuality framework: which claims are retained or filtered depends on the score. We study two main families of scoring functions:

(a) Entailment-based scorers are natural language inference (NLI) models that assess whether the reference text supports a claim. We use both document-level and sentence-level variants.
  1. For document-level entailment, the entailment score is computed directly between the entire reference $R(x)$ and the claim $c_i$. We use the entailment model from [Laurer et al., 2022] trained on the DocNLI dataset [Yin et al., 2021].
  2. In the sentence-level entailment setting, entailment scores are computed between the claim $c_i$ and each sentence in $R(x)$, and then aggregated in two different ways:
    (a) Conservative entailment, where a claim is marked as contradictory if any sentence in the reference contradicts it. It is marked as entailed only if there are no contradictions and there is at least one sentence supporting it.
    (b) Average entailment, which averages the scores of all non-neutral sentence-level comparisons.
  For sentence-level entailment, we use roberta-large-mnli as the entailment model [Liu et al., 2019].

(b) LLM-based model confidence scorers. In this family, we prompt a language model to assign a factuality score to each claim.
We explore the design space of the prompt by varying five dimensions: (i) the inclusion of retrieved references $R(x)$, (ii) evidence highlighting within $R(x)$, (iii) the use of Chain-of-Thought (CoT) reasoning, (iv) output granularity for verbalizing (continuous $[0, 1]$ vs. Boolean), and (v) evaluation consistency (single generation vs. averaging over five independent generations). We refer to this scoring function as the model confidence score.

2.2 Datasets

We perform evaluations on three datasets that span open-ended summarization, mathematical reasoning, and question answering tasks. This diversity allows us to assess both the factual reliability and the task-level utility of our approach.

• FActScore dataset [Min et al., 2023] consists of 601 individuals, each paired with a Wikipedia page. Queries are: "Tell me a paragraph biography about [person]," where the reference $R(x)$ is the Wikipedia page of the person. Because no canonical ground-truth answers are provided, this dataset is well-suited for evaluating factuality in open-ended generation and for testing whether models can produce faithful summaries grounded in external references. We additionally consider FActScore-Rare, a subset of 198 queries focusing on less well-known individuals. This subset probes the robustness of models when their parametric knowledge is not enough to answer the question, a regime where hallucinations are more likely if no reference is given.

• MATH dataset [Hendrycks et al., 2021] contains 12,446 competition-style mathematics problems spanning five difficulty levels and seven categories. Each problem provides a question $x$ and a ground-truth answer $y^\star$. To construct reference materials, we prompted gpt-5-nano to generate prerequisite knowledge relevant to solving each problem, which serves as $R(x)$.

• Natural Questions (NQ) dataset [Kwiatkowski et al., 2019] consists of 10K real-world queries collected from search engines.
Each query is annotated with both a long answer and a short answer; we use the long answer as the reference $R(x)$ and the short answer as the ground truth.

Together, these datasets provide coverage over distinct capabilities: factual summarization (FActScore), mathematical reasoning (MATH), and reference-based question answering (NQ).

2.3 Language Models

We evaluate our framework over several open-source language models in order to systematically evaluate different components of the factuality and RAG pipeline under varying architectures, reasoning modes, and parameter scales. Our open-source suite includes multiple families. The Qwen3 models [Yang et al., 2025] are evaluated both in their base form and in a reasoning-enabled variant, Qwen3-Think, where reasoning with the think tag is enabled. This contrast allows us to study whether reasoning-oriented training improves factuality scoring and filtering. To broaden architectural diversity, we also include Llama-3.x-Instruct [Dubey et al., 2024], SmolLM2-Instruct [Allal et al., 2025], and gpt-oss [Agarwal et al., 2025]. Table 1 summarizes the model families, parameter counts, and architectures.

These models are chosen to probe three orthogonal dimensions: (i) model architecture, by comparing across families; (ii) reasoning capability, by contrasting Qwen3 with Qwen3-Think; and (iii) model scale. This diversity allows us to assess how each factor influences factuality scoring, and to highlight regimes where smaller, more efficient models suffice.

Family      Model                    Params   Arch.   Activated
Qwen3       Qwen3-0.6B               0.6B     Dense   0.6B
Qwen3       Qwen3-4B                 4B       Dense   4B
Qwen3       Qwen3-8B                 8.2B     Dense   8.2B
Qwen3       Qwen3-30B-A3B            30.5B    MoE     3.3B
Qwen3       Qwen3-32B                32.8B    Dense   32.8B
Llama-3.x   Llama-3.2-1B-Instruct    1.2B     Dense   1.2B
Llama-3.x   Llama-3.2-3B-Instruct    3.2B     Dense   3.2B
Llama-3.x   Llama-3.1-8B-Instruct    8B       Dense   8B
SmolLM2     SmolLM2-135M-Instruct    135M     Dense   135M
SmolLM2     SmolLM2-360M-Instruct    360M     Dense   360M
SmolLM2     SmolLM2-1.7B-Instruct    1.7B     Dense   1.7B
gpt-oss     gpt-oss-20b              21B      MoE     3.6B
gpt-oss     gpt-oss-120b             117B     MoE     5.1B

Table 1: Summary of evaluated open-source language models. For dense models, the number of activated parameters equals the total parameter count. For MoE models, activated parameter counts are shown when known.

2.4 Evaluation Metrics

We evaluate performance using both traditional factuality measures and proposed novel metrics that are designed to capture aspects of informativeness that existing metrics overlook. We use the following commonly used criteria:

• Empirical Factuality (EF) measures the fraction of outputs $y'$ in which all retained claims $C'(y)$ are factual. By convention, an empty claim set $C'(y) = \emptyset$ is treated as factual, which can artificially inflate EF when filtering is aggressive. Higher EF is better, as it indicates stronger factual reliability.

• Power quantifies the average proportion of true claims retained. Higher Power is better, since it means fewer correct claims are lost.

• False Positive Rate (FPR) measures the fraction of non-factual claims that survive. Lower FPR is better, as it reflects stronger suppression of hallucinations.

• Correctness measures the fraction of outputs equivalent to the ground-truth answer $y^\star$. Higher Correctness is better, though this metric is intentionally strict and is most applicable on datasets with unambiguous ground-truth answers.

These metrics only provide limited insight into the overall usefulness of the final answer.
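As a minimal sketch (not the authors' evaluation code), the claim-level metrics listed above could be tallied as follows. The `Claim` record and the function names are hypothetical. Two conventions are assumptions on our part: Power is averaged per output over outputs that contain at least one factual claim, and FPR is pooled over all non-factual claims.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Claim:
    is_factual: bool  # ground-truth label, e.g. from an annotator or a judge
    retained: bool    # True if the claim survived conformal filtering


def empirical_factuality(outputs: List[List[Claim]]) -> float:
    """Fraction of outputs whose retained claims are all factual.

    An output with no retained claims counts as factual, because all()
    over an empty sequence is True. This is the convention that can
    inflate EF under aggressive filtering.
    """
    ok = [all(c.is_factual for c in claims if c.retained) for claims in outputs]
    return sum(ok) / len(ok)


def power(outputs: List[List[Claim]]) -> float:
    """Mean per-output share of factual claims retained (assumed convention)."""
    shares = []
    for claims in outputs:
        factual = [c for c in claims if c.is_factual]
        if factual:  # outputs without any factual claim are skipped
            shares.append(sum(c.retained for c in factual) / len(factual))
    return sum(shares) / len(shares) if shares else 0.0


def false_positive_rate(outputs: List[List[Claim]]) -> float:
    """Share of all non-factual claims that survive filtering (pooled)."""
    non_factual = [c for claims in outputs for c in claims if not c.is_factual]
    if not non_factual:
        return 0.0
    return sum(c.retained for c in non_factual) / len(non_factual)


if __name__ == "__main__":
    outputs = [
        [Claim(True, True), Claim(True, False), Claim(False, False)],
        [Claim(True, True), Claim(False, True)],
        [Claim(True, False), Claim(False, False)],  # everything filtered out
    ]
    print(empirical_factuality(outputs), power(outputs), false_positive_rate(outputs))
```

In this toy input the third output retains nothing, yet it still counts toward EF, which is exactly the behavior that motivates informativeness-aware metrics.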
Each of the statements in an LLM response could be factually correct while still not being informative enough to answer the input query. Furthermore, a vacuous or empty output is factually correct by definition, but is not useful. When the input reference does contain information to provide a correct answer to the query, an empty final answer is an indication of failure. While EF, Power, and FPR capture factuality and error rates, they fail to penalize vacuous but "factual" outputs. To address this, we propose the following novel evaluation metrics.

• Non-empty Rate (NR), the fraction of outputs that preserve at least one claim. Higher NR is better, rewarding informative responses rather than empty ones.

• Non-vacuous Empirical Factuality (NvEF), which computes EF only over non-empty outputs. Higher NvEF is better, reflecting factuality conditional on informativeness.

• Sufficient Correctness (SC) evaluates whether an output to a given query $x$ (initial output $y$ or filtered output $y'$) contains enough correct information, relative to a reference $R(x)$, to recover the correct answer. Higher SC indicates better end-task utility. Unlike Correctness, which can be overly strict (e.g., penalizing partially correct but still useful responses) or inapplicable when there is no single canonical $y^\star$ (e.g., the open-ended summarization task in FActScore), SC explicitly measures whether the content in the output is sufficient to answer the query.

• Conditional Sufficient Correctness (CSC) restricts evaluation to the filtered outputs $y'$ whose unfiltered counterparts $y$ already satisfy SC. A higher CSC reflects stronger fidelity of the filtering process. CSC isolates the effect of filtering from generation quality. SC on the filtered output can drop either because the base model failed to include sufficient information in the initial response or because filtering removed it.
Since CSC is conditioned on cases where the unfiltered output has sufficient information and asks whether the filtered output preserves that sufficiency, it provides a direct measurement of whether filtering maintains useful content rather than unnecessarily deleting it.

NR and NvEF are claim-level, while correctness and (conditional) sufficient correctness are at the final task-level outcome. Together, these metrics balance factual reliability, informativeness, and task-level utility, ensuring that evaluation reflects not only safety (removing hallucinations) but also usefulness (retaining an adequate signal for the end task).

We now design a series of experiments to systematically analyze: (i) the impact of references, (ii) the design of scoring functions, (iii) robustness to distributional shifts and adversarial inputs, and (iv) end-to-end evaluation with a focus on the overall computation involved.

3 Impact of References

We begin by isolating the role of references, asking: How much do retrieved references improve generation quality before filtering is even applied? To study this, we evaluate outputs $y$ produced by the response generator $G$ under two conditions: query-only generation $y = G(x)$ and query-plus-reference generation $y = G(x, R(x))$. We then measure sufficient correctness of the initial LLM response $y$ with respect to the reference $R(x)$ using gpt-5-nano² on four datasets: FActScore, FActScore-Rare, MATH-200 (a 200-example subset of MATH), and NQ-200 (a 200-example subset of Natural Questions). These datasets help provide insights into whether conditioning on $R(x)$ improves different types of tasks: factual summarization, reasoning, and question answering.
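Before turning to the results, here is a minimal sketch of how the informativeness-aware metrics of Section 2.4 (NR, NvEF, SC, CSC) could be tallied from per-query records. It is an illustration under assumptions, not the authors' implementation: the `Record` fields and function names are hypothetical, and the SC flags are assumed to come from an external judge.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    n_retained: int             # number of claims kept after filtering
    all_retained_factual: bool  # True if every kept claim is factual
    sc_unfiltered: bool         # judge says initial output y satisfies SC
    sc_filtered: bool           # judge says filtered output y' satisfies SC


def _mean(flags: List[bool]) -> Optional[float]:
    """Mean of booleans, or None when there is nothing to average."""
    return sum(flags) / len(flags) if flags else None


def non_empty_rate(records: List[Record]) -> Optional[float]:
    """NR: share of outputs that keep at least one claim."""
    return _mean([r.n_retained > 0 for r in records])


def non_vacuous_ef(records: List[Record]) -> Optional[float]:
    """NvEF: empirical factuality restricted to non-empty outputs."""
    return _mean([r.all_retained_factual for r in records if r.n_retained > 0])


def sufficient_correctness(records: List[Record]) -> Optional[float]:
    """SC measured on the filtered outputs."""
    return _mean([r.sc_filtered for r in records])


def conditional_sc(records: List[Record]) -> Optional[float]:
    """CSC: SC of filtered outputs, only where the unfiltered output had SC."""
    return _mean([r.sc_filtered for r in records if r.sc_unfiltered])


if __name__ == "__main__":
    records = [
        Record(2, True, True, True),
        Record(0, True, True, False),   # empty output: factual but useless
        Record(1, False, False, False),
        Record(3, True, True, True),
    ]
    print(non_empty_rate(records), non_vacuous_ef(records),
          sufficient_correctness(records), conditional_sc(records))
```

Returning `None` for an empty conditioning set is a design choice of this sketch; it avoids reporting a misleading 0 or 1 when no output qualifies.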
[Figure 3 plot: bar charts, one panel per dataset (MATH-200, FActScore-Rare, FActScore, NQ-200). x-axis: Model (Qwen3 0.6B, 4B, 8B, each with and without Think); y-axis: Sufficient Correctness (0.0 to 1.0); bars grouped by Reference (Yes / No).]
Figure 3: Sufficient correctness (SC) of Qwen3 models (0.6B, 4B, 8B) on four datasets (MATH-200, FActScore-Rare, FActScore, NQ-200), with and without access to references. Across model sizes and datasets, providing references consistently improves generation quality.

Results. Figure 3 compares sufficient correctness (SC) with and without references for the Qwen3 family. On FActScore and NQ, references play a major role, highlighting their importance when the model may not have memorized all relevant information. On MATH, while the gap between the two settings decreases as model size increases, SC still improves with the reference, suggesting that although larger models possess stronger reasoning abilities, the reference material still plays a role. For Qwen3-0.6B, enabling think reasoning produces a large jump in SC on the MATH dataset, highlighting the role of reasoning capacity for mathematical questions. In contrast, for FActScore and NQ, where reasoning plays a smaller role, enabling think has little effect. Lastly, we observe diminishing gains as model size increases within the Qwen3 family: the improvement in SC from 4B to 8B is much smaller than the improvement from 0.6B to 4B. Overall, providing references consistently improves generation quality. We observe similar trends for Llama-3.x, SmolLM2, and some frontier models (see Figures 14, 15, and 16 in Appendix A.1.1). On FActScore-Rare, the performance of Qwen3 4B is comparable to that of Gemini 2.5 Pro and GPT-5.1 when a reference is provided (Figure 16).
This demonstrates that having a good reference enables even small and medium-sized LLMs to generate good outputs. References are provided to the generator $G$ in all subsequent experiments, which focus on the effectiveness of conformal filtering.

² More details in Appendix A.1.1.

4 Design Choices for Factuality Scoring in Conformal Filtering

We now turn to the systematic evaluation of conformal filtering, which provides statistical guarantees on the factuality of the final output. A key component of conformal filtering (Figure 2) is the factuality scoring function, which assigns a score indicating how well a generated output is supported by the reference. The effectiveness of conformal filtering therefore depends critically on these scores. In this section, we systematically study several design choices for factuality scoring functions, including prompting strategies for the LLM-based model confidence score, the role of references during scoring, the choice of scorer model, and different families of scoring functions (entailment-based scores compared with LLM-based scores). Through experiments across multiple datasets and model families, we identify practical configurations that improve filtering performance and calibration.

4.1 Prompting Strategies for LLM Model Confidence Scores

We begin by examining the effect of prompting variations for LLM-based scoring functions (model confidence score), including: (i) highlighting supporting evidence, (ii) enabling chain-of-thought reasoning, (iii) scalar vs. Boolean scoring for the verbalized scores, and (iv) consistency averaging (majority of multiple responses). We run the experiments with different configurations of the above-mentioned variations on FActScore, MATH-1K (a 1,050-example subset of MATH), and NQ-1K (a 1,000-example subset of Natural Questions).
References are always provided to both the scoring function $f$ and the LLM generator $G$ that produces the initial response $y$ (Figure 2) in these experiments.³

Results. Figure 4 summarizes the effect of prompting strategies on model confidence scores across three LLM families for a subset of variations on the FActScore dataset at level $1 - \alpha = 0.9$. While no single strategy emerges as universally optimal, several consistent patterns are observed. First, instructing models to output numeric scores reliably outperforms Boolean values. Second, sampling multiple responses provides consistent gains, indicating that aggregation reduces variance and stabilizes confidence estimates.

Figure 4: Evaluation of various prompting strategies across different LLMs on the FActScore dataset (Section 4.1) at level $1 - \alpha = 0.9$. Results demonstrate that: (i) prompting models to generate numeric scores consistently outperforms Boolean scoring; (ii) sampling multiple responses uniformly improves performance; however, (iii) incorporating chain-of-thought reasoning or evidence highlighting does not yield reliable performance gains across models.

4.2 Role of References on Model Confidence Scores

In this section, we evaluate how providing references to the LLM-based model confidence scoring function improves scoring. Outputs $y = G(x, R(x))$ are generated using gpt-5-nano, while open-source models serve as scorers. We use gpt-5-nano for generation because API inference is faster than running models on our own GPUs. When varying reference access, we hold all other prompt settings fixed, including chain-of-thought prompting, scalar scoring, and consistency averaging. Because reference access changes across conditions, we accordingly adjust whether the model is instructed to highlight supporting evidence from the reference. Experiments are conducted on FActScore, MATH-1K, and NQ-1K.

Results.
Figure 5 shows the performance under various metrics for the model confidence scores with and without a reference provided to the model confidence scoring function for the MATH-1K dataset using Qwen3-4B as the scorer. We observe that when a reference is provided to the scoring function, the power is consistently higher than when no reference is given. Similarly, for the non-empty rate (NR), the scoring functions with a reference have an advantage when the target factuality level is higher. These results show the benefit of feeding the reference to the LLM-based model confidence scoring function. Therefore, in the subsequent experiments, we provide a reference to the model confidence scoring function. We defer the results for the FActScore and NQ-1K datasets to Appendix A.1.3.

³ Note that the sufficient correctness (SC) in this section is calculated on the filtered output $y'$ by prompting gpt-5-nano with the prompts in Appendix A.4.13.

Figure 5: Performance of the model confidence score on the MATH-1K dataset with and without a reference provided to the scoring functions, using Qwen3-4B as the LLM-based scorer.

4.3 Model Choice for LLM-based Scorers

In this section, we study the use of different open-source models in model confidence scoring functions $f$, using gpt-5-nano as the generator $G$ for the initial response. All prompts include references, require evidence highlighting and chain-of-thought reasoning, produce scalar scores, and apply consistency averaging. This experiment assesses how the scorer model family and scale affect factuality filtering.

Results. Our experiments reveal that scaling the LLM used in the model confidence scorer does not guarantee improved confidence calibration in conformal factuality, as shown in Figure 6.
While among the Llama-3.x models there is an improvement in terms of power and sufficient correctness with larger models, this trend breaks with other model families. The SmolLM2 models show no systematic benefit as we increase model size, and in the case of Qwen3, scaling even degrades performance. These experiments highlight that smaller models are competitive for scoring in model confidence scores. This observation is practically useful, especially since the model used has to be called repeatedly to score each claim (for each query). We defer the results for different factuality levels $1 - \alpha$ to Appendix A.1.4.

Figure 6: Experimental evaluation across three model families of various scales on the FActScore dataset at level $1 - \alpha = 0.9$. Results demonstrate that increasing model size does not consistently improve the performance of conformal factuality.

4.4 Comparison between Entailment-based and LLM-based Scoring Functions

Beyond model confidence scores, where we utilize an LLM to generate a score, we also examine entailment-based scoring functions. We consider three different entailment-based scoring functions (Section 2.1): document entailment score, average entailment score, and conservative entailment score. We compare model-confidence-based and entailment-based conformal filtering on the FActScore and MATH-1K datasets to assess whether entailment signals offer additional advantages in factuality filtering.

Figure 7: Comparison of entailment-based scores against the model confidence score on the FActScore dataset. Across our evaluated settings, the entailment-based scores match or exceed the model-confidence baseline, despite the entailment model being substantially smaller than the target LLMs.

Results. Figure 7 presents a direct comparison between model confidence scores and entailment-based scoring functions.
Notably, entailment-based scores, in particular the document entailment score, consistently match or exceed the performance of the model confidence score, despite the entailment model being substantially smaller than the target LLMs (footnote 4). Together with our analysis in Section 4.3, this provides the practical insight that targeted, lightweight scorers (or verifiers) can deliver both computational efficiency and superior performance. We also note that, at higher factuality levels, the power is quite limited across all scoring functions. We defer the results on the MATH-1K dataset to Appendix A.1.5.

[Figure 8 plot: (Conditional) Sufficient Correctness (y-axis, 0.0 to 0.8) per model (Qwen3 0.6B, 4B, 8B, 30B-A3B, and 32B, each with and without thinking); one panel per dataset (FActScore, MATH-1K, NQ-1K); legend: Cond. SC, SC.]

Figure 8: Sufficient Correctness (SC) and Conditional Sufficient Correctness (CSC) for the Qwen3 model family at α = 0.05 across the FActScore, MATH-1K, and NQ-1K datasets.

4.5 Conditional Sufficient Correctness of the Filtered Output

Sufficient Correctness (SC) measures whether the output contains sufficient correct information, relative to a reference R(x), to recover the correct answer, with higher values indicating stronger end-task utility. When SC is measured on the final filtered output, it can conflate the roles of conformal filtering and the quality of the LLM generator G that provides the initial unfiltered output y (see Figure 1). Note that conformal filtering can only remove claims from the initial output, not add to them.
So, when the filtered output y′ fails to satisfy SC, we cannot tell whether the failure is caused by the conformal filtering process or by the original output y already lacking sufficient information. To decouple these effects, we propose Conditional Sufficient Correctness (CSC), which evaluates sufficient correctness of the final output y′ conditioned on the initial output y being sufficiently correct. This is done by evaluating sufficient correctness only on instances where the unfiltered output already satisfies SC, thereby isolating how well filtering preserves valid answers. A more detailed description of how SC and CSC are measured is given in Appendix A.1.6.

4 We employ DeBERTa and RoBERTa to compute entailment scores, achieving computational efficiency gains of more than two orders of magnitude compared to LLM-based confidence scoring methods. For detailed comparisons of model parameters and computational complexity, see Table 2.

Results. Figure 8 reports SC and CSC for the Qwen3 family (used in the model confidence scorer) at α = 0.05 across the FActScore, MATH-1K, and NQ-1K datasets. CSC consistently matches or exceeds SC across model scales and reasoning variants. The gap between CSC and SC widens on MATH-1K and NQ-1K. Moreover, we observe that larger models do not automatically yield better SC/CSC within the Qwen3 family: the smaller Qwen3-0.6B performs as well as its 32B counterpart on all three datasets. This aligns with our findings in Section 4.3. A similar scaling trend is observed for the gpt-oss family; results are deferred to Appendix A.1.6.

5 Robustness of Factuality Scoring Functions

Having examined the design and performance of different factuality scoring functions in Section 4.4, we now study their robustness under distribution shifts.
While conformal filtering guarantees factuality at the required level 1 − α, its practical usefulness depends on whether it remains reliable when key assumptions are violated. In particular, the conformal filtering guarantee relies on exchangeability between the claims in the calibration set and the test data. In real-world deployments, the exchangeability assumption may fail due to distribution shifts (which can arise for various reasons, e.g., the phrasing of the input query, or a change in the LLM or the parser used to produce the initial answer and its claims) or adversarial perturbations. To evaluate the robustness of conformal factuality under such conditions, we perform a series of stress tests that examine how scoring functions behave under calibration distribution shifts and distractor injection.

5.1 Robustness to Calibration Distribution Shift

We first study robustness to violations of the exchangeability assumption caused by a shift between the distribution of claims in the test data and that of claims in the calibration data. Conformal factuality assumes that calibration and test data are exchangeable; we evaluate robustness when this assumption is violated by using mismatched calibration data. To achieve this, we use open-source models as the generator to produce y and parse the outputs into claims. Then, we compare the empirical factuality under two different calibration datasets:

• 50 human-annotated claims from Mohri and Hashimoto [2024] (gpt-4-generated claims, denoted as MH). Since these claims are generated by gpt-4, they come from a different distribution than claims generated by the open-source model.
• 50 randomly selected queries from the held-out test half, for which the associated claims are generated by the open-source model and thus follow the same distribution as the test data.

We evaluate empirical factuality using Qwen3-4B, SmolLM2-360M-Instruct, and Llama-3.2-3B-Instruct across the FActScore, MATH-1K, and NQ-1K datasets.
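To make the role of the calibration set concrete, the following sketch shows threshold calibration in the style of split conformal prediction, as in Mohri and Hashimoto [2024]. It is an illustration under stated assumptions, not the implementation used in these experiments: the function names are hypothetical, the per-query nonconformity value is taken to be the largest score among incorrect claims, and claims are retained with a strict `>` rule.

```python
import math


def nonconformity(scores, labels):
    """Largest score among incorrect claims; -inf if every claim is correct.

    Assumption: filtering strictly above this value leaves only correct claims.
    """
    wrong = [s for s, ok in zip(scores, labels) if not ok]
    return max(wrong) if wrong else -math.inf


def calibrate_threshold(calib, alpha):
    """calib: list of (scores, labels) pairs, one pair per calibration query."""
    r = sorted(nonconformity(s, l) for s, l in calib)
    n = len(r)
    k = math.ceil((n + 1) * (1 - alpha))  # conformal rank, 1-indexed
    # If the rank exceeds n, no finite threshold suffices: remove everything.
    return math.inf if k > n else r[k - 1]


def conformal_filter(claims, scores, threshold):
    """Keep only claims scored strictly above the calibrated threshold."""
    return [c for c, s in zip(claims, scores) if s > threshold]


def empirical_factuality(test, threshold):
    """Fraction of test queries whose retained claims are all correct.

    An empty filtered output counts as factual, which is why vacuous
    outputs can inflate this metric.
    """
    ok = 0
    for scores, labels in test:
        kept = [lab for s, lab in zip(scores, labels) if s > threshold]
        ok += all(kept)
    return ok / len(test)
```

The threshold depends only on the ranks of the per-query values in the calibration set. If calibration claims are scored differently from test claims, as in the MH setting, those ranks no longer reflect the test distribution, and the calibrated threshold need not deliver the target level.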
[Figure 9 plot: empirical factuality (y-axis) versus target factuality 1 − α (x-axis, 0.75 to 0.95); panels for NQ-1K, MATH-1K, and FActScore under "Different Distribution" and "Same Distribution" calibration; legend: Model Confidence Score, Document Entailment Score, Conservative Entailment Score, Average Entailment Score.]

Figure 9: Empirical factuality (EF) on FActScore, MATH-1K, and NQ-1K under two calibration settings: (i) calibration claims generated in Mohri and Hashimoto [2024], which come from a different distribution, and (ii) calibration claims drawn from the same distribution as the test data. We use Qwen3-4B as the scoring function. The results show how distribution shift in the calibration set affects conformal factuality guarantees.

[Figure 10 plot: power (y-axis, 0.0 to 1.0) versus target factuality 1 − α (x-axis, 0.75 to 0.95); same panels and legend as Figure 9.]

Figure 10: Power on FActScore, MATH-1K, and NQ-1K under two calibration settings: (i) calibration claims generated in Mohri and Hashimoto [2024], which come from a different distribution, and (ii) calibration claims drawn from the same distribution as the test data. We use Qwen3-4B as the scoring function.

Results. Figure 9 compares empirical factuality (EF) under different sources of calibration data, with test outputs generated and scored by Qwen3-4B. Figure 10 compares the power. When calibration data come from a different distribution (generated by gpt-4), EF can fall below the target level (gray dashed line), especially at higher factuality levels. The drop is significant for MATH-1K. While some entailment-based scorers seem to be robust to this distribution shift, this robustness comes at a steep cost in power, as shown in Figure 10. Furthermore, it is not consistent across language models: switching G and f to Llama-3.2-3B-Instruct or SmolLM2-360M-Instruct yields different behaviors for the scoring functions that seem robust in the Qwen3-4B setting (see Figures 35 and 36 in Appendix A.1.7). This shows that conformal filtering is not robust to distribution shifts between the claims in the calibration set and the test set.

5.2 Robustness to Distractors

Distribution shifts can also occur when LLM outputs contain irrelevant, misleading, or hallucinated content. LLM outputs may include claims that appear plausible yet are factually incorrect.
This can arise because LLMs are susceptible to being distracted by irrelevant information in the input [Shi et al., 2023]. To simulate such conditions, we replace a proportion of factual claims in each test query with distractor statements generated by gpt-5-nano. Specifically, for each query x, its associated reference text R(x), and the set of claims {c_i}, we prompt the LLM to modify each c_i, conditioning on x and R(x), so that the result reflects a type of hallucination the model could plausibly produce. The prompt used to generate these hallucinated claims is documented in Appendix A.4.6. Our goal is to create hallucinated claims that are sufficiently convincing to the model that it would judge them as potentially originating from itself. To ensure this, after we generate a hallucinated claim, we ask the LLM whether it thinks the claim might have been generated or hallucinated by itself (prompt in Appendix A.4.7), given the same x and R(x). If the model judges that it could have produced the hallucinated claim, we keep it. Otherwise, we repeat the process and generate a new hallucinated claim. We evaluate whether the different scoring functions used in the conformal factuality framework can reliably distinguish correct claims from these plausible but incorrect claims. Experiments are performed on the FActScore, MATH-1K, and NQ-1K datasets.

Results. Figure 11 (a)-(d) show that as the distractor rate increases, empirical factuality (EF) drops sharply. This degradation occurs because adding distractors to the test set violates the exchangeability assumption underlying the conformal factuality framework, causing EF to fall below the target level. Although EF can increase when the target factuality is set very high, this comes at the cost of a substantial loss in power, as shown in Figure 11 (e).
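The distractor construction used in this section can be sketched as a generate-and-verify loop. This is a minimal sketch under stated assumptions: `make_distractor` and `is_convincing` are hypothetical stand-ins for the gpt-5-nano prompts in Appendices A.4.6 and A.4.7, and the uniform random choice of which claims to replace, the rounding of the replacement count, and the `max_tries` cap are illustrative choices that the text does not specify.

```python
import random


def inject_distractors(claims, rate, make_distractor, is_convincing,
                       max_tries=5, seed=0):
    """Replace about rate * len(claims) claims with accepted distractors.

    make_distractor(claim) -> str   hypothetical: rewrites a claim into a
                                    plausible hallucination
    is_convincing(candidate) -> bool  hypothetical: self-check that the model
                                      could have produced the candidate
    Returns (new_claims, is_distractor_flags).
    """
    rng = random.Random(seed)
    n_replace = round(rate * len(claims))
    targets = set(rng.sample(range(len(claims)), n_replace))

    out, flags = [], []
    for i, claim in enumerate(claims):
        replaced = False
        if i in targets:
            # Regenerate until the self-check accepts (capped; an assumption).
            for _ in range(max_tries):
                candidate = make_distractor(claim)
                if is_convincing(candidate):
                    out.append(candidate)
                    flags.append(True)
                    replaced = True
                    break
        if not replaced:
            # Keep the original claim if no distractor was accepted.
            out.append(claim)
            flags.append(False)
    return out, flags
```

The returned flags mark the injected claims as incorrect when computing empirical factuality and power on the perturbed test set.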
Our results indicate that conformal filtering with current scoring functions is not robust to distractors, underscoring the need for improved scoring functions that maintain robustness in their presence.

[Figure 11 plot: (a)-(d) empirical factuality (y-axis, 0.00 to 1.00) versus target factuality 1 − α (x-axis, 0.75 to 0.95) with test distractor rates 0.0, 0.1, 0.25, and 0.5; (e) power versus target factuality at test distractor rate 0.1; legend: Model Confidence Score, Document Entailment Score, Conservative Entailment Score, Average Entailment Score.]

Figure 11: (a)-(d) Empirical factuality versus target factuality (1 − α) for Qwen3-4B on the FActScore dataset. Each panel corresponds to a different distractor injection rate in the test set (0.0, 0.1, 0.25, 0.5). The gray dashed diagonal represents perfect calibration (y = x), where empirical factuality matches the target. (e) Power versus target factuality (1 − α) when the injection rate is 0.1.

5.3 Can Distraction-Aware Threshold Help?

A natural attempt to mitigate the effect of distractors is to anticipate their presence by using distraction-aware calibration. To study the efficacy of this approach, we extend the previous setting by introducing distractors not only into the test set but also into the calibration set. This models potential distribution shifts caused by distractor content and evaluates whether conformal filtering remains reliable when calibration data include such claims. Since the true level of distractors in practice is unknown, we vary the amount of distractors in the calibration set while keeping the fraction of distractors in the test set fixed at 0.25. This setup enables us to assess both under-estimation and over-estimation of distractor prevalence. Experiments are conducted on the FActScore, MATH-1K, and NQ-1K datasets.

[Figure 12 plot: (a)-(c) empirical factuality versus target factuality 1 − α with test distractor rate 0.25 and calibration distractor rates 0.1, 0.25, and 0.5; (d)-(e) non-empty rate versus target factuality for (test 0, calibration 0) and (test 0.25, calibration 0.25); legend: Model Confidence Score, Document Entailment Score, Conservative Entailment Score, Average Entailment Score.]

Figure 12: (a)-(c) Empirical factuality versus target factuality (1 − α) for Qwen3-4B on the FActScore dataset when the test set is injected with 25% distractors. We vary the proportion of distractors in the calibration set over (0.1, 0.25, 0.5). When the proportions match, the empirical factuality rises to the y = x line. (d)-(e) Comparison of the non-empty rate when both the test and calibration sets contain no distractors and when both contain 25% distractors. Although introducing distractors into the calibration set can achieve the target factuality, the non-empty rate suffers.

Results. In Figure 12, we show results for a test set with a distractor proportion of 0.25 and varying levels of distractors in the calibration set. When the fraction of distractors is underestimated, the EF is still far below the target (Figure 12 (a)).
From Figure 12 (b)-(c), we can see that introducing a large enough fraction of distractors into the calibration set can raise the empirical factuality. However, this incurs a high cost in the non-empty rate. As we see in Figure 12 (d)-(e), the non-empty rate drops significantly when distractors are introduced into the calibration set. When both the calibration and test sets contain no distractors, the non-empty rate is much higher than when we inject 25% distractors into both. This likely happens because the thresholds found by the conformal filtering framework become more stringent. This suggests that the scoring functions cannot distinguish the factually incorrect distractors from the factually correct claims.

Overall, the experiments on robustness show that the conformal filtering framework is not robust to distribution shifts and distractors. This calls attention to the need to rethink factuality guarantees for LLM outputs, with robustness as an important criterion.

6 Efficiency Evaluation of the Conformal Factuality Pipeline End-to-End

Providing factuality guarantees on the output of an LLM requires additional inference and therefore necessarily increases the expense of the final response. In this section, we evaluate the complete end-to-end pipeline, jointly considering generation, scoring, and conformal filtering, with a particular focus on efficiency measured by FLOPs. We use gpt-oss-20b, a mixture-of-experts (MoE) model with 3.6B active parameters, as the response generator G and consider two options for the scoring function f: (i) using gpt-oss-20b itself as the scorer, reflecting the scenario in which the same model is available for both generation and scoring; and (ii) using Qwen3-8B as the scorer, allowing us to examine whether a dense model can serve as an alternative to a FLOPs-efficient MoE model. Finally, we compare these LLM-based model confidence scorers with entailment-based scoring while analyzing the associated computational costs. All experiments in this section are conducted on the FActScore dataset.

Model                                            Total Tokens   Total FLOPs
gpt-oss-20b (3.6B active) [Agarwal et al., 2025]  2000           1.44 × 10^13
Qwen3-8B (8.19B active) [Yang et al., 2025]       2000           3.28 × 10^13
DeepSeek-R1 (37B active) [DeepSeek-AI, 2025]      2000           1.5 × 10^14
DeBERTa (184M active) [He et al., 2020]           2000           4.9 × 10^11
RoBERTa (356M active) [Liu et al., 2019]          2000           1.6 × 10^12

Table 2: Estimated FLOPs for generating 1000 tokens with a 1000-token prompt (assuming KV caching).

[Figure 13 plot: six panels versus target factuality 1 − α (x-axis, 0.75 to 0.95): Empirical Factuality, Power, False Positive Rate, Non-Empty Rate, Non-Vacuous Empirical Factuality, and Sufficient Correctness; legend: Average Score, Conservative Score, Document Entailment Score, Model Confidence Score (GPT-OSS-20B), Model Confidence Score (Qwen3-8B), y = x.]

Figure 13: Performance of entailment- and confidence-based scorers using gpt-oss-20b and Qwen3-8B on FActScore.

Results. Figure 13 shows that confidence-based scoring with gpt-oss-20b achieves higher power and sufficient correctness than Qwen3-8B, particularly at moderate target factuality levels. Despite activating substantially fewer parameters, gpt-oss-20b consistently outperforms the dense Qwen3-8B, highlighting the advantages of the MoE architecture for scalable factuality filtering. Table 2 further underscores this efficiency–performance trade-off. Notably, the document entailment score based on DeBERTa lies at an even more favorable point on this frontier.
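As a rough cross-check of Table 2, the three LLM rows are consistent with the common rule of thumb of about 2 FLOPs per active parameter per processed token. This rule is an assumption here, not a statement of how the table was computed, and it does not reproduce the DeBERTa and RoBERTa rows, which presumably follow a different estimate. The function name `approx_forward_flops` is hypothetical.

```python
def approx_forward_flops(active_params, total_tokens):
    """Rule-of-thumb forward-pass cost: ~2 FLOPs per active parameter per token.

    Assumption: attention-specific terms are ignored, which is reasonable only
    as a first-order estimate for short contexts.
    """
    return 2 * active_params * total_tokens


# Active parameter counts and token budget as listed in Table 2.
TOTAL_TOKENS = 2000  # 1000-token prompt + 1000 generated tokens
active_params = {
    "gpt-oss-20b": 3.6e9,
    "Qwen3-8B": 8.19e9,
    "DeepSeek-R1": 37e9,
}

estimates = {
    name: approx_forward_flops(p, TOTAL_TOKENS)
    for name, p in active_params.items()
}

for name, flops in estimates.items():
    print(f"{name}: {flops:.3g} FLOPs")
```

Under this approximation the estimates land within about 1.5% of the tabulated values for these three models, so the MoE advantage of gpt-oss-20b over the dense Qwen3-8B in Table 2 follows directly from its smaller active parameter count.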
The DeBERTa-based document entailment score is over 100× more computationally efficient than gpt-oss-20b, yet achieves comparable non-empty rates and sufficient correctness. This suggests that lightweight entailment models can serve as strong surrogates for large LLM-based scorers when computational budgets are constrained, without substantial losses in factuality performance. Together, these results demonstrate that both parameter-efficient MoE models and compact entailment-based scorers offer compelling alternatives to dense large models, enabling end-to-end factuality evaluation that is not only more effective but also dramatically more economical.

7 Related Works

Many studies show that LLMs are prone to hallucinations [Nadeau et al., 2024, Huang et al., 2025] despite their impressive capabilities in summarization, dialogue, and coding [Achiam et al., 2023, Zhang et al., 2024, Nam et al., 2024]. RAG and conformal-prediction-based filtering have emerged as prominent methods to mitigate hallucinations and provide factuality guarantees.

7.1 Retrieval-Augmented Generation

Retrieval-augmented generation (RAG) improves LLM performance on knowledge-intensive tasks by grounding responses in retrieved external context [Lewis et al., 2020, Joren et al., 2025, Gao et al., 2023]. Formally, given a query x, a retriever R returns context R(x) that supplements the model's parametric knowledge, guiding generation toward more factually correct outputs. In this work, we assume access to an oracle retriever that always provides relevant, accurate references. While RAG is powerful, it does not provide statistical guarantees on the factuality of its outputs, and generated responses can still contain hallucinations [Huang et al., 2025].
In addition, LLMs may not utilize references effectively, especially information appearing in the middle of the context, a phenomenon known as lost-in-the-middle [Liu et al., 2023, Ravaut et al., 2023, Chen et al., 2023, Tang et al., 2023]. This may be exacerbated by positional encoding mechanisms such as rotary positional embeddings (RoPE) [Su et al., 2024, Huang et al., 2025]. Another contributing factor is that, in pretraining data, salient information is often located near the beginning or the end of a document rather than in the middle [Ravaut et al., 2023, Huang et al., 2025]. The sufficient context metric is proposed in [Joren et al., 2025] to study the usefulness of the retrieved reference in RAG-based LLMs. In contrast to this metric, which focuses on the quality of retrieval, the sufficient correctness and conditional sufficient correctness metrics proposed in our work focus on the quality of the output from the RAG-based LLM, measured with respect to the reference.

7.2 Conformal Prediction

Conformal prediction provides another promising mitigation strategy by filtering non-factual content from LLM outputs [Vovk et al., 2005, Angelopoulos and Bates, 2021, Mohri and Hashimoto, 2024, Cherian et al., 2024]. These methods not only improve factuality but also offer a statistical guarantee, e.g., P(output is factual) ≥ 1 − α. For instance, Mohri and Hashimoto [2024] introduced conformal factuality, which scores claims in the output and removes those below a threshold calibrated on held-out data. The choice of scoring function f is therefore critical to the effectiveness of the framework.

7.3 Conformal Prediction and RAG

Recent work has begun integrating conformal prediction into retrieval-augmented generation (RAG) to provide statistical reliability guarantees.
TRAQ [Li et al., 2023] applies conformal prediction at both the retriever and generator stages, ensuring with high probability that a semantically correct answer is included in the output set. Conformal-RAG [Feng et al., 2025] instead operates at the sub-claim level, filtering unreliable statements to guarantee factuality across domains. Conflare [Rouzrokh et al., 2024] focuses on the retrieval stage, calibrating similarity thresholds so that retrieved contexts contain the true answer with user-specified confidence. While these approaches provide important coverage guarantees at different stages of the pipeline, they largely assess correctness in isolation. In contrast, our work provides a systematic analysis of the performance of conformal filtering for guaranteeing factuality of RAG-based LLMs by introducing new metrics (non-empty rate, non-vacuous empirical factuality, sufficient correctness, and conditional sufficient correctness) that explicitly capture the trade-off between correctness and informativeness, which existing frameworks do not address. Furthermore, we perform a robustness analysis that reveals the fragility of conformal filtering under distribution shifts and in the presence of distractors.

8 Conclusion

We systematically investigate the conformal filtering framework used to guarantee the factuality of RAG-based LLM outputs. Our experiments extensively study the importance of references for generation and scoring, various scoring functions, sensitivity to calibration data, robustness to distractor-induced hallucinations, and the efficiency of the end-to-end pipeline.
A key limitation we uncover is that standard factuality measures can overstate practical progress: because they primarily reward the absence of incorrect content, they can be optimized by filtering systems that abstain (returning empty answers) or produce generic, non-committal responses that are technically "factual" yet unhelpful, even when the input reference has the information needed to answer the question. As a result, high empirical factuality may coincide with low end-task utility, obscuring the correctness–informativeness trade-off that real deployments must navigate. To address these limitations, we introduced novel metrics, namely non-empty rate, non-vacuous empirical factuality, and (conditional) sufficient correctness, that explicitly measure, alongside empirical factuality, whether filtered outputs remain informative and sufficient for answering the query.

Our experiments span three datasets for diverse tasks (FActScore for open-ended summarization, MATH for mathematical queries, and Natural Questions) and multiple model families, revealing several insights. In particular, we show that stronger factuality does not require larger or more expensive verifiers: lightweight entailment-based verifiers match or outperform LLM-based confidence scorers while requiring orders of magnitude fewer FLOPs. Our comprehensive analysis of robustness reveals that current conformal filtering approaches are not robust under distribution shift and distractors, which is an important limitation for practical usage in safety-critical settings. Overall, our findings provide actionable guidance for building reliable and efficient RAG systems and underscore the need to rethink how factuality in LLMs is measured, enforced, and optimized under realistic deployment constraints, with a particular focus on robustness and usefulness.
References

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, 2025.

Yanbo Zhang, Sumeer A Khan, Adnan Mahmud, Huck Yang, Alexander Lavin, Michael Levin, Jeremy Frey, Jared Dunnmon, James Evans, Alan Bundy, et al. Exploring the role of large language models in the scientific method: from hypothesis to discovery. npj Artificial Intelligence, 1(1):14, 2025.

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM computing surveys, 55(12):1–38, 2023.

David Nadeau, Mike Kroutikov, Karen McNeil, and Simon Baribeau. Benchmarking llama2, mistral, gemma and gpt for factuality, toxicity, bias and propensity for hallucinations. arXiv preprint arXiv:2404.09785, 2024.

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, 2025.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459–9474, 2020.
Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2(1), 2023.

Shamane Siriwardhana, Rivindu Weerasekera, Elliott Wen, Tharindu Kaluarachchi, Rajib Rana, and Suranga Nanayakkara. Improving the domain adaptation of retrieval augmented generation (rag) models for open domain question answering. Transactions of the Association for Computational Linguistics, 11:1–17, 2023.

Christopher Mohri and Tatsunori Hashimoto. Language models with conformal factuality guarantees. In Proceedings of the 41st International Conference on Machine Learning, ICML '24. JMLR.org, 2024.

John Cherian, Isaac Gibbs, and Emmanuel Candes. Large language model validity via enhanced conformal prediction methods. Advances in Neural Information Processing Systems, 37:114812–114842, 2024.

Shuo Li, Sangdon Park, Insup Lee, and Osbert Bastani. Traq: Trustworthy retrieval augmented question answering via conformal prediction. arXiv preprint, 2023.

Pouria Rouzrokh, Shahriar Faghani, Cooper U Gamble, Moein Shariatnia, and Bradley J Erickson. Conflare: conformal large language model retrieval. arXiv preprint, 2024.

Naihe Feng, Yi Sui, Shiyi Hou, Jesse C Cresswell, and Ga Wu. Response quality assessment for retrieval-augmented generation via conditional conformal factuality. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2832–2836, 2025.

Aug 2025. URL https://openai.com/index/introducing-gpt-5/.

Moritz Laurer, Wouter van Atteveldt, Andreu Salleras Casas, and Kasper Welbers. Less annotating, more classifying – addressing the data scarcity issue of supervised machine learning with deep transfer learning and BERT-NLI, June 2022. URL https://osf.io/74b8k. Preprint, Open Science Framework.
W enpeng Y in, Dragomir Radev , and Caiming Xiong. Docnli: A large-scale dataset for document-le vel natural language inference. arXiv preprint , 2021. Y inhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy , Mike Lewis, Luk e Zettlemoyer , and V eselin Stoyanov . Roberta: A rob ustly optimized bert pretraining approach. arXiv pr eprint arXiv:1907.11692 , 2019. Sew on Min, Kalpesh Krishna, Xinxi L yu, Mike Lewis, W en-tau Y ih, Pang W ei K oh, Mohit Iyyer, Luk e Zettlemoyer , and Hannaneh Hajishirzi. Factscore: Fine-grained atomic ev aluation of factual precision in long form text generation. arXiv pr eprint arXiv:2305.14251 , 2023. Dan Hendrycks, Collin Burns, Saurav Kadav ath, Akul Arora, Ste ven Basart, Eric T ang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv pr eprint arXiv:2103.03874 , 2021. 15 Is Conformal Factuality for RA G-based LLMs Robust? A P R E P R I N T T om Kwiatkowski, Jennimaria Palomaki, Olivia Redﬁeld, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob De vlin, K enton Lee, et al. Natural questions: a benchmark for question answering research. T ransactions of the Association for Computational Linguistics , 7:453–466, 2019. An Y ang, Anfeng Li, Baosong Y ang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bo wen Y u, Chang Gao, Chengen Huang, Chenxu Lv , et al. Qwen3 technical report. arXiv pr eprint arXiv:2505.09388 , 2025. Abhimanyu Dubey , Abhina v Jauhri, Abhina v Pandey , Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur , Alan Schelten, Amy Y ang, Angela Fan, et al. The llama 3 herd of models. arXiv e-prints , pages arXiv–2407, 2024. Loubna Ben Allal, Anton Lozhkov , Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis T unstall, Andrés Maraﬁoti, Hynek K ydlí ˇ cek, Agustín Piqueres Lajarín, V aibhav Sri vastav , et al. Smollm2: When smol goes big–data-centric training of a small language model. 
arXiv preprint, 2025.

Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint, 2025.

Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H Chi, Nathanael Schärli, and Denny Zhou. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pages 31210–31227. PMLR, 2023.

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948.

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention. arXiv preprint, 2020.

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

Yang Zhang, Hanlei Jin, Dan Meng, Jun Wang, and Jinghua Tan. A comprehensive survey on process-oriented automatic text summarization with exploration of llm-based methods. arXiv preprint, 2024.

Daye Nam, Andrew Macvean, Vincent Hellendoorn, Bogdan Vasilescu, and Brad Myers. Using an llm to help with code understanding. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pages 1–13, 2024.

Hailey Joren, Jianyi Zhang, Chun-Sung Ferng, Da-Cheng Juan, Ankur Taly, and Cyrus Rashtchian. Sufficient context: A new lens on retrieval augmented generation systems. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=Jjr2Odj8DJ.

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. arXiv preprint, 2023.
Mathieu Ravaut, Aixin Sun, Nancy F Chen, and Shafiq Joty. On context utilization in summarization with large language models. arXiv preprint, 2023.

Hung-Ting Chen, Fangyuan Xu, Shane Arora, and Eunsol Choi. Understanding retrieval augmentation for long-form question answering. arXiv preprint, 2023.

Raphael Tang, Xinyu Zhang, Xueguang Ma, Jimmy Lin, and Ferhan Ture. Found in the middle: Permutation self-consistency improves listwise ranking in large language models. arXiv preprint arXiv:2310.07712, 2023.

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic learning in a random world, volume 29. Springer, 2005.

Anastasios N Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv preprint, 2021.

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

A Appendix

This appendix provides supplementary material that supports and extends the analyses presented in the main paper. It is organized as follows. Appendix A.1 presents extended experimental results that complement the main findings. Section A.1.1 broadens the analysis of how retrieved references improve generation quality (Section 3) to additional model families (Llama-3.x, SmolLM2) and frontier models. Section A.1.2 provides the full set of prompting-strategy comparisons across all target factuality levels, extending the summary in Section 4.1.
Section A.1.3 expands the reference-in-scoring analysis (Section 4.2) to four scorer models and three datasets. Section A.1.4 supplements the scorer scaling study (Section 4.3) with detailed radar plots at multiple factuality targets. Section A.1.5 extends the comparison between entailment-based and LLM-based scoring functions (Section 4.4) to MATH-1K. Section A.1.6 reports sufficient correctness and conditional sufficient correctness for model families beyond Qwen3 (Section 4.5). Section A.1.7 and Section A.1.8 extend the robustness analyses of Sections 5.1 and 5.2 to additional scoring functions and model families. Section A.1.9 complements Section 5.3 with full results on distraction-aware calibration. Appendix A.2 details how adversarial distractor claims are generated and verified, supporting the experimental protocol in Section 5.2. Appendix A.3 reports the human evaluation used to validate gpt-5-nano as a factuality judge. Appendix A.4 collects all prompts used throughout our experiments, including those for generation, parsing, labeling, scoring, merging, correctness evaluation, and distractor creation.

A.1 Extended Results

The following subsections present extended experimental results that complement the main-paper analyses. Each subsection identifies the corresponding main-paper section and provides additional figures covering model families, datasets, or configurations not shown in the main text.

A.1.1 Impact of References

Section 3 of the main paper demonstrates that providing retrieved references to the response generator consistently improves sufficient correctness, using the Qwen3 family as the primary illustration. Here we first describe how sufficient correctness is measured and then verify that this finding generalizes across model families and scales.

Measuring sufficient correctness. For each original generated output $y$, we use gpt-5-nano with the prompt provided in Appendix A.4.13 to assess whether $y$ is sufficiently correct. Concretely, we replace the {response} placeholder in the prompt with the string representation of $y$.
At the same time, we replace {query} and {reference} with the corresponding query and reference, respectively.

Llama-3.x family. Figure 14 extends the reference-impact analysis to the Llama-3.x family. Consistent with the Qwen3 results, all three model sizes (1B, 3B, and 8B) show clear gains in sufficient correctness when references are provided, across all four datasets. The improvement is especially pronounced on FActScore and NQ-200, where parametric knowledge alone is insufficient.

[Figure 14 plot: four panels (MATH-200, FActScore-Rare, FActScore, NQ-200); x-axis: model (Llama-3.2 1B Instruct, Llama-3.2 3B Instruct, Llama-3.1 8B Instruct); y-axis: sufficient correctness; legend: reference yes vs. no.]

Figure 14: Sufficient correctness (SC) of Llama-3.x models on four datasets (MATH-200, FActScore-Rare, FActScore, NQ-200), with and without access to references. Across model sizes and datasets, providing references consistently improves generation quality.

SmolLM2 family. Figure 15 presents the same comparison for the SmolLM2 family. Despite their substantially smaller parameter counts (135M–1.7B), these models also benefit from references, although the absolute level of sufficient correctness remains lower than that of the larger families. This confirms that reference grounding is beneficial even at very small scales.
[Figure 15 plot: four panels (MATH-200, FActScore-Rare, FActScore, NQ-200); x-axis: model (SmolLM2 135M Instruct, SmolLM2 360M Instruct, SmolLM2 1.7B Instruct); y-axis: sufficient correctness; legend: reference yes vs. no.]

Figure 15: Sufficient correctness (SC) of SmolLM2 models on four datasets (MATH-200, FActScore-Rare, FActScore, NQ-200), with and without access to references. Across model sizes and datasets, providing references consistently improves generation quality.

Frontier model comparison. Finally, Figure 16 compares the Qwen3 family against two frontier models, gemini-2.5-pro [Team et al., 2023] and gpt-5.1 [gpt, 2025], on the FActScore-Rare dataset. A noteworthy finding is that Qwen3-4B, when given a reference, achieves sufficient correctness comparable to these frontier models. This underscores that, with proper retrieval augmentation, even moderately sized models can match frontier-level factual accuracy on knowledge-intensive tasks.

[Figure 16 plot: panel (a) FActScore-Rare; x-axis: model (Qwen3 0.6B, Qwen3 0.6B Think, Qwen3 4B, Qwen3 4B Think, Qwen3 8B, Qwen3 8B Think, Gemini 2.5 Pro, GPT-5.1); y-axis: sufficient correctness.]

Figure 16: Sufficient correctness (SC) of Qwen3 and frontier models on FActScore-Rare, with and without access to references. Providing references consistently improves generation quality; notably, Qwen3-4B with references is comparable to frontier models.

A.1.2 Prompting Strategies for Scorers

Figure 4 in Section 4.1 summarizes the effect of prompting strategies on LLM-based scoring functions using three representative configurations per model. Here we present the full set of comparisons at five target factuality levels ($1 - \alpha \in \{0.75, 0.8, 0.85, 0.9, 0.95\}$), enabling a more granular assessment of how each prompting dimension (evidence highlighting, chain-of-thought reasoning, scalar vs. Boolean scoring, and consistency averaging) interacts with the target factuality.

FActScore dataset. Figure 17 shows radar plots for all 16 prompting strategies across four LLMs on FActScore. The plots confirm the main-paper finding that numeric scoring and consistency averaging are the most reliably beneficial dimensions, while chain-of-thought and highlighting provide inconsistent gains that vary by model and factuality target.

MATH-1K dataset. Figure 18 repeats this analysis on MATH-1K. On mathematical reasoning tasks, the relative advantage of numeric scoring is even more pronounced, likely because scalar confidence values better capture the degree of certainty in multi-step derivations than binary labels.

Figure 17: Overall performance comparison of model confidence scores across 16 prompting strategies at five target factuality levels ($1 - \alpha$) on the FActScore dataset. Each column corresponds to a different LLM scorer; each row corresponds to a different target factuality. Numeric scoring and consistency averaging yield the most robust improvements.

Figure 18: Overall performance comparison of model confidence scores across various prompting strategies at five target factuality levels ($1 - \alpha$) on the MATH-1K dataset. The advantage of numeric over Boolean scoring is especially clear for mathematical reasoning.

A.1.3 Role of References in Scoring

Section 4.2 demonstrates that providing references to the scoring function improves power and non-empty rate, using Qwen3-4B on MATH-1K as the primary example.
Here we extend this analysis to additional scorer models and datasets to assess how broadly the benefit holds.

Qwen3-4B across datasets. Figures 19 and 20 show the reference effect for Qwen3-4B on FActScore and NQ-1K, respectively. On both datasets, providing the reference to the scorer consistently improves power and non-empty rate, mirroring the MATH-1K results in the main paper. The gains are largest at high target factuality, where the scoring function must be most discriminating.

Figure 19: Performance of model confidence score on FActScore with and without reference provided to scoring functions, using gpt-5-nano as generator and Qwen3-4B as scorer. Reference access improves power and non-empty rate.

Figure 20: Performance of model confidence score on NQ-1K with and without reference provided to scoring functions, using gpt-5-nano as generator and Qwen3-4B as scorer. The benefit of reference access is consistent with the MATH-1K and FActScore results.

Qwen3-8B across datasets. Figures 21–23 repeat the analysis with Qwen3-8B. The trends are qualitatively similar to Qwen3-4B: reference access yields consistent improvements, although the magnitude of the gain varies across datasets. This suggests that the benefit of feeding references to the scorer is not an artifact of a particular model scale.

Figure 21: Performance of model confidence score on FActScore with and without reference provided to scoring functions, using gpt-5-nano as generator and Qwen3-8B as scorer.

Figure 22: Performance of model confidence score on MATH-1K with and without reference provided to scoring functions, using gpt-5-nano as generator and Qwen3-8B as scorer.
Figure 23: Performance of model confidence score on NQ-1K with and without reference provided to scoring functions, using gpt-5-nano as generator and Qwen3-8B as scorer.

Llama-3.2-3B across datasets. Figures 24–26 present the same comparison for Llama-3.2-3B-Instruct. The reference benefit is again observed, confirming that it is not specific to the Qwen3 family. Notably, on FActScore, the improvement in sufficient correctness from adding a reference is among the largest we observe across all model–dataset pairs.

Figure 24: Performance of model confidence score on FActScore with and without reference provided to scoring functions, using gpt-5-nano as generator and Llama-3.2-3B-Instruct as scorer.

Figure 25: Performance of model confidence score on MATH-1K with and without reference provided to scoring functions, using gpt-5-nano as generator and Llama-3.2-3B-Instruct as scorer.

Figure 26: Performance of model confidence score on NQ-1K with and without reference provided to scoring functions, using gpt-5-nano as generator and Llama-3.2-3B-Instruct as scorer.

SmolLM2-1.7B across datasets. Finally, Figures 27–29 show results for SmolLM2-1.7B-Instruct, the smallest LLM-based scorer in our study. Even at this scale, reference access improves scoring performance, although the absolute gains are more modest. This is consistent with the hypothesis that smaller models have less capacity to leverage long reference contexts, but still benefit from the additional grounding signal.

Figure 27: Performance of model confidence score on FActScore with and without reference provided to scoring functions, using gpt-5-nano as generator and SmolLM2-1.7B-Instruct as scorer.
Figure 28: Performance of model confidence score on MATH-1K with and without reference provided to scoring functions, using gpt-5-nano as generator and SmolLM2-1.7B-Instruct as scorer.

Figure 29: Performance of model confidence score on NQ-1K with and without reference provided to scoring functions, using gpt-5-nano as generator and SmolLM2-1.7B-Instruct as scorer.

A.1.4 Model Choice for Scorers

Section 4.3 examines how the scorer model family and scale affect factuality filtering, using gpt-5-nano as the generator. The main paper presents a condensed summary; here we provide the full set of radar plots at each target factuality level ($1 - \alpha \in \{0.75, 0.8, 0.85, 0.9, 0.95\}$) for all three model families (Llama-3.x, Qwen3, and SmolLM2). Figure 30 reveals the heterogeneous scaling behaviors discussed in Section 4.3 in greater detail. While the Llama-3.x family shows monotonic improvement with model size across most metrics and factuality targets, this pattern does not hold for Qwen3 or SmolLM2. For Qwen3, the smallest model (0.6B) sometimes outperforms larger variants, and for SmolLM2, there is no systematic ordering by parameter count. These results reinforce the main-paper conclusion that scaling alone does not guarantee improved conformal factuality.

Figure 30: Overall performance comparison of model confidence scores across different model scales and families at five target factuality levels ($1 - \alpha$) on the FActScore dataset. While Llama-3.x models show consistent gains with scale, Qwen3 and SmolLM2 do not, highlighting that scaling alone is insufficient for improving conformal factuality.
A.1.5 Comparison between Entailment-based and LLM-based Scoring Functions

Section 4.4 compares the model confidence score with entailment-based scoring functions on the FActScore dataset. Here, we extend the results to the MATH-1K dataset.

Figure 31: Comparison of entailment-based scores against the model confidence score on the MATH-1K dataset.

As Figure 31 shows, the model confidence score yields better power and non-empty rate. However, in terms of empirical factuality and non-vacuous empirical factuality, the gap between the scoring functions is small. Moreover, for some models, the document entailment score achieves better sufficient correctness and accuracy than the model confidence score.

A.1.6 Sufficient Correctness and Conditional Sufficient Correctness

Section 4.5 introduces Conditional Sufficient Correctness (CSC) and reports results for the Qwen3 family. Here, we first describe how SC and CSC are measured. Then, we extend this analysis to three additional model families (Llama-3.x, SmolLM2, and gpt-oss) to assess whether the observed patterns generalize.

Measuring (conditional) sufficient correctness. Let $y$ and $y'$ denote the original and filtered outputs for the same input, respectively. In this section, sufficient correctness measures whether the filtered output $y'$ contains enough correct information to recover the answer. To evaluate this, we use gpt-5-nano with the prompt in Appendix A.4.13, replacing the placeholder {response} with the string representation of $y'$. Let $n$ denote the number of examples in the dataset. We then define sufficient correctness as

$$\mathrm{SC} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\left[ y'_i \text{ is sufficiently correct} \right].$$

Conditional sufficient correctness is defined by restricting attention to examples for which the original output $y$ is already sufficiently correct.
Formally,

$$\mathrm{CSC} = \frac{\sum_{i=1}^{n} \mathbb{I}\left[ y'_i \text{ is sufficiently correct} \right]}{\sum_{i=1}^{n} \mathbb{I}\left[ y_i \text{ is sufficiently correct} \right]}.$$

Llama-3.x family. Figure 32 shows SC and CSC for Llama-3.x at $\alpha = 0.05$. As with Qwen3, CSC consistently exceeds SC, confirming that a substantial portion of SC failures are attributable to the generator rather than the filter. The gap is especially wide on MATH-1K, where the smaller models struggle to produce sufficiently correct unfiltered outputs.

[Figure 32 plot: three panels (FActScore, MATH-1K, NQ-1K); x-axis: model (Llama 3.2 1B Instruct, Llama 3.2 3B Instruct, Llama 3.1 8B Instruct); y-axis: (conditional) sufficient correctness; legend: Cond. SC vs. SC.]

Figure 32: Sufficient Correctness (SC) and Conditional Sufficient Correctness (CSC) for the Llama-3.x family at $\alpha = 0.05$ across FActScore, MATH-1K, and NQ-1K. CSC consistently exceeds SC, indicating that filtering largely preserves useful content when the generator provides it.

SmolLM2 family. Figure 33 presents the same comparison for SmolLM2. The absolute SC values are lower due to the smaller model sizes, but the CSC–SC gap is qualitatively similar. Notably, there is no consistent improvement in SC or CSC with model size within this family, echoing the main-paper finding that scaling the scorer does not reliably improve conformal factuality.

[Figure 33 plot: three panels (FActScore, MATH-1K, NQ-1K); x-axis: model (SmolLM2 135M Instruct, SmolLM2 360M Instruct, SmolLM2 1.7B Instruct); y-axis: (conditional) sufficient correctness; legend: Cond. SC vs. SC.]

Figure 33: Sufficient Correctness (SC) and Conditional Sufficient Correctness (CSC) for the SmolLM2 family at $\alpha = 0.05$ across FActScore, MATH-1K, and NQ-1K. No consistent scaling benefit is observed within this family.

gpt-oss family. Figure 34 compares the two gpt-oss models (20B and 120B). Despite a roughly 6× difference in total parameters, the larger model does not consistently outperform the smaller one in SC or CSC, further supporting the conclusion that parameter count is not the primary driver of conformal factuality performance.

[Figure 34 plot: three panels (FActScore, MATH-1K, NQ-1K); x-axis: model (gpt-oss-20b, gpt-oss-120b); y-axis: (conditional) sufficient correctness; legend: Cond. SC vs. SC.]

Figure 34: Sufficient Correctness (SC) and Conditional Sufficient Correctness (CSC) for the gpt-oss family at $\alpha = 0.05$ across FActScore, MATH-1K, and NQ-1K. The larger gpt-oss-120b does not consistently outperform gpt-oss-20b.

A.1.7 Robustness to Calibration Distribution Shift

Section 5.1 studies the sensitivity of conformal factuality guarantees to distribution shift between calibration and test data, using Qwen3-4B as the scorer. Here we extend this analysis to Llama-3.2-3B-Instruct and SmolLM2-360M-Instruct to assess whether the robustness (or lack thereof) of different scoring functions is consistent across model families.

Llama-3.2-3B-Instruct. Figure 35 shows empirical factuality under same-distribution and different-distribution calibration for Llama-3.2-3B-Instruct. When calibration data come from a different distribution (left panels), the factuality guarantee frequently fails across all three datasets. Interestingly, the entailment-based scorers, which appeared robust to distribution shift under Qwen3-4B, show degraded performance here, indicating that robustness to distribution shift is model-dependent rather than a universal property of any particular scoring function family.
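The role of exchangeability in this failure can be illustrated with a small simulation. The block below is a minimal sketch under stated assumptions, not the experimental pipeline: the data are synthetic, all function names (`nonconformity`, `calibrate`, `empirical_factuality`, `make_examples`) are hypothetical, and the per-example nonconformity value is assumed to be the highest score given to an incorrect claim. It calibrates a threshold on one score distribution and then measures empirical factuality on a test set drawn from the same distribution and on one where incorrect claims receive higher scores.

```python
import numpy as np


def nonconformity(scores, labels):
    """Smallest threshold that filters out every incorrect claim of one example.

    Claims are kept when score > threshold, so the threshold has to reach the
    highest score given to an incorrect claim (-inf if all claims are correct).
    """
    wrong = scores[~labels]
    return float(wrong.max()) if wrong.size else float("-inf")


def calibrate(cal_examples, alpha):
    """Return the conformal threshold for target factuality 1 - alpha."""
    r = np.sort([nonconformity(s, l) for s, l in cal_examples])
    n = len(r)
    k = int(np.ceil((n + 1) * (1 - alpha)))  # 1-indexed rank of the quantile
    return float("inf") if k > n else float(r[k - 1])


def empirical_factuality(examples, tau):
    """Fraction of examples whose retained claims (score > tau) are all correct."""
    return float(np.mean([bool(np.all(l[s > tau])) for s, l in examples]))


def make_examples(rng, n, wrong_mean):
    """Synthetic examples (assumption): 5 claims each, about 70% correct,
    with incorrect claims scored around `wrong_mean`."""
    examples = []
    for _ in range(n):
        labels = rng.random(5) < 0.7  # True marks a correct claim
        scores = np.where(labels,
                          rng.normal(0.8, 0.1, 5),
                          rng.normal(wrong_mean, 0.1, 5))
        examples.append((scores, labels))
    return examples


rng = np.random.default_rng(0)
alpha = 0.1
cal = make_examples(rng, 500, wrong_mean=0.3)
test_same = make_examples(rng, 500, wrong_mean=0.3)   # same distribution
test_shift = make_examples(rng, 500, wrong_mean=0.6)  # shifted score behavior

tau = calibrate(cal, alpha)
ef_same = empirical_factuality(test_same, tau)
ef_shift = empirical_factuality(test_shift, tau)
print(f"threshold={tau:.3f}  EF same={ef_same:.3f}  EF shifted={ef_shift:.3f}")
```

In this synthetic setup the threshold fitted to low-scoring errors is too lenient once incorrect claims receive higher scores, so the shifted test set falls well below the target while the matched test set stays close to it. An empty filtered output counts as factual here, which is why a threshold of infinity trivially satisfies the target.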
[Figure 35 plot: six panels (NQ-1K, MATH-1K, FActScore, each under different-distribution and same-distribution calibration); x-axis: target factuality ($1 - \alpha$), 0.75 to 0.95; y-axis: empirical factuality; curves: model confidence score, document entailment score, conservative entailment score, average entailment score.]

Figure 35: Empirical factuality (EF) on FActScore, MATH-1K, and NQ-1K under same-distribution and different-distribution calibration, using Llama-3.2-3B-Instruct as the scorer. Distribution shift causes the factuality guarantee to break for several scoring functions.

SmolLM2-360M-Instruct. Figure 36 presents the same analysis for SmolLM2-360M-Instruct. The pattern is consistent: different-distribution calibration leads to violations of the target factuality level, particularly on NQ-1K and MATH-1K. Together with the Qwen3-4B and Llama-3.2-3B results, these findings underscore that calibration data must be collected using the same generator and scorer that will be deployed, as even switching the underlying LLM for the same dataset can break the exchangeability assumption.
[Figure 36 plot: six panels (NQ-1K, MATH-1K, FActScore, each under different-distribution and same-distribution calibration); x-axis: target factuality ($1 - \alpha$), 0.75 to 0.95; y-axis: empirical factuality; curves: model confidence score, document entailment score, conservative entailment score, average entailment score.]

Figure 36: Empirical factuality (EF) on FActScore, MATH-1K, and NQ-1K under same-distribution and different-distribution calibration, using SmolLM2-360M-Instruct as the scorer. The results confirm that distribution shift in the calibration set undermines conformal factuality guarantees across model families.

A.1.8 Robustness to Adversarial Distractors

Section 5.2 examines how injecting adversarial distractor claims into the test set degrades empirical factuality, using Qwen3-4B on FActScore as the primary example. Here we extend this analysis to additional scorers (Llama-3.2-3B-Instruct and SmolLM2-1.7B-Instruct) and datasets (MATH-1K and NQ-1K) to assess the generality of this vulnerability.

FActScore dataset. Figures 37 and 38 show that both Llama-3.2-3B-Instruct and SmolLM2-1.7B-Instruct exhibit the same sharp degradation in empirical factuality as the distractor rate increases, consistent with the Qwen3-4B results. This confirms that the vulnerability to adversarial distractors is not model-specific but a systemic issue with the current conformal filtering framework.
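One way to picture why a threshold calibrated on clean data cannot absorb injected distractors is sketched below. This is an illustrative simulation, not the protocol of Section 5.2: the distractor rate is interpreted, as an assumption, as the probability that a test example receives one additional false claim; the scorer is assumed to give that claim a fixed high score (`distractor_score`); and `inject_distractors` and `empirical_factuality` are hypothetical helpers.

```python
import numpy as np


def inject_distractors(examples, rate, rng, distractor_score=0.85):
    """Append one false claim to each example with probability `rate`."""
    out = []
    for scores, labels in examples:
        if rng.random() < rate:
            scores = np.append(scores, distractor_score)
            labels = np.append(labels, False)
        out.append((scores, labels))
    return out


def empirical_factuality(examples, tau):
    """Fraction of examples whose retained claims (score > tau) are all correct."""
    return float(np.mean([bool(np.all(l[s > tau])) for s, l in examples]))


rng = np.random.default_rng(0)

# Hypothetical clean test set: four claims per example, all of them correct.
clean = [(rng.uniform(0.5, 1.0, 4), np.ones(4, dtype=bool)) for _ in range(1000)]

# Stand-in for a threshold that was calibrated on clean data (assumed value).
tau = 0.6

ef_by_rate = {}
for rate in (0.0, 0.1, 0.25, 0.5):
    noisy = inject_distractors(clean, rate, rng)
    ef_by_rate[rate] = empirical_factuality(noisy, tau)
    print(f"test distractor rate {rate:.2f}: EF = {ef_by_rate[rate]:.3f}")
```

Because the assumed distractor score sits above the fixed threshold, every injected claim survives filtering, and empirical factuality falls roughly in step with the injection rate. Real scorers would reject some distractors, so the sketch shows the worst case of the mechanism rather than the measured curves.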
[Figure 37 plot: panels (a)–(d) show EF at test distractor proportions 0.0, 0.1, 0.25, and 0.5; panel (e) shows power at test distractor proportion 0.1; x-axis: target factuality ($1 - \alpha$), 0.75 to 0.95; curves: model confidence score, document entailment score, conservative entailment score, average entailment score.]

Figure 37: Empirical factuality under varying test distractor proportions (0.0 to 0.5) with fixed clean calibration data on FActScore, using Llama-3.2-3B-Instruct. Factuality degrades sharply as distractor rate increases.

[Figure 38 plot: same panel layout, axes, and curves as Figure 37.]

Figure 38: Empirical factuality under varying test distractor proportions (0.0 to 0.5) with fixed clean calibration data on FActScore, using SmolLM2-1.7B-Instruct. The degradation pattern is consistent with other model families.

MATH-1K dataset. Figures 39–41 extend the distractor robustness analysis to MATH-1K.
Mathematical claims are particularly susceptible to subtle numerical perturbations, and indeed we observe that empirical factuality drops rapidly even at low distractor rates (e.g., 10%). This suggests that distractor robustness is an especially critical concern for reasoning-intensive tasks.

[Figure 39 plot: panels (a)–(d) show EF at test distractor proportions 0.0, 0.1, 0.25, and 0.5; panel (e) shows power at test distractor proportion 0.1; x-axis: target factuality ($1 - \alpha$), 0.75 to 0.95; curves: model confidence score, document entailment score, conservative entailment score, average entailment score.]

Figure 39: Empirical factuality under varying test distractor proportions on MATH-1K with Qwen3-4B. Mathematical claims are especially vulnerable to subtle numerical perturbations.

[Figure 40 plot: same panel layout, axes, and curves as Figure 39.]

Figure 40: Empirical factuality under varying test distractor proportions on MATH-1K with Llama-3.2-3B-Instruct.
[Figure 41 plot: panels (a)–(d) show EF at test distractor proportions 0.0, 0.1, 0.25, and 0.5; panel (e) shows power at test distractor proportion 0.1; x-axis: target factuality ($1 - \alpha$), 0.75 to 0.95; curves: model confidence score, document entailment score, conservative entailment score, average entailment score.]

Figure 41: Empirical factuality under varying test distractor proportions on MATH-1K with SmolLM2-1.7B-Instruct.

NQ-1K dataset. Figures 42–44 complete the analysis on NQ-1K. The overall pattern is consistent across all three datasets: as distractors are injected, empirical factuality drops below the target level, and recovering factuality by raising the target threshold comes at a steep cost in power. This reinforces the main-paper conclusion that developing scoring functions robust to adversarial content is a critical direction for future work.
[Figure 42 plot: panels (a)–(d) show EF at test distractor proportions 0.0, 0.1, 0.25, and 0.5; panel (e) shows power at test distractor proportion 0.1; x-axis: target factuality ($1 - \alpha$), 0.75 to 0.95; curves: model confidence score, document entailment score, conservative entailment score, average entailment score.]

Figure 42: Empirical factuality under varying test distractor proportions on NQ-1K with Qwen3-4B.

[Figure 43 plot: same panel layout, axes, and curves as Figure 42.]

Figure 43: Empirical factuality under varying test distractor proportions on NQ-1K with Llama-3.2-3B-Instruct.
[Plot omitted. Panels (a)–(d): empirical factuality (EF) vs. target factuality (1 − α) at test distractor proportions 0.0, 0.1, 0.25, and 0.5; panel (e): power at test distractor proportion 0.1. Curves: Model Confidence Score, Document Entailment Score, Conservative Entailment Score, Average Entailment Score.]
Figure 44: Empirical factuality under varying test distractor proportions on NQ-1K with SmolLM2-1.7B-Instruct.

A.1.9 Robustness with Adversarial Prepared Calibration

Section 5.3 investigates whether including distractors in the calibration set can restore factuality guarantees when distractors are also present at test time. The main paper presents results for Qwen3-4B on FActScore. Here we provide the complete set of results across all three datasets and three scorer models.

The key question addressed by these experiments is: If we anticipate adversarial distractors during deployment, can we protect the factuality guarantee by injecting synthetic distractors into the calibration set? The answer, as the following figures show, is nuanced: matching the distractor proportion in calibration to that in the test set does restore empirical factuality to the target level, but at a severe cost to the non-empty rate. This trade-off arises because the conformal threshold becomes more stringent when calibration data contain distractors, causing the filter to aggressively remove content, including correct claims.

FActScore dataset. Figures 45 and 46 show the factuality–coverage trade-off for Llama-3.2-3B-Instruct and SmolLM2-1.7B-Instruct on FActScore.
When the calibration distractor proportion is underestimated (panels a), empirical factuality remains below the target. Matching the proportion (panels b) restores the guarantee, and overestimating it (panels c) yields conservative but highly restrictive filtering. Panels (d)–(e) illustrate the non-empty rate cost.

[Plot omitted. Panels (a)–(c): empirical factuality (EF) vs. target factuality (1 − α) for Test: 0.25 with Calib.: 0.1, 0.25, and 0.5; panels (d)–(e): non-empty rate for Test: 0, Calib.: 0 and for Test: 0.25, Calib.: 0.25. Curves: Model Confidence Score, Document Entailment Score, Conservative Entailment Score, Average Entailment Score.]
Figure 45: Empirical factuality and non-empty rates under varying calibration distractor proportions on FActScore with Llama-3.2-3B-Instruct. Matching calibration to test distractor levels restores factuality but reduces the non-empty rate.
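Why distractor-aware calibration makes the threshold more stringent can be seen in a minimal Python sketch. It assumes the standard split-conformal construction for claim filtering: each calibration example contributes the highest score among its false claims, and the threshold is the ⌈(n + 1)(1 − α)⌉-th smallest of these values. The appendix does not spell out the authors' exact procedure, so this construction, the function names, and the toy scores are all assumptions.

```python
import math
from typing import List, Tuple

# Assumption: each claim is a (scorer output, ground-truth correctness) pair.
Claim = Tuple[float, bool]


def nonconformity(claims: List[Claim]) -> float:
    """Highest score among false claims; -inf if the example has none.

    Filtering with `score > value` removes every false claim in the example.
    """
    false_scores = [score for score, correct in claims if not correct]
    return max(false_scores, default=float("-inf"))


def calibrate_threshold(calib: List[List[Claim]], alpha: float) -> float:
    """Split-conformal threshold for target factuality 1 - alpha."""
    r = sorted(nonconformity(claims) for claims in calib)
    n = len(r)
    k = math.ceil((n + 1) * (1 - alpha))
    # If k exceeds n, no finite threshold suffices: filter everything.
    return r[k - 1] if k <= n else float("inf")


# Hypothetical distractor-free calibration set.
calib_clean = [
    [(0.9, True), (0.2, False)],
    [(0.8, True), (0.7, True)],
    [(0.6, True), (0.5, False)],
    [(0.95, True), (0.4, False)],
]
# Distractors (false claims with high scores) injected into two examples.
calib_attacked = (
    [claims + [(0.85, False)] for claims in calib_clean[:2]] + calib_clean[2:]
)

alpha = 0.25  # target factuality 0.75
tau_clean = calibrate_threshold(calib_clean, alpha)
tau_attacked = calibrate_threshold(calib_attacked, alpha)
print(tau_clean, tau_attacked)
```

In this toy example the threshold rises from 0.5 to 0.85 once distractors enter calibration, so correct claims scored 0.6, 0.7, and 0.8 would also be filtered out at test time.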
[Plot omitted. Panels (a)–(c): empirical factuality (EF) vs. target factuality (1 − α) for Test: 0.25 with Calib.: 0.1, 0.25, and 0.5; panels (d)–(e): non-empty rate for Test: 0, Calib.: 0 and for Test: 0.25, Calib.: 0.25. Curves: Model Confidence Score, Document Entailment Score, Conservative Entailment Score, Average Entailment Score.]
Figure 46: Empirical factuality and non-empty rates under varying calibration distractor proportions on FActScore with SmolLM2-1.7B-Instruct.

MATH-1K dataset. Figures 47–49 present the same analysis on MATH-1K. The trade-off is even starker here: the non-empty rate drops precipitously when distractors are introduced into calibration, reflecting the difficulty of distinguishing correct mathematical steps from plausible but incorrect ones.
[Plot omitted. Panels (a)–(c): empirical factuality (EF) vs. target factuality (1 − α) for Test: 0.25 with Calib.: 0.1, 0.25, and 0.5; panels (d)–(e): non-empty rate for Test: 0, Calib.: 0 and for Test: 0.25, Calib.: 0.25. Curves: Model Confidence Score, Document Entailment Score, Conservative Entailment Score, Average Entailment Score.]
Figure 47: Empirical factuality and non-empty rates under varying calibration distractor proportions on MATH-1K with Qwen3-4B.

[Plot omitted. Same panel layout and curves as Figure 47.]
Figure 48: Empirical factuality and non-empty rates under varying calibration distractor proportions on MATH-1K with Llama-3.2-3B-Instruct.
[Plot omitted. Panels (a)–(c): empirical factuality (EF) vs. target factuality (1 − α) for Test: 0.25 with Calib.: 0.1, 0.25, and 0.5; panels (d)–(e): non-empty rate for Test: 0, Calib.: 0 and for Test: 0.25, Calib.: 0.25. Curves: Model Confidence Score, Document Entailment Score, Conservative Entailment Score, Average Entailment Score.]
Figure 49: Empirical factuality and non-empty rates under varying calibration distractor proportions on MATH-1K with SmolLM2-1.7B-Instruct.

NQ-1K dataset. Figures 50–52 complete the analysis on NQ-1K. The overall pattern is consistent: distraction-aware calibration can restore the statistical guarantee, but the resulting outputs are frequently empty, highlighting a fundamental limitation of threshold-based filtering when scoring functions cannot reliably distinguish correct claims from adversarial ones.
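The non-empty rate cost can be made concrete with a short Python sketch. It assumes the non-empty rate is the fraction of outputs that keep at least one claim after filtering with `score > tau`. The function name, scores, and thresholds are hypothetical toy values rather than results from the paper.

```python
from typing import Dict, List


def non_empty_rate(score_lists: List[List[float]], tau: float) -> float:
    """Fraction of outputs with at least one claim scoring strictly above tau."""
    kept = sum(any(score > tau for score in scores) for scores in score_lists)
    return kept / len(score_lists)


# Hypothetical claim scores for four test outputs.
outputs = [[0.9, 0.2], [0.8, 0.7], [0.6, 0.5], [0.95, 0.4]]

# Sweep increasingly stringent thresholds, as a stricter calibration would give.
rates: Dict[float, float] = {
    tau: non_empty_rate(outputs, tau) for tau in (0.5, 0.85, 0.95)
}
print(rates)
```

As the threshold tightens, fewer outputs retain any claim. With these toy scores the rate falls from 1.0 to 0.5 and then to 0.0.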
[Plot omitted. Panels (a)–(c): empirical factuality (EF) vs. target factuality (1 − α) for Test: 0.25 with Calib.: 0.1, 0.25, and 0.5; panels (d)–(e): non-empty rate for Test: 0, Calib.: 0 and for Test: 0.25, Calib.: 0.25. Curves: Model Confidence Score, Document Entailment Score, Conservative Entailment Score, Average Entailment Score.]
Figure 50: Empirical factuality and non-empty rates under varying calibration distractor proportions on NQ-1K with Qwen3-4B.

[Plot omitted. Same panel layout and curves as Figure 50.]
Figure 51: Empirical factuality and non-empty rates under varying calibration distractor proportions on NQ-1K with Llama-3.2-3B-Instruct.
[Plot omitted. Panels (a)–(c): empirical factuality (EF) vs. target factuality (1 − α) for Test: 0.25 with Calib.: 0.1, 0.25, and 0.5; panels (d)–(e): non-empty rate for Test: 0, Calib.: 0 and for Test: 0.25, Calib.: 0.25. Curves: Model Confidence Score, Document Entailment Score, Conservative Entailment Score, Average Entailment Score.]
Figure 52: Empirical factuality and non-empty rates under varying calibration distractor proportions on NQ-1K with SmolLM2-1.7B-Instruct.

A.2 Distractor Generation Protocol

A core assumption of the conformal factuality framework is that the test data are exchangeable with the calibration dataset. In practice, however, this assumption may be mildly violated, for example when LLM outputs contain hallucinated claims that were not represented in calibration. To stress-test the robustness of scoring functions under such conditions (Section 5.2), we construct adversarial distractor claims as follows. For each query x, the corresponding reference text R(x), and the set of parsed claims {c_i}, we prompt the LLM to modify each c_i (conditioned on x and R(x)) so that the result resembles a plausible hallucination. The full prompt used for this generation step is provided in Appendix A.4.6. Crucially, we want these distractor claims to be sufficiently convincing that the model itself would believe it generated them.
To enforce this, we apply a verification step: after generating a hallucinated claim, we prompt the LLM to judge whether the claim could plausibly be one of its own outputs given x and R(x) (prompt in Appendix A.4.7). If the model identifies the claim as a plausible self-generated hallucination, we retain it; otherwise, we regenerate until a sufficiently convincing distractor is found. This two-stage process ensures that the adversarial distractors are realistic and relevant to the failure modes of modern LLMs.

A.3 Human Evaluation

To validate the use of gpt-5-nano as a factuality judge (Section 2.3), we conducted a human evaluation study. We randomly sampled 200 claims from the FActScore dataset and had two graduate students (referred to as A and B) independently label the factuality of each claim. We also obtained labels from gpt-5-nano on the same set. Table 3 reports the pairwise agreement rates. The agreement between gpt-5-nano and each human annotator (76.5% and 77.0%) is comparable to, and slightly exceeds, the inter-annotator agreement between the two humans (73.0%). This result supports the use of gpt-5-nano as a factuality judge in our experimental pipeline: its agreement with each annotator is on par with the annotators' agreement with each other.

Pair          Labelers                   Agreement Rate
Model–Human   gpt-5-nano vs. Student A   76.5%
Model–Human   gpt-5-nano vs. Student B   77.0%
Human–Human   Student A vs. Student B    73.0%

Table 3: Agreement rates on factuality labels between gpt-5-nano and two human annotators, as well as between the two human annotators themselves. The model–human agreement is comparable to inter-human agreement.

A.4 Prompts

This section collects all prompts used throughout our experiments.
We organize them by their role in the pipeline: generation (Sections A.4.1 and A.4.2), claim parsing (Section A.4.3), factuality labeling (Sections A.4.4 and A.4.5), adversarial distractor generation and verification (Sections A.4.6 and A.4.7), model confidence scoring (Section A.4.8), claim merging (Sections A.4.9–A.4.11), correctness evaluation (Section A.4.12), sufficient correctness evaluation (Section A.4.13), and MATH reference generation (Section A.4.14). All prompts instruct the model to return structured JSON5 output to facilitate automated parsing.

A.4.1 Generator (with reference)

You are a helpful assistant that answers queries strictly based on the provided reference text.

Instructions:
1. You will be given:
   - A reference text
   - A query
2. Use only the information from the reference text to answer the query.
3. Do not include any information not supported by the reference text.

Output Requirements:
- Output ONLY a single VALID JSON5 object with EXACTLY this schema:
{
  "response": "...answer strictly based on the reference text..."
}

JSON5 Rules:
- Use DOUBLE QUOTES (") for all keys and all string values.
- Escape double quotes inside string values as \".
- Escape backslashes as \\.
- No trailing commas in objects or arrays.
- Use the exact top-level container specified and close it properly.
- Do not include comments, code fences, or any text outside the JSON5 output.
- Follow the schema exactly; do not add or omit keys.

Do NOT include:
- Any text, explanations, comments, or formatting outside of the JSON5.
- Any code block delimiters (e.g., ``` json).

Input:
Reference Text: {reference}
Query: {query}

(Reiteration of the instruction)
Answer the query strictly using only the reference text, and return a single JSON5 object with the key "response" only.

Output:

A.4.2 Generator (without reference)

You are a helpful assistant that answers queries.

Instructions:
1. You will be given:
   - A query

Output Requirements:
- Output ONLY a single VALID JSON5 object with EXACTLY this schema:
{
  "response": "...answer..."
}

JSON5 Rules:
- Use DOUBLE QUOTES (") for all keys and all string values.
- Escape double quotes inside string values as \".
- Escape backslashes as \\.
- No trailing commas in objects or arrays.
- Use the exact top-level container specified and close it properly.
- Do not include comments, code fences, or any text outside the JSON5 output.
- Follow the schema exactly; do not add or omit keys.

Do NOT include:
- Any text, explanations, comments, or formatting outside of the JSON5.
- Any code block delimiters (e.g., ``` json).

Input:
Query: {query}

(Reiteration of the instruction)
Answer the query and return a single JSON5 object with the key "response" only.

Output:

A.4.3 Parser

You are an AI assistant tasked with breaking down input text into small, self-contained claims for easy human verification.

Instructions:
1. Parse the provided text into concise, independent, and non-overlapping subclaims.
2. Ensure each subclaim is:
   - As small and specific as possible.
   - Independent and self-contained.
   - Do not use pronouns like he, she, his, her, it, its, etc.
   - Explicitly mention subjects.
   - Factually complete without relying on context from other subclaims.
3. If the provided text is not a full sentence, use the provided text verbatim as the subclaim

Output Requirements:
1. The result must be a VALID and COMPLETE JSON list of dictionaries.
2. Each dictionary must have the following structure:
{
  "subclaim": "Subclaim text"
}

JSON Rules:
- Ensure the JSON is STRICTLY VALID:
  - Use DOUBLE QUOTES ("") for all keys and string values.
  - DO NOT include trailing commas after the LAST item in arrays or objects.
  - Ensure ALL dictionaries are enclosed in curly braces {}.
  - Ensure the JSON list is ENCLOSED in square brackets [].
  - CLOSE the JSON list properly with a closing square bracket ].
- DO NOT include any code block delimiters.
- DO NOT include any additional text, explanations, or comments in the output—return ONLY the JSON list.

Critical Requirement:
- Validate and ensure the JSON is complete, properly CLOSED, and parseable. The final JSON MUST pass validation without errors.

(Reiteration of the instruction)
1. Parse the provided text into concise, independent, and non-overlapping subclaims.
2. Ensure each subclaim is:
   - As small and specific as possible.
   - Independent and self-contained.
   - Do not use pronouns like he, she, his, her, it, its, etc.
   - Explicitly mention subjects.
   - Factually complete without relying on context from other subclaims.
3. If the provided text is not a full sentence, use the provided text verbatim as the subclaim

Input:
{input}

Output:

A.4.4 Labeler (without ground-truth answer)

You are an AI assistant tasked with assigning a label to a claim based on its factuality.

Instructions:
1. You are given:
   - A reference text for the query.
   - A query.
   - A claim made in response to the query.
2. Rate the factuality of the claim as a boolean:
   - true -> The claim is well-supported by the reference text.
   - false -> The claim contradicts the reference text, or is not well-supported by it.
   - If it is ambiguous, answer False.
3. Provide:
   - The part(s) of the reference text that directly support your decision.
   - A reasoning statement describing your rationale.
4. You must assign either true or false. Never return null or None.

Output Requirements:
- Output ONLY a single VALID JSON5 object with EXACTLY these keys:
{
  "highlighted_text": "Part(s) of the reference text that support the decision.",
  "reasoning": "A reasoning statement describing your rationale.",
  "answer": true
}
- "answer" must be a boolean (true/false).

JSON5 Rules:
- Use DOUBLE QUOTES (") for all keys and all string values.
- Escape double quotes inside string values as \".
- Escape backslashes as \\.
- No trailing commas in objects or arrays.
- Use the exact top-level container specified and close it properly.
- Do not include comments, code fences, or any text outside the JSON5 output.
- Follow the schema exactly; do not add or omit keys.

Do NOT include:
- Any text, explanations, comments, or formatting outside of the JSON5.
- Any code block delimiters (e.g., ``` json).

Examples:

Example Input:
Reference Text: Michael Scott is a fictional character in the NBC sitcom The Office, portrayed by Steve Carell. Michael is the regional manager of the Scranton, Pennsylvania branch of Dunder Mifflin, a paper company, for the majority of the series. Like his counterpart in the original British version of the show, David Brent, he is characterized as a largely incompetent, unproductive, unprofessional boss, though he is depicted as kinder and occasionally shown to be effective at his job in key moments.
Query: Tell me a paragraph bio of Michael Scott.
Claim: The fictional character Michael Scott is the regional manager of a paper company.

Example Output:
{
  "highlighted_text": "Michael Scott is a fictional character in the NBC sitcom The Office, portrayed by Steve Carell. Michael is the regional manager of the Scranton, Pennsylvania branch of Dunder Mifflin, a paper company, for the majority of the series.",
  "reasoning": "Reference explicitly states Michael is regional manager at a paper company.",
  "answer": true
}

Example Input:
Reference Text: Michael Scott is a fictional character in the NBC sitcom The Office, portrayed by Steve Carell. Michael is the regional manager of the Scranton, Pennsylvania branch of Dunder Mifflin, a paper company, for the majority of the series. Like his counterpart in the original British version of the show, David Brent, he is characterized as a largely incompetent, unproductive, unprofessional boss, though he is depicted as kinder and occasionally shown to be effective at his job in key moments.
Query: Tell me a paragraph bio of Michael Scott.
Claim: The portrayal of Michael Scott in the NBC sitcom The Office is similar to that of David Brent, in the British version.

Example Output:
{
  "highlighted_text": "Like his counterpart in the original British version of the show, David Brent, he is characterized as a largely incompetent, unproductive, unprofessional boss, though he is depicted as kinder and occasionally shown to be effective at his job in key moments.",
  "reasoning": "Reference compares Michael Scott's characterization to David Brent's, indicating similarity.",
  "answer": true
}

Example Input:
Reference Text: Michael Scott is a fictional character in the NBC sitcom The Office, portrayed by Steve Carell. Michael is the regional manager of the Scranton, Pennsylvania branch of Dunder Mifflin, a paper company, for the majority of the series. Like his counterpart in the original British version of the show, David Brent, he is characterized as a largely incompetent, unproductive, unprofessional boss, though he is depicted as kinder and occasionally shown to be effective at his job in key moments.
Query: Tell me a paragraph bio of Michael Scott.
Claim: Michael Scott is the founder of The Michael Scott Paper Company.

Example Output:
{
  "highlighted_text": "Michael is the regional manager of the Scranton, Pennsylvania branch of Dunder Mifflin, a paper company, for the majority of the series.",
  "reasoning": "Reference mentions only Dunder Mifflin; founding another company is not supported.",
  "answer": false
}

Example Input:
Reference Text: Michael Scott is a fictional character in the NBC sitcom The Office, portrayed by Steve Carell. Michael is the regional manager of the Scranton, Pennsylvania branch of Dunder Mifflin, a paper company, for the majority of the series. Like his counterpart in the original British version of the show, David Brent, he is characterized as a largely incompetent, unproductive, unprofessional boss, though he is depicted as kinder and occasionally shown to be effective at his job in key moments.
Query: Tell me a paragraph bio of Michael Scott.
Claim: Michael Scott is the CEO of Dunder Mifflin.

Example Output:
{
  "highlighted_text": "Michael is the regional manager of the Scranton, Pennsylvania branch of Dunder Mifflin, a paper company, for the majority of the series.",
  "reasoning": "Reference states regional manager, not CEO.",
  "answer": false
}

Input:
Reference Text: {reference}
Query: {query}
Claim: {claim}

(Reiteration of the instruction)
Return a single JSON5 object with "highlighted_text", "reasoning", and "answer" (true if supported; false if contradicted or unsupported). Assign true or false—never null/None.

Output:

A.4.5 Labeler (with ground-truth answer)

You are an AI assistant tasked with assigning a label to a claim based on its factuality.

Instructions:
1. You are given:
   - A reference text for the query.
   - A query.
   - A provided solution (final answer)
   - A claim made in response to the query.
2. Rate the factuality of the claim as a boolean:
   - true → The claim is well-supported by the reference text or match the given provided solution (final answer).
   - false → The claim contradicts the reference text or is not well-supported by it and contradicts the provided solution (final answer).
3. Provide:
   - The part(s) of the reference text or the provided solution that directly support your decision.
   - A reasoning statement describing your rationale.
4. You must assign either true or false. Never return null or None.

Output Requirements:
- Output ONLY a single VALID JSON5 object with EXACTLY this schema:
{
  "highlighted_text": "Part(s) of the reference text or the provided solution that directly support the decision.",
  "reasoning": "A reasoning statement describing your rationale.",
  "answer": true
}
- "answer" must be a boolean (true/false).

JSON5 Rules:
- Use DOUBLE QUOTES (") for all keys and all string values.
- Escape double quotes inside string values as \".
- Escape backslashes as \\.
- No trailing commas in objects or arrays.
- Use the exact top-level container specified and close it properly.
- Do not include comments, code fences, or any text outside the JSON5 output.
- Follow the schema exactly; do not add or omit keys.

Do NOT include:
- Any text, explanations, comments, or formatting outside of the JSON5.
- Any code block delimiters (e.g., ``` json).

Examples:

Example Input:
Reference Text: "Paris is the capital of France."
Answer: "Paris"
Query: "What is the capital of France?"
Claim: "Paris is the capital of France."

Example Output:
{
  "reasoning": "Claim matches reference and provided solution.",
  "answer": true
}

Example Input:
Reference Text: "Michael is the regional manager of Dunder Mifflin."
Answer: "Regional manager"
Query: "What is Michael's role at Dunder Mifflin?"
Claim: "Michael is the CEO of Dunder Mifflin."

Example Output:
{
  "reasoning": "Claim contradicts the reference and solution.",
  "answer": false
}

Example Input:
Reference Text: "Compute 12 × 3 + 4."
Answer: "40"
Query: "Evaluate 12 × 3 + 4."
Claim: "The value is 40."

Example Output:
{
  "reasoning": "Computation equals 40; claim matches the solution.",
  "answer": true
}

Example Input:
Reference Text: "Compute 12 × 3 + 4."
Answer: "40"
Query: "Evaluate 12 × 3 + 4."
Claim: "The value is 42."

Example Output:
{
  "reasoning": "The solution gives 40, not 42.",
  "answer": false
}

Example Input:
Reference Text: "The sum of the interior angles of a triangle is 180 degrees."
Answer: "180 degrees"
Query: "What is the sum of the interior angles of a triangle?"
Claim: "The sum is 180 degrees."

Example Output:
{
  "reasoning": "Claim agrees with the stated geometric fact and the solution.",
  "answer": true
}

Input:
Reference Text: {reference}
Answer: {answer}
Query: {query}
Claim: {claim}

(Reiteration of the instruction)
Return a single JSON5 object with "highlighted_text", "reasoning", and "answer" (true if supported; false if contradicted or unsupported). Assign true or false—never null/None.

Output:

A.4.6 Attacker

You are a helpful assistant that modifies a correct claim into a false but plausible claim ("hallucination") given a reference text and a query.

Instructions:
1. Read the reference text, the query, the original claim, and the two lists of prior modified claims (that do not work / that do work).
2. Generate exactly one false claim that:
   - Directly answers the query.
   - Contradicts or is not supported by the reference text.
   - Appears plausible given the reference and query (e.g., near-miss number/date, closely related entity, role swap).
   - Does not match or paraphrase any claim in either provided list.
3. Provide a reasoning statement describing why the false claim might look plausible.

Output Requirements:
- Output ONLY a single VALID JSON5 object with EXACTLY this schema:
{
  "reasoning": "A reasoning statement describing why the false claim might look plausible.",
  "subclaim": "One-sentence false but plausible answer to the query."
}
- The "subclaim" must be a standalone sentence that answers the query and is false with respect to the reference.

JSON5 Rules:
- Use DOUBLE QUOTES (") for all keys and all string values.
- Escape double quotes inside string values as \".
- Escape backslashes as \\.
- No trailing commas in objects or arrays.
- Use the exact top-level container specified and close it properly.
- Do not include comments, code fences, or any text outside the JSON5 output.
- Follow the schema exactly; do not add or omit keys.

Do NOT include:
- Any text, explanations, comments, or formatting outside of the JSON5.
- Any code block delimiters (e.g., ``` json).

Examples:

Example Input:
Modified claims that do not work: []
Modified claims that do work: []
Reference Text: "Gustave Eiffel oversaw the construction of the Eiffel Tower, which opened in 1889. The tower's design is credited to Gustave Eiffel."
Query: "Who designed the Eiffel Tower?"
Claim: "Gustave Eiffel designed the Eiffel Tower."

Example Output:
{
  "reasoning": "Reference credits Eiffel; another associated architect name may look credible.",
  "subclaim": "Stephen Sauvestre designed the Eiffel Tower."
}

Example Input:
Modified claims that do not work: []
Modified claims that do work: []
Reference Text: "The capital of Australia is Canberra."
Query: "What is the capital of Australia?"
Claim: "Canberra."

Example Output:
{
  "reasoning": "Larger city misconception is common.",
  "subclaim": "Sydney is the capital of Australia."
}

Input:
Modified claims that do not work: {claims_not_work}
Modified claims that do work: {claims_work}
Reference Text: {reference}
Query: {query}
Claim: {claim}

(Reiteration of the instruction)
Generate exactly one false but plausible claim answering the query, unsupported by or contradicting the reference, not overlapping with provided lists; then return a JSON5 object with "reasoning" and "subclaim".

Output:

A.4.7 Confusee

You are a helpful assistant that evaluates whether a claim sounds like something you (an LLM) would plausibly say—and potentially hallucinate—based on the provided reference text and query.

Instructions:
1. Read the reference text, the query, and the claim.
2. Decide whether the claim is a plausible LLM-style statement that could arise from the reference and query even if it is not supported or is false.
3. Return true if the claim is plausibly hallucinated; otherwise return false.
4. Provide a reasoning statement describing your rationale.

Output Requirements:
- Output ONLY a single VALID JSON5 object with EXACTLY this schema:
{
  "reasoning": "A reasoning statement describing your rationale.",
  "answer": (true|false)
}
- "answer" must be a boolean (true/false).

JSON5 Rules:
- Use DOUBLE QUOTES (") for all keys and all string values.
- Escape double quotes inside string values as \".
- Escape backslashes as \\.
- No trailing commas in objects or arrays.
- Use the exact top-level container specified and close it properly.
- Do not include comments, code fences, or any text outside the JSON5 output.
- Follow the schema exactly; do not add or omit keys.

Do NOT include:
- Any text, explanations, comments, or formatting outside of the JSON5.
- Any code block delimiters (e.g., ``` json).

Examples:

Example Input:
Reference Text: "The capital of Australia is Canberra."
Query: "What is the capital of Australia?"
Claim: "Sydney is the capital of Australia."
Example Output:
{
  "reasoning": "Common misconception favors the largest city over the capital.",
  "answer": true
}

Example Input:
Reference Text: "Paris is the capital of France."
Query: "What is the capital of France?"
Claim: "Paris is the capital of France."

Example Output:
{
  "reasoning": "Claim is directly supported by the reference; not a hallucination.",
  "answer": false
}

Input:
Reference Text: {reference}
Query: {query}
Claim: {claim}

(Reiteration of the instruction) Return a JSON5 object with "reasoning" and "answer" (true if plausibly hallucinated; else false).

Output:

A.4.8 Model Confidence Score

You are an AI assistant tasked with assigning a confidence score to a claim based on its factuality.

Instructions:
1. You are given:
- A reference text for the query.
- A query.
- A claim made in response to the query.
2. Rate the factuality of the claim with a numeric score in [0.0, 1.0]:
- 0.0 -> The claim is false, contradicts the reference, or is not well-supported.
- 1.0 -> The claim is factual and well-supported by the reference, or can be logically deduced from the reference and query.
3. Provide:
- A score.
- The parts of the reference text that directly support your scoring decision.
- A reasoning statement describing your rationale.
4. You must assign a numeric score. Never return null, None, or a non-numeric value.

Important: A claim should be considered true (score 1.0) if it is either:
- Directly stated in the reference text, OR
- Can be logically deduced or calculated from the information provided in the reference text and query.
For mathematical claims, perform the necessary calculations based on the given data.
Output Requirements:
- Output ONLY a single VALID JSON5 object with EXACTLY these keys:
{
  "highlighted_text": "Part(s) of the reference text that support the decision.",
  "reasoning": "A reasoning statement describing your rationale.",
  "score": 0.0-1.0
}

JSON5 Rules:
- Use DOUBLE QUOTES (") for all keys and all string values.
- Escape double quotes inside string values as \".
- Escape backslashes as \\\.
- No trailing commas in objects or arrays.
- Follow the schema exactly.

Do NOT include:
- Any text, explanations, comments, or formatting outside of the JSON5.

Input:
Reference Text: {reference}
Query: {query}
Claim: {claim}

Reiteration of Instructions: Return a single JSON5 object with "highlighted_text", "reasoning", "score". Assign a numeric score—never null/None.

Output:

A.4.9 Merger (FActScore)

You will get an instruction and a set of facts that are true. Construct an answer using ONLY the facts provided, and try to use all facts as long as its possible. If the input facts are empty, output the empty string. Do not repeat the instruction.

Input:
The facts: {claims}
The instruction: {query}

Remember, If the input facts are empty, output the empty string. Do not repeat the instruction.

Output:

A.4.10 Merger (Natural Questions)

You will get a natural question and parts of an answer, which you are to merge into coherent prose. Make sure to include all the parts in the answer. There may be parts that are seemingly unrelated to the others, but DO NOT add additional information or reasoning to merge them. If the input parts are empty, output the empty string. Do not repeat the question.

Input:
The parts: {claims}
The question: {query}

Remember, DO NOT add any additional information or commentary, just combine the parts. If the input parts are empty, output the empty string. Do not repeat the question.
Output:

A.4.11 Merger (MATH)

You will get a math problem and a set of steps that are true. Construct an answer using ONLY the steps provided. Make sure to include all the steps in the answer, and do not add any additional steps or reasoning. These steps may not fully solve the problem, but merging them could assist someone in solving the problem. If the input steps are empty, output the empty string. Do not repeat the math problem.

Input:
The steps: {claims}
The math problem: {query}

Remember, do not do any additional reasoning, just combine the given steps. If the input steps are empty, output the empty string. Do not repeat the math problem.

Output:

A.4.12 Correctness

I need your help in evaluating an answer provided by an LLM against ground truth answers. Your task is to determine if the LLM's response matches the ground truth answers. Please analyze the provided data and make a decision.

Instructions:
1. Carefully compare the "Predicted Answer" with the "Ground Truth Answers".
2. Consider the substance of the answers – look for equivalent information or correct answers. Do not focus on exact wording unless the exact wording is crucial to the meaning.
3. Your final decision should be based on whether the meaning and the vital facts of the "Ground Truth Answers" are present in the "Predicted Answer."
4. Categorize the answer as one of the following:
- "perfect": The answer is completely correct and matches the ground truth.
- "acceptable": The answer is partially correct or contains the main idea of the ground truth.
- "incorrect": The answer is wrong or contradicts the ground truth.
- "missing": The answer is "I don't know", "invalid question", or similar responses indicating lack of knowledge.

Output Requirements:
- Output ONLY a single VALID JSON5 object with EXACTLY this schema:
{
  "reasoning": "A reasoning statement describing your rationale.",
  "answer": "One of perfect, acceptable, incorrect, or missing"
}

JSON5 Rules:
- Use DOUBLE QUOTES (") for all keys and all string values.
- Escape double quotes inside string values as \".
- Escape backslashes as \\.
- No trailing commas in objects or arrays.
- Use the exact top-level container specified and close it properly.
- Do not include comments, code fences, or any text outside the JSON5 output.
- Follow the schema exactly; do not add or omit keys.

Do NOT include:
- Any text, explanations, comments, or formatting outside of the JSON5.
- Any code block delimiters (e.g., ```json).

Input:
Query: {query}
Predicted Answer: {merged_string}
Ground Truth Answer: {answer}

Reiteration of Instructions (before output):
1. Carefully compare the "Predicted Answer" with the "Ground Truth Answers".
2. Consider the substance of the answers – look for equivalent information or correct answers. Do not focus on exact wording unless the exact wording is crucial to the meaning.
3. Your final decision should be based on whether the meaning and the vital facts of the "Ground Truth Answers" are present in the "Predicted Answer."
4. Categorize the answer as one of the following:
- "perfect": The answer is completely correct and matches the ground truth.
- "acceptable": The answer is partially correct or contains the main idea of the ground truth.
- "incorrect": The answer is wrong or contradicts the ground truth.
- "missing": The answer is "I don't know", "invalid question", or similar responses indicating lack of knowledge.

Output:

A.4.13 Sufficient Correctness

You are an expert LLM evaluator that excels at evaluating a RESPONSE with respect to a QUERY given a REFERENCE.

Consider the following criteria:
Sufficient Correctness:
1 IF the RESPONSE contains a sufficient amount of CORRECT information (verified against the REFERENCE) to infer the answer to the QUERY.
0 IF the RESPONSE does not contain a sufficient amount of CORRECT information to infer the answer to the QUERY.

Important: Judge only the correctness of the RESPONSE content according to the REFERENCE. Do not infer correctness from external knowledge.

Output ONLY a single VALID JSON5 object with EXACTLY these keys:
{
  "explanation": An explanation describing your rationale, with with you will make your decision on sufficient_correctness,
  "sufficient_correctness": 1 IF the RESPONSE contains a sufficient amount of correct information (verified against the REFERENCE) to infer the answer to the QUERY, 0 IF the RESPONSE does not contain a sufficient amount of correct information to infer the answer to the QUERY.
}
- "sufficient_correctness" must be an integer (0/1).

JSON5 Rules:
- Use DOUBLE QUOTES (") for all keys and all string values.
- Escape double quotes inside string values as \".
- Escape backslashes as \\.
- No trailing commas in objects or arrays.
- Use the exact top-level container specified and close it properly.
- Do not include comments, code fences, or any text outside the JSON5 output.
- Follow the schema exactly; do not add or omit keys.

Do NOT include:
- Any text, explanations, comments, or formatting outside of the JSON5.
- Any code block delimiters (e.g., ```json).

Input:
### QUERY
{query}
### REFERENCE
{reference}
### RESPONSE
{response}

Output:

A.4.14 MATH reference

You are a helpful assistant that extracts the prerequisite mathematics knowledge needed to answer a given question—without solving it.

Instructions:
1. Read the question.
2. Identify only the minimal prerequisite items across concepts, definitions, theorems/properties, formulas, techniques, notation, assumptions/conditions, and common pitfalls required to answer the question.
3. When you state a theorem, a definition, or a formula, write out its full, standard statement (not just the name). For theorems, include hypotheses and conclusions; for definitions, give the precise meaning; for formulas, write the exact equation(s) in standard notation.
4. Do NOT provide the answer or partial solution steps.
5. Do NOT use examples that arise directly from the given question; keep statements general and problem-agnostic.

Output Requirements:
- Produce PLAIN TEXT ONLY.
- Write free-form prose (no headings, lists, or numbering).
- Mention required topics, definitions, fully written theorems/properties, formulas, techniques, notation conventions, assumptions/conditions, and common pitfalls in sentences.
- The output may be long; include complete statements where needed.

Do NOT include:
- Any answer, hints, or step-by-step solution.
- Any examples derived from the given question.

Reiteration of the instructions:
1. Read the question.
2. Identify only the minimal prerequisite items across concepts, definitions, theorems/properties, formulas, techniques, notation, assumptions/conditions, and common pitfalls required to answer the question.
3. When you state a theorem, a definition, or a formula, write out its full, standard statement (not just the name). For theorems, include hypotheses and conclusions; for definitions, give the precise meaning; for formulas, write the exact equation(s) in standard notation.
4. Do NOT provide the answer or partial solution steps.
5. Do NOT use examples that arise directly from the given question; keep statements general and problem-agnostic.

Input:
{query}

Output:
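
As an illustration of how the templates above might be used in practice, the following minimal sketch fills the three placeholders of the Model Confidence Score prompt (A.4.8) and validates the reply against its stated schema. It is not the authors' implementation. The names `SCORER_INPUT_TEMPLATE`, `build_scorer_input`, and `parse_scorer_output` are hypothetical, the template string is an abbreviated stand-in for the full prompt, and the sketch assumes that a reply obeying the prompt's rules (double-quoted keys, no comments, no trailing commas) can be read by Python's standard `json` module rather than a dedicated JSON5 parser.

```python
import json

# Hypothetical, abbreviated stand-in for the input section of prompt A.4.8.
# In real use the full instruction text would precede it.
SCORER_INPUT_TEMPLATE = (
    "Input:\n"
    "Reference Text: {reference}\n"
    "Query: {query}\n"
    "Claim: {claim}\n"
    "Output:"
)

# Keys required by the A.4.8 output schema.
REQUIRED_KEYS = {"highlighted_text", "reasoning", "score"}


def build_scorer_input(reference: str, query: str, claim: str) -> str:
    """Fill the {reference}, {query}, and {claim} placeholders."""
    return SCORER_INPUT_TEMPLATE.format(
        reference=reference, query=query, claim=claim
    )


def parse_scorer_output(raw: str) -> float:
    """Return the score from a scorer reply, or raise ValueError.

    Assumption: the reply is strict JSON, which holds whenever the model
    follows the prompt's quoting and trailing-comma rules.
    """
    obj = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(obj, dict) or set(obj) != REQUIRED_KEYS:
        raise ValueError("reply does not match the expected schema")
    score = obj["score"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be numeric")
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0.0, 1.0]")
    return float(score)
```

Rejecting a null or out-of-range score at parse time mirrors the prompt's requirement that a numeric score in [0.0, 1.0] always be assigned. How such failures are handled downstream (for example, retrying the query or assigning a default score) is left open here.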
