Conformal Tradeoffs: Operational Profiles Beyond Coverage


Authors: Petrus H. Zwart

Petrus H. Zwart
PHZwart@lbl.gov
Center for Advanced Mathematics in Energy Research Applications, Berkeley Synchrotron Infrared Structural Biology Program, & Molecular Biophysics and Integrated Bioimaging Division, 1 Cyclotron Road, Berkeley, CA 94720, USA

Abstract

Deployed conformal predictors are long-lived decision infrastructure operating over finite operational windows. The real-world question is not merely "Does the true label lie in the prediction set at the target rate?", i.e. marginal coverage, but "How often does the deployed system commit versus defer? What error exposure does it induce when it acts? How do these operational rates trade off?" Marginal coverage does not summarize these deployment-facing quantities: the same calibrated thresholds can yield very different operational profiles depending on score geometry. This paper provides a framework for operational certification and planning beyond coverage. We make three contributions. (1) Small-Sample Beta Correction (SSBC): We invert the exact finite-sample Beta/rank law for split conformal to map a user coverage request (α⋆, δ) into a concrete calibrated grid point with PAC-style semantics, providing explicit finite-window coverage guarantees for the deployed rule. (2) Calibrate-and-Audit: Because no distribution-free pivot exists for rates beyond coverage, we introduce a two-stage design: an independent audit set yields a reusable region–label table that supports certified finite-window predictive envelopes (Binomial/Beta–Binomial) for operational quantities—commitment frequency, deferral, decisive error exposure, commit purity—or related quantities via linear projection, without committing to a specified scalar objective. (3) Geometric characterization: We expose the feasibility constraints, regime boundaries (hedging vs.
rejection), and cost-coherence conditions induced by a fixed conformal partition, explaining why operational rates are coupled and how calibration navigation trades them off. The output is an auditable operational menu: given a scoring model, the framework traces the Pareto frontier of attainable operational profiles across calibration settings and attaches finite-window uncertainty envelopes to each regime. We demonstrate the approach on Tox21 toxicity prediction (12 endpoints) and aqueous solubility screening of lipophilic drug candidates via the AquaSolDB dataset.

1 Introduction: deployment-facing conformal prediction

Many classifiers are deployed as long-lived decision infrastructure rather than one-off prediction engines. A single trained scoring model is integrated into downstream workflows and reused over finite operational windows. In this regime, the deployed object is not the score function alone, but the rule: (scoring model, one-time calibration procedure, and downstream action convention). The observables that matter to stakeholders are finite-window operational rates: how often the system commits versus defers/abstains, and—conditional on commitment—the decisive error exposure and purity induced by attached actions (cf. rejection options and selective classification (Chow, 1970)). These realized rates drive resource planning, safety and compliance, and near-term impact. Conformal prediction (Vovk et al., 2005; Shafer & Vovk, 2008) is a natural starting point for deployment-facing calibration: it sets score thresholds on held-out data and yields finite-sample, distribution-free coverage under exchangeability (Papadopoulos et al., 2002). Coverage, however, is not an operational specification.
Once a deployment convention is fixed (e.g., "commit on singletons, defer on hedges or empty sets"), the realized operational profile depends on how calibration choices partition score space and how mass and labels populate the induced regions. As a result, two conformal rules with identical marginal coverage can behave very differently in deployment.

1.1 The gap: coverage does not determine operational behavior

Standard conformal prediction guarantees that the true label lies in the prediction set at the target coverage rate (under exchangeability). But coverage alone does not determine the operational profile of a deployed conformal rule. In particular, different calibrated predictors with the same nominal coverage can exhibit markedly different:

• Commitment vs. deferral: the frequency of singleton outputs (and thus decisive actions) versus hedges/abstentions.
• Decisive error exposure: the error rate among committed predictions, which is the error that "escapes" deferral mechanisms.
• Trade-off coupling: improving one operational rate (e.g., reducing deferral) may force degradation in another (e.g., increasing decisive error), with couplings determined by the score distribution geometry rather than by coverage alone.

These are first-order questions for operational readiness and trust calibration, yet they are not easily answered by coverage certificates alone.

1.2 Our approach: calibrate, expose geometry, and audit once

We study a single deployed split-conformal rule through a calibration-conditional, certification-style lens. A calibration choice fixes cutpoints on the score axis, inducing a finite partition of score/output space into regions with set-valued outcomes (e.g., singleton, hedge, abstain).
The key device in this paper is to make this intermediate geometry explicit and auditable: once the region partition is fixed, many policy- and Key Performance Indicator (KPI)-specific rates are obtained by simple projection from the joint region–label masses. This supports trade-off analysis across calibration settings and enables distribution-free uncertainty statements about future realized finite-window rates for the deployed rule. Figure 1 previews the main object produced by this paper: an auditable operational menu mapping calibration settings to certified finite-window operational profiles, together with a Pareto filter that highlights nondominated operating regimes.

1.3 Contributions

This paper makes operational quantities first-class objects of certification and exploration for deployed conformal predictors. Our contributions are:

(1) Coverage semantics via Small-Sample Beta Correction (SSBC). We invert the exact finite-sample rank/Beta law for split conformal (Marques, 2025) to map a user request (α⋆, δ) (e.g., "90% coverage with 90% confidence") to the least conservative split-conformal grid point satisfying a PAC-style tail constraint:

P_{D_cal}( P( Y ∈ Ĉ(X) | D_cal ) ≥ 1 − α⋆ ) ≥ 1 − δ.

This provides an explicit, auditable finite-sample coverage guarantee for the deployed rule and serves as a semantic anchor for subsequent operational navigation. In the binary class-conditional case, SSBC collapses a four-dimensional user specification into a two-dimensional calibration coordinate, simplifying trade-off exploration.

(2) Calibrate-and-Audit for operational certification beyond coverage. Operational rates such as commitment, deferral, and decisive error exposure are not rank-determined and therefore do not admit a conformal "coverage-style" pivot. We introduce a two-stage design:

Figure 1: Operational menu induced by a calibrated conformal rule on a synthetic model.
Left: a calibration choice fixes score cutpoints that partition score space into set-valued outcomes (e.g., singleton, hedge, abstain) and associated region–label probabilities. A deployment convention determines which outcomes trigger commitment versus deferral. Right: sweeping calibration settings traces the attainable set of selected operational rates; the oriented Pareto frontier highlights nondominated operating regimes.

• Calibrate: fix thresholds on D_cal, inducing a region partition R_τ.
• Audit: use an independent exchangeable set D_audit to estimate the joint region–label table {p_{r,y}}.

The joint table is a reusable sufficient statistic: a wide class of operational KPIs are linear projections (or monotone transforms) of {p_{r,y}}. Via an induced Binomial/Beta–Binomial sampling model of events of interest, this yields exact finite-window predictive envelopes for future realized operational rates, enabling certified policy comparison across many calibration settings without re-auditing. We also provide a conservative leave-one-out proxy for single-sample settings (Appendix D).

(3) Geometry and feasibility of operational trade-offs. We expose how a fixed conformal partition couples attainable operational rates. In the binary probability-normalized case:

• Regime boundaries: the sum τ_0 + τ_1 determines whether the system can hedge (τ_0 + τ_1 > 1) or must reject (τ_0 + τ_1 < 1), inducing sharp transitions in attainable profiles (Section 4).
• Conservation constraints: varying thresholds reallocates mass across regions, explaining structural trade-offs between commitment and decisive error exposure.
• Cost coherence: we derive compatibility conditions between region-based action conventions and scalar cost models that are linear transformations of the auditable primitives, in advance of any cost model specification (Appendix F).
Together, these results explain why some desired operational profiles are infeasible, and provide guidance for practitioners navigating the trade-offs induced by conformal calibration choices.

1.4 Positioning and related work

Conformal prediction constructs set-valued predictors with finite-sample coverage guarantees under exchangeability (Vovk et al., 2005; Shafer & Vovk, 2008; Angelopoulos & Bates, 2023). Conformal risk control (CRC) generalizes this viewpoint by selecting thresholds to satisfy user-specified scalar risk constraints under related assumptions (Angelopoulos et al., 2024; Bates et al., 2021), and extensions address covariate shift via reweighting or adaptation (Tibshirani et al., 2019; Fannjiang et al., 2022). Recent work also connects conformal methods to downstream decision-making: conformal decision theory calibrates decision rules (often online) to control realized risk (Lekeufack et al., 2023), and decision-theoretic foundations motivate conformal sets as uncertainty objects for risk-averse agents (Kiyani et al., 2025). Our focus is complementary. Rather than guaranteeing coverage or a single scalar risk functional, we certify finite-window operational rate vectors for a fixed deployed rule and support multi-objective exploration without committing to a scalar objective. The audited primitive is the intermediate geometry induced by calibration—the region partition and its joint region–label table—from which many deployment policies and KPIs follow by projection. Closest in spirit to our "menu" viewpoint is the inverse CRC of Zhou & Zhu (2025), who trace certified miscoverage–regret trade-offs across robustness levels for predict-then-optimize pipelines; they audit coverage and regret as the downstream objects of interest. Our work, however, targets an earlier deployment stage, where the cost model and operational targets are not yet fixed.
At that stage we audit and certify operational quantities—commit and deferral rates, error exposure, purity, and workload—and their 95% predictive envelopes, which directly govern resource usage and risk over finite windows.

A different line of work optimizes efficiency by selecting, from a family of candidate models, the conformal predictor that minimizes a set-size functional (e.g., interval length/volume, or prediction-set cardinality in classification) subject to validity constraints (Yang & Kuchibhotla, 2021). This is well motivated when downstream utility is monotone in set size. Our objective differs: in many deployments, smaller cardinality does not necessarily imply higher utility, nor lower uncertainty in the induced region-conditional label frequencies. In particular, aggressive singleton behavior can reduce average set size while increasing decisive error exposure and degrading the operational profile that stakeholders ultimately care about. When organizational priorities are still being determined, a set-size proxy may not be informative enough to guide decision-making. Accordingly, we treat conformal sets as instruments whose value is expressed through a vector of auditable KPIs, and we certify the feasible trade-offs among these rates in finite windows.

1.5 Roadmap

Section 2 formalizes the calibration-conditional viewpoint for a fixed deployed object and defines the region–label table as the auditable primitive. Section 3 presents SSBC and the Calibrate-and-Audit framework, yielding certified finite-window envelopes for operational rates and Pareto-based trade-off navigation. Section 4 characterizes regime boundaries and feasibility constraints induced by the conformal partition.
Section 5 validates the framework empirically on Tox21 toxicity prediction and aqueous solubility prediction, and Section 6 discusses limitations and extensions (multiclass, online adaptation, and distribution shift).

2 Setting and notation: calibration-conditional viewpoint

We study a single deployed predictor produced by (i) training a scoring model on a training set and then treating it as fixed, followed by (ii) a one-time split conformal calibration on an exchangeable calibration sample. Accordingly, all randomness in our analysis is over the one-time calibration draw and future deployment data; we do not analyze variability due to retraining or repeated recalibration. Later, to make operational rates beyond coverage auditable with certified finite-sample guarantees, we introduce a separate exchangeable audit set held out alongside the calibration sample.

2.1 Data splits and exchangeability assumptions

Let D_train denote a training dataset used to fit a base scoring model, for instance a probabilistic classifier (Goodfellow et al., 2016). After training, the resulting score function is treated as fixed. Let

D_cal = {(X^c_i, Y^c_i) : i = 1, …, n_cal},   D_audit = {(X^a_i, Y^a_i) : i = 1, …, n_audit}.
All guarantees are scop ed to this exc hangeable setting; we do not consider cov ariate or lab el shift in this pap er. In this w ork we focus on binary classification to isolate the core principles and trade-offs in the clearest setting; extensions to ric her output spaces are left for future work. 2.2 Sco res and calibration thresholds Let Y = { 1 , . . . , K } , and let s : X × Y → R denote a nonconformity (score) function (Shafer & V ovk, 2008; Lei et al., 2018). A common choice for probabilistic-like classifiers is s ( x, y ) = 1 − P ( y | x ) . While this sp ecific form is not required for the present section, it is used in the examples throughout the pap er. Giv en D cal , compute calibration nonconformity scores S i := s ( X c i , Y c i ) , i = 1 , . . . , n cal , and let S (1) ≤ · · · ≤ S ( n cal ) denote the sorted v alues. Split conformal calibration selects a threshold as an order statistic (P apadop oulos et al., 2002; V o vk, 2012a) τ := S ( k ) , where the index k ∈ { 1 , . . . , n cal } is determined by the desired miscov erage level α ⋆ (up to deterministic tie- breaking). It is often conv enient to parameterize the same grid by the misco v erage index u := n cal + 1 − k , corresp onding to the discrete miscov erage level α grid = u/ ( n cal + 1) . W e use class-c onditional split conformal throughout (V ovk, 2012a;b), where thresholds τ y are computed separately from calibration scores restricted to Y i = y . The p o oled split conformal conv ention is recov ered as the sp ecial case with shared thresholds across classes, i.e., τ y ≡ τ for all y . The key op erational p oint is that, given ( D cal , α ⋆ ) , calibration pro duces a vector of class-conditional thresh- olds τ = ( τ 1 , . . . , τ K ) , and these thresholds are then held fixed throughout deploymen t. 2.3 Regions, observability , and p olicies Once thresholds are fixed, they induce a finite discrete region by recording whic h threshold inequalities are satisfied. 
In the multiclass case it is convenient to write the score vector s(x) := (s(x,1), …, s(x,K)) and (optionally class-conditional) thresholds τ := (τ_1, …, τ_K), with τ_y ≡ τ in the pooled case. The region map is defined by

R_τ(x) := ( 1{s(x,1) ≤ τ_1}, …, 1{s(x,K) ≤ τ_K} ) ∈ {0,1}^K.

Thus calibration fixes a region map R_τ : X → {0,1}^K, i.e., a finite partition of score space used throughout deployment. The map R_τ is determined entirely by τ. However, the induced distribution of class labels Y given the region R_τ(X)—which regions carry mass and how much—depends on the deployment data-generating process. This discrete region partitioning is our decision and exploration interface.

Figure 2: Region definitions and region mass under different data regimes. (A) Thresholds τ_0, τ_1 induce a finite region partition R_τ(x) independent of data and policy. (B–C) Under probability-normalized scores, data support is restricted to the diagonal, determining which regions carry nonzero mass for different threshold configurations. (D) With unconstrained scores, all regions may carry mass. No deployment policy is applied in this figure.

Operationally, deployment is observed through the joint process (R_τ(X), Y). All quantities considered in this paper are functions of region frequencies and region-conditioned label composition under fixed thresholds. We treat region identity as the auditable primitive; within-region heterogeneity is intentionally not modeled in this framework. Accordingly, observations that share the same realized region label R_τ(X) = r are treated as exchangeable for the purpose of estimation and certification. To formally link regions to downstream consequences, we introduce the concept of a deployment policy.
A deployment policy π maps a realized region R_τ(x) to a reported output, such as a (possibly empty) set of labels—with abstention represented by ∅—or a single committed label. The deployed predictor is written as

Ĉ_π(x) := π(R_τ(x)).

The standard split conformal predictor corresponds to the set-inclusion policy (Romano et al., 2019; Barber et al., 2021)

π_SI(r) := {y ∈ Y : r_y = 1},   yielding   Ĉ_{π_SI}(x) = {y ∈ Y : s(x,y) ≤ τ_y},

with τ_y ≡ τ in the pooled case. The motivation for introducing policies is to separate calibration from action. Calibration fixes thresholds and therefore a finite set of region outcomes; deployment then chooses how to act on those outcomes. Writing Ĉ_π(x) = π(R_τ(x)) lets us swap or optimize decision rules for operational objectives (commit/defer/reject) without rewriting the calibration step, and it provides a common language for auditing and comparing policies. As an example beyond set inclusion, consider a region-triggered commit policy. In the binary case Y = {0,1}, let A ⊆ {0,1}^2 and a ∈ {0,1}, and define

π_{A→a}(r) = {a} if r ∈ A,   ∅ if r ∉ A.

In words, the policy commits to action a only when the realized conformal region belongs to the pre-specified trigger set A; otherwise it abstains. Thus A acts as an explicit commitment gate on the fixed calibrated region partition. Figure 3 visualizes several other deployment policies acting as deterministic projections on a fixed region structure.

Figure 3: Deployment policies as projections on a fixed region structure. Panel A shows the region partition R_τ(x). Panels B–D apply different region-based deployment policies π to the same regions and data support, yielding reported outputs Ĉ_π(x) = π(R_τ(x)). Differences between panels arise solely from the choice of policy.
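The separation between region map and policy can be sketched in a few lines; this is an illustrative implementation of R_τ, π_SI, and the region-triggered commit policy π_{A→a} (helper names are ours):

```python
def region_map(score_vec, taus):
    """R_tau(x): binary indicator of which class thresholds are satisfied."""
    return tuple(int(s <= t) for s, t in zip(score_vec, taus))

def policy_set_inclusion(r):
    """pi_SI: report every class whose threshold inequality holds."""
    return {y for y, bit in enumerate(r) if bit == 1}

def policy_region_commit(r, trigger_set, action):
    """pi_{A->a}: commit to `action` iff region r lies in the trigger set A;
    otherwise abstain (empty set)."""
    return {action} if r in trigger_set else set()
```

Swapping `policy_set_inclusion` for `policy_region_commit` changes the reported output without touching calibration, which is exactly the separation the policy abstraction is meant to provide.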
2.4 Auditing primitives: region–label tables and policy projections

Conditional on D_cal, the deployed rule is fixed, and the primitive observable for auditing is the pair (R_{τ(θ)}(X), Y) on an exchangeable labeled split. Two representations are useful: the full region–label table, and KPI indicators obtained as its projections. Here θ ∈ Θ denotes a calibration setting that determines thresholds τ(θ) and hence the region map R_{τ(θ)}.

(A) Region–label joint table. Define the calibration-conditional joint probabilities

p_{r,y}(θ) := P( R_{τ(θ)}(X) = r, Y = y | D_cal ),   (r, y) ∈ R × Y.

These are fixed (but unknown) constants for the deployed rule. The audit set is used to estimate them via the counts

K^audit_{r,y}(θ) := Σ_{i=1}^{n_audit} 1{ R_{τ(θ)}(X^a_i) = r, Y^a_i = y },   p̂^audit_{r,y}(θ) := K^audit_{r,y}(θ) / n_audit.

Note that the region map R_{τ(θ)} is fixed from D_cal, while the counts K^audit_{r,y}(θ) are computed on D_audit only. Because R_{τ(θ)} is a finite partition of score space, the region–label events (R_{τ(θ)}(X) = r, Y = y) form a finite partition of the sample space. Consequently, for every θ,

Σ_{r∈R} Σ_{y∈Y} p_{r,y}(θ) = 1,   Σ_{r∈R} Σ_{y∈Y} p̂^audit_{r,y}(θ) = 1.

Moreover, the label marginals are fixed by the deployment data-generating process:

Σ_{r∈R} p_{r,y}(θ) = P(Y = y | D_cal),   Σ_{r∈R} p̂^audit_{r,y}(θ) = P̂^audit(Y = y),

so varying θ reallocates mass across regions within each label. These conservation laws are the basic source of geometric coupling and explain why sweeping θ produces a constrained set of rate vectors rather than filling an ambient volume.

(B) Policy-specific indicators. Fix a deployment policy π. Any operational quantity of interest can be written as a Bernoulli indicator

I_ℓ := g_ℓ(R_{τ(θ)}(X), Y; π) ∈ {0,1},   p_ℓ(θ) := P( I_ℓ = 1 | D_cal ).
These operational targets are auditable event rates that summarize deployment behavior (e.g., abstention, decisiveness, decisive error exposure, or class-conditional region mass). For example, the abstention indicator is

g_abs(r, y; π) = 1{ π(r) = ∅ },

and a decisive error indicator is

g_err(r, y; π) = 1{ |π(r)| = 1, y ∉ π(r) }.

The corresponding audit count is

K^audit_ℓ(θ) := Σ_{i=1}^{n_audit} g_ℓ(R_{τ(θ)}(X^a_i), Y^a_i; π),   p̂^audit_ℓ(θ) := K^audit_ℓ(θ) / n_audit.

Projection identity. The two representations are linked by linearity: every linear KPI is obtained by summing selected region–label cells. On the audit sample,

K^audit_ℓ(θ) = Σ_{r∈R} Σ_{y∈Y} g_ℓ(r, y; π) K^audit_{r,y}(θ),   p̂^audit_ℓ(θ) = K^audit_ℓ(θ) / n_audit.   (1)

Taking expectations yields the corresponding identity for latent rates:

p_ℓ(θ) = Σ_{r∈R} Σ_{y∈Y} g_ℓ(r, y; π) p_{r,y}(θ).

Thus the region–label table is a reusable sufficient audit summary: audit it once, then compute policy-level KPIs by projection. The same structure extends to monotone transformations of linear KPIs (e.g., ratios or log-odds), since the region–label table remains the sufficient summary.

Worked example: coverage as a projection of the region–label table. Consider Y = {0,1} and the set-inclusion policy π_SI. For fixed thresholds, each input falls into one of four regions R = {r_10, r_11, r_01, r_00}, and the primitive auditing object is (R_τ(X), Y) ∈ R × Y. The corresponding region–label table is

  region           y = 0          y = 1          Ĉ_{π_SI}(X)
  r_10 = (1, 0)    p_{10,0}(θ)    p_{10,1}(θ)    {0}
  r_11 = (1, 1)    p_{11,0}(θ)    p_{11,1}(θ)    {0, 1}
  r_01 = (0, 1)    p_{01,0}(θ)    p_{01,1}(θ)    {1}
  r_00 = (0, 0)    p_{00,0}(θ)    p_{00,1}(θ)    ∅

with Σ_{r∈R} Σ_{y∈Y} p_{r,y}(θ) = 1.
Under π_SI, the event {Y ∈ Ĉ_{π_SI}(X)} corresponds to the four cells (r_10, 0), (r_11, 0), (r_11, 1), and (r_01, 1), hence

p_cov(θ) = p_{10,0}(θ) + p_{11,0}(θ) + p_{11,1}(θ) + p_{01,1}(θ)
         = 1 − ( p_{10,1}(θ) + p_{01,0}(θ) + p_{00,0}(θ) + p_{00,1}(θ) ).

Note that the calibration choices (α⋆_0, α⋆_1) determine the class-conditional miscoverage primitives—namely the joint rates p_{10,1}(θ) + p_{00,1}(θ) for Y = 1 and p_{01,0}(θ) + p_{00,0}(θ) for Y = 0—while the decomposition of these totals into wrong-singleton errors versus abstentions (including the overall empty-set rate p_{00,0}(θ) + p_{00,1}(θ)) is dictated by the deployment distribution. The example above makes explicit both (i) the projection form (coverage is a sum of cells) and (ii) the conservation constraint: once thresholds are fixed, coverage is determined by how mass is redistributed across regions. The same "sum selected cells" structure applies to abstention/deferral, decisiveness, and decisive error exposure under any fixed policy. This is the sense in which operational quantities become first-class: a single auditable summary yields many KPIs that can be computed (and later renegotiated) by projection, while making explicit the conservation constraints that drive trade-offs.

3 Operational quantities as first-class objects

Deployment behavior is summarized by operational rates induced by a calibrated threshold geometry together with a fixed deployment policy π. Consistent with Section 2, we keep three layers distinct: (i) geometry, given by thresholds and the induced region map R_{τ(θ)}; (ii) policy, given by a deterministic projection π : R → 2^Y; and (iii) rates, which quantify how often auditable region–label events occur under deployment.
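The audit-once, project-many workflow above can be sketched concretely: estimate the region–label table on an audit split, then obtain any linear KPI (coverage here) by summing selected cells via the projection identity (1). Function names are illustrative, not from the paper:

```python
from collections import Counter

def audit_region_label_table(score_vecs, labels, taus):
    """Estimate the joint region-label table {p_hat_{r,y}} on an audit split.

    Each entry maps a (region, label) cell to its empirical frequency.
    """
    counts = Counter()
    for svec, y in zip(score_vecs, labels):
        r = tuple(int(s <= t) for s, t in zip(svec, taus))  # region map R_tau
        counts[(r, y)] += 1
    n = len(labels)
    return {cell: k / n for cell, k in counts.items()}

def project_kpi(table, indicator):
    """Projection identity (1): a linear KPI is a sum of selected cells."""
    return sum(p for (r, y), p in table.items() if indicator(r, y))

# Coverage under the set-inclusion policy: Y is covered iff r_y = 1.
coverage = lambda table: project_kpi(table, lambda r, y: r[y] == 1)
```

Other KPIs (abstention, decisive error exposure) reuse the same table with a different indicator, which is the sense in which the table is a reusable sufficient audit summary.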
The goal of this section is to make operational quantities first-class: to specify which rate statements are certifiable, which data split is required, and how trade-offs are explored without committing to a scalar objective. Our perspective is explicitly predictive: we seek conservative, distribution-free statements about future realized operational behavior over finite windows, conditional on the one-time calibration draw. The intended output is an auditable trade-off map θ ↦ r(θ) linking calibration choices to operational profiles, together with an efficiency boundary. Here θ ∈ Θ denotes a calibration setting, i.e., an index for the deployed thresholds τ(θ) and induced region map R_{τ(θ)}.

3.1 Calibrate-and-Audit

Beyond marginal coverage, distribution-free finite-sample statements require an independent audit split (Bian & Barber, 2023; Gibbs et al., 2025). Coverage is special because its split-conformal guarantee is rank-based. For general region–label rates, reusing D_cal couples region–label indicators to the random threshold and breaks the exchangeable Bernoulli model (Geisser, 1993; Casella & Berger, 2002) needed for exact predictive envelopes; therefore certified envelopes require an independent D_audit (Appendix D). Using the audit split defined in Section 2, we keep thresholds (and thus R_{τ(θ)}) fixed from D_cal and compute all operational rate estimates and predictive envelopes on D_audit only. As with the train/calibrate separation, once the rule depends on data, certification of downstream behavior requires fresh exchangeable evaluation data. When a separate D_audit cannot be afforded, we provide a conservative procedure that aims to decouple threshold selection from rate estimation by evaluating each calibration point under a leave-one-out (LOO) proxy threshold that does not use that point. This proxy is intended for feasibility exploration rather than certification.
In simulations, this procedure tracks the certified two-split reference closely; see Appendix D.

3.2 Certified predictive envelopes for future windows

Section 2 defines the auditable primitive: the joint process (R_{τ(θ)}(X), Y) under fixed thresholds, together with KPI indicators obtained as linear projections of the corresponding region–label table (with coverage as a worked example). Under Calibrate-and-Audit, conditional on D_cal the deployed rule (hence R_{τ(θ)} and π) is fixed, and the audit points are exchangeable with future deployment points. Therefore, for any fixed indicator I_ℓ,

K^audit_ℓ(θ) | D_cal ∼ Binomial( n_audit, p_ℓ(θ) ),

and for a future operational window of size m with draws (X′_j, Y′_j), the future realized count

K^m_ℓ(θ) := Σ_{j=1}^m g_ℓ(R_{τ(θ)}(X′_j), Y′_j; π) | D_cal ∼ Binomial( m, p_ℓ(θ) ).

These conditional Binomial laws (Johnson et al., 1997) induce distribution-free finite-sample predictive envelopes for K^m_ℓ(θ), and hence for the realized window rate R̂_{ℓ,m} = K^m_ℓ(θ)/m, based on the audit count K^audit_ℓ(θ) observed on D_audit. The Beta–Binomial form is not a prior assumption on p_ℓ(θ). It is a closed-form expression for the predictive distribution of K^m_ℓ(θ) induced by the two-stage design: conditional on D_cal, the audit indicators and future-window indicators are i.i.d. Bernoulli trials with a common (unknown) success probability p_ℓ(θ). Observing the audit count K^audit_ℓ(θ) therefore determines exact finite-sample predictive tails for K^m_ℓ(θ) under exchangeability, which can be written in Beta–Binomial form. As a conservative alternative, the same audit rate p̂^audit_ℓ(θ) admits Hoeffding-type deviation inequalities (Hoeffding, 1963) for the latent rate p_ℓ(θ). These yield worst-case planning envelopes for future window rates; we record them in Appendix D.
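A Beta–Binomial predictive envelope of this kind is straightforward to compute numerically. The sketch below uses the Beta–Binomial(m, k+1, n−k+1) form that arises under a uniform-prior convention on p_ℓ; this parameterization is our illustrative assumption, and the paper's exact construction (Appendix D) may use a different convention:

```python
from scipy.stats import betabinom

def predictive_envelope(k_audit, n_audit, m, level=0.95):
    """Two-sided predictive envelope for the future-window count K_m,
    given an audit count k_audit out of n_audit trials.

    Uses a Beta-Binomial(m, k+1, n-k+1) predictive law (uniform-prior
    convention; an illustrative assumption).  Returns (lo, hi) counts.
    """
    a, b = k_audit + 1, n_audit - k_audit + 1
    tail = (1 - level) / 2
    lo = int(betabinom.ppf(tail, m, a, b))
    hi = int(betabinom.ppf(1 - tail, m, a, b))
    return lo, hi
```

Dividing the returned counts by m gives an envelope for the realized window rate R̂_{ℓ,m}; note the envelope widens as n_audit shrinks, reflecting audit uncertainty on top of window-level Binomial noise.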
3.3 Rate vectors, attainable operational sets, and Pareto filtering

Certified envelopes (Section 3.2) apply to any fixed region–label event rate at any fixed calibration setting, and to any linear projection thereof. Deployment planning is comparative: stakeholders want to understand how operational behavior varies across calibration settings, which combinations of rates are feasible for a given scoring model and coverage semantics, and where structural bottlenecks lie. This motivates treating the rate vector (Miettinen, 1999) as the decision interface. Let Θ denote a family of calibration settings (e.g., an α-grid or a class-conditional threshold family), and write R_{τ(θ)} for the induced region map at setting θ. Fix a deployment policy π and a list of L auditable KPIs, each represented by an indicator I_ℓ = g_ℓ(R_{τ(θ)}(X), Y; π). Define the calibration-conditional rate vector

r(θ) = ( r_1(θ), …, r_L(θ) ),   r_ℓ(θ) := P( I_ℓ = 1 | D_cal ).

Each coordinate is obtained by projection from the same audited region–label table via (1).

Attainable operational set and visualization. Sweeping θ traces the set of achievable operational profiles under the fixed policy π,

V(Θ; π) := { r(θ) : θ ∈ Θ } ⊂ R^L.

Because the region–label table obeys conservation constraints (sum-to-one and fixed label marginals), V(Θ; π) is typically a constrained set: changing θ reallocates probability mass across regions and thus across KPIs, rather than tuning rates independently. Trade-off plots (e.g., Figure 1) display selected coordinates of r̂^audit(θ), optionally annotated with finite-window predictive envelopes computed from the projected audit counts.

Orientation and Pareto filtering (no cost model). To summarize negotiation-relevant regimes without committing to a scalar objective, we apply an oriented Pareto filter.
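An oriented Pareto filter of this kind reduces to a standard dominance check on sign-flipped rate vectors. A minimal sketch (the helper name and example rates are hypothetical), where the orientation vector s ∈ {−1, +1}^L encodes whether each KPI is better large (+1) or small (−1):

```python
import numpy as np

def pareto_front(rates, orientation):
    """Oriented Pareto filter: return indices of settings whose oriented
    rate vector s * r(theta) is not dominated by any other setting.

    A point is dominated if another point is at least as good in every
    oriented coordinate and strictly better in at least one.
    """
    R = np.asarray(rates, dtype=float) * np.asarray(orientation, dtype=float)
    keep = []
    for i, ri in enumerate(R):
        dominated = any(
            np.all(rj >= ri) and np.any(rj > ri)
            for j, rj in enumerate(R) if j != i
        )
        if not dominated:
            keep.append(i)
    return keep
```

For example, with rates (commit frequency, decisive error exposure) and orientation (+1, −1), the filter keeps settings that commit often without paying for it in decisive error; incomparable settings all survive, which is exactly the negotiation front.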
Concretely, the analyst chooses, for each displayed KPI, whether it is desirable to be large or small (e.g., lower decisive error exposure is better; higher purity is better). This is an orientation choice, not a pricing model. Formally, fix an orientation vector s ∈ {−1, +1}^L and compare points using s ⊙ r(θ). A setting θ is nondominated if there is no θ′ such that s_ℓ r_ℓ(θ′) ≥ s_ℓ r_ℓ(θ) for all ℓ, and s_{ℓ0} r_{ℓ0}(θ′) > s_{ℓ0} r_{ℓ0}(θ) for some ℓ0. The nondominated set is the negotiation front: any later decision rule that imposes hard constraints or introduces scalar weights must select an operating point on (or near) this front.

The resulting trade-off map answers a planning question: within the permissible calibration family Θ, what operational behaviors are attainable under the fixed policy π, and which of those behaviors survive as negotiation-relevant regimes? Predictive envelopes then attach finite-window uncertainty to the frontier points, supporting robust regime selection under operational constraints, or can be included to shape the Pareto frontier directly.

3.4 SSBC as a calibration navigation coordinate

A trade-off map is most useful in practice when it admits a user-facing means of navigation tied to an agreed-upon semantic constraint. In this work we use coverage as that semantic anchor. Coverage is special among the KPIs considered above: under exchangeability, split conformal admits an exact finite-sample rank/Beta characterization of the calibration-conditional coverage (Marques, 2025). SSBC inverts this law to translate a semantic request (α⋆, δ) into a concrete discrete choice on the conformal calibration grid (Appendices B and C).

Split conformal calibration selects thresholds on a finite grid of order statistics; equivalently, it can be indexed by the discrete miscoverage level α_grid = u/(n_cal + 1) with u ∈ {1, ..., n_cal} (or by the equivalent order-statistic index k = n_cal + 1 − u; see Section 2.2). SSBC maps (α⋆, δ) to a single admissible index u⋆(α⋆, δ) by selecting the least conservative grid point whose induced calibration-conditional coverage satisfies the PAC-style tail requirement

P_{D_cal}( P(Y ∈ Ĉ(X) | D_cal) ≥ 1 − α⋆ ) ≥ 1 − δ.

Concretely, SSBC inverts the exact rank/Beta law and returns the boundary grid point that satisfies the δ-tail constraint with minimal conservatism.

This reduction is important for the geometry studied in Section 4: although a user specifies two parameters (α⋆, δ), SSBC returns a single discrete index u⋆, hence a single threshold choice (Appendix C). That choice fixes the region map R_{τ(θ)} and therefore fixes all KPIs in r(θ). In pooled split conformal there is one such index, yielding a one-parameter navigation along the conformal grid. In class-conditional calibration, SSBC is applied separately within each class. In the binary case this yields two indices

(u⋆_0, u⋆_1) = ( u⋆(α⋆_0, δ_0), u⋆(α⋆_1, δ_1) ),

which determine (τ_0, τ_1) and hence the calibration setting θ. Thus, while the user-facing specification appears four-dimensional (α⋆_0, δ_0, α⋆_1, δ_1), SSBC collapses it to a two-dimensional navigation coordinate on the calibration grid.

Implication for attainable sets. Because θ is determined by these discrete indices, the calibration family Θ admits a low-dimensional parameterization (one-dimensional in the pooled case, two-dimensional in the binary Mondrian case) even before accounting for the conservation constraints in the region–label table. Section 4 then makes the resulting coupling explicit in the probability-normalized binary setting and explains the sharp regime boundaries observed when sweeping these navigation indices.
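For the infinite-window (Beta-law) case, this inversion can be sketched with the standard library alone, using the identity that a Beta CDF with integer parameters equals a Binomial tail. The function names are illustrative; the exact procedure, including the finite-window Beta–Binomial variant, is in Appendix C.

```python
from math import comb

def beta_cdf_int(x, a, b):
    # P(Beta(a, b) <= x) for integer a, b, via the order-statistic identity
    # P(Beta(a, b) <= x) = P(Binomial(a + b - 1, x) >= a).
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1.0 - x)**(n - j) for j in range(a, n + 1))

def ssbc_index(n_cal, alpha_star, delta):
    # Least conservative (largest) grid index u such that the
    # calibration-conditional coverage, distributed Beta(k, u) with
    # k = n_cal + 1 - u, falls below 1 - alpha_star with probability <= delta.
    u_star = None
    for u in range(1, n_cal + 1):
        k = n_cal + 1 - u
        if beta_cdf_int(1.0 - alpha_star, k, u) <= delta:
            u_star = u  # violation probability still within the delta budget
    return u_star
```

For n_cal = 100 and (α⋆, δ) = (0.10, 0.10) this returns u⋆ = 6, matching the SSBC row of Table 1, whose Beta-law violation probability is 0.0576 ≤ δ.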
4 Consequences of a fixed conformal partition in the binary case

Fix a trained scoring function and a realized calibration draw. Calibration selects thresholds τ and therefore a fixed region process R_τ(X) (Section 2.3). All deployment-facing quantities in this paper are functionals of the joint region–label structure (R_τ(X), Y) (Section 3). In the binary case, treating the conformal interface as the object of planning has three concrete consequences: (i) feasibility is coupled: varying τ reallocates probability mass across finitely many region types rather than tuning rates independently; (ii) posteriors are piecewise-constant: any region-measurable policy acts only through within-region label composition; and (iii) distribution-free is not cost-agnostic: once actions are wired to regions, coherence with a cost or utility model imposes compatibility constraints that depend on the region–label table (Appendix F). We make these points explicit under probability-normalized scores.

4.1 Binary conformal geometry and regime boundaries

Let Y = {0, 1} and fix class-conditional thresholds τ = (τ_0, τ_1). The deployed interface is the region label

R_τ(x) := ( 1{s(x, 0) ≤ τ_0}, 1{s(x, 1) ≤ τ_1} ) ∈ {0, 1}²,

with outcomes 10, 11, 01, 00 (singleton-0, hedge, singleton-1, abstention under π_SI).

Probability-normalized scores induce a regime boundary. For probability-normalized scores s(x, y) = 1 − P(y | x) we have s(x, 0) + s(x, 1) = 1. Hence a point cannot satisfy both class thresholds unless τ_0 + τ_1 ≥ 1, and it cannot violate both unless τ_0 + τ_1 ≤ 1. Consequently:

• Hedging regime (τ_0 + τ_1 > 1): 11 may occur and 00 cannot (singletons + hedges; no abstention).
• Rejection regime (τ_0 + τ_1 < 1): 00 may occur and 11 cannot (singletons + abstention; no hedges).
• Boundary (τ_0 + τ_1 = 1): only 10 and 01 occur; under π_SI outputs are always singletons.
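Assuming probability-normalized scores, the regime structure above can be checked numerically; the helper below is a sketch with illustrative names.

```python
def region(p1, tau0, tau1):
    # Region label R_tau(x) under probability-normalized scores:
    # s(x, 0) = 1 - P(0 | x) = p1 and s(x, 1) = 1 - p1, so s(x,0) + s(x,1) = 1.
    return (int(p1 <= tau0), int(1.0 - p1 <= tau1))

grid = [i / 1000.0 for i in range(1001)]
hedging = {region(p, 0.7, 0.6) for p in grid}    # tau0 + tau1 = 1.3 > 1
rejection = {region(p, 0.4, 0.3) for p in grid}  # tau0 + tau1 = 0.7 < 1
# Hedging regime: hedges (1,1) occur, abstentions (0,0) cannot.
# Rejection regime: abstentions (0,0) occur, hedges (1,1) cannot.
```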
Crossing τ_0 + τ_1 = 1 therefore changes which outcome types can appear. This is a deterministic constraint induced by the score normalization, not an estimation artifact. Note that when τ_0 = τ_1 = 1/2, the associated classification rule is the argmax rule and coverage equals the classifier's accuracy.

Cross-threshold coupling within regimes. Within the hedging regime (τ_0 + τ_1 > 1), the diagonal support is partitioned into three contiguous intervals. Parameterize by u = s(x, 0) (so s(x, 1) = 1 − u):

u ∈ [0, 1 − τ_1) ⇒ R_τ(x) = 10,  u ∈ [1 − τ_1, τ_0] ⇒ R_τ(x) = 11,  u ∈ (τ_0, 1] ⇒ R_τ(x) = 01.

Thus region boundaries are governed by opposing thresholds: changing τ_1 moves the boundary that controls the mass of 10, while changing τ_0 moves the boundary that controls the mass of 01. Thresholds therefore act as mass-reallocation boundaries, not independent per-class knobs.

Sweeping calibration settings moves τ and reallocates mass across regions. When score support is low-dimensional, small changes in τ can trigger support discontinuities, and within regimes operational coordinates are coupled. Under probability-normalized scores, swapping (τ_0, τ_1) exchanges the roles of regions 11 and 00. If the deployment policy treats hedging and abstention identically, then this swap does not change action-level behavior, even though set-inclusion coverage changes.

4.2 Region–label tables, piecewise-constant posteriors, and coherent action

Fix τ and treat the region label R_τ(X) ∈ {0, 1}² as the only deployed observable. The joint region–label probabilities

p_{r,y} := P(R_τ(X) = r, Y = y | D_cal),  r ∈ {0, 1}², y ∈ {0, 1},

fully characterize the information available to any region-measurable policy. Write p_r := p_{r,0} + p_{r,1} for region mass.
Because the interface takes finitely many values, the region-conditioned label composition is constant within each region; downstream evaluation and decision-making therefore reduce to functions of (p_{r,0}, p_{r,1}), aggregated across regions using the same table {p_{r,y}}.

Definition (cost-coherence for region-based action). Let A be an action set and let L(a, y) be the cost of taking action a ∈ A when Y = y. A region-based policy π̃ : {0, 1}² → A induces calibration-conditional expected cost

R(π̃) := Σ_{r ∈ {0,1}²} Σ_{y ∈ {0,1}} L(π̃(r), y) p_{r,y}.

We say π̃ is cost-coherent for the deployed interface if, for every region with p_r > 0, it minimizes conditional risk within that region:

π̃(r) ∈ argmin_{a ∈ A} Σ_{y ∈ {0,1}} L(a, y) p_{r,y} / p_r  for all r with p_r > 0.

Because R(π̃) separates across regions for unconstrained region-based policies, this local condition is equivalent to global optimality (ties arbitrary). We use the local form to state per-region compatibility conditions.

4.3 Conformal partitions are not cost-agnostic under coherent action

Cost-coherence makes a simple point operational: once a conformal partition is deployed, a downstream convention is rational only relative to the information carried by the region label and the costs of available actions. In particular, even if a region is labeled as a singleton under π_SI, coherence is determined by the region–label composition (p_{r,0}, p_{r,1}), not by the set label itself.

Reject-option example (Chow-style abstention). Following Chow (1970), consider actions A = {0, 1, rej} with false positive cost c_10 > 0, false negative cost c_01 > 0, rejection cost c_rej ≥ 0, and zero cost for correct commitments. For each region r, define p_r := p_{r,0} + p_{r,1}.
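The region-level risk comparison for this example can be sketched directly; the masses in the usage note below are hypothetical, and the costs follow the convention just introduced.

```python
def coherent_action(p_r0, p_r1, c10, c01, c_rej):
    # Expected cost of each action in a region with joint masses (p_r0, p_r1):
    #   commit 0 -> false negatives: c01 charged on the class-1 mass,
    #   commit 1 -> false positives: c10 charged on the class-0 mass,
    #   reject   -> c_rej charged on the whole region mass.
    risks = {"0": p_r1 * c01, "1": p_r0 * c10, "rej": (p_r0 + p_r1) * c_rej}
    return min(risks, key=risks.get)
```

For a label-mixed "singleton" region such as (p_{r,0}, p_{r,1}) = (0.06, 0.04) with symmetric commitment costs and cheap deferral (c_10 = c_01 = 1, c_rej = 0.3), the coherent action is rejection, even though π_SI would emit the singleton {0}.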
Cost-coherence is decided by comparing the three region-level alternatives:

choose 0:   p_{r,1} c_01 ≤ p_{r,0} c_10  and  p_{r,1} c_01 ≤ p_r c_rej,
choose 1:   p_{r,0} c_10 ≤ p_{r,1} c_01  and  p_{r,0} c_10 ≤ p_r c_rej,
choose rej: p_r c_rej ≤ p_{r,1} c_01  and  p_r c_rej ≤ p_{r,0} c_10.

Thus, for a fixed convention, coherence means its prescribed action satisfies the corresponding inequalities in every region (ties allowed). For example, if the policy commits to class 0 in region 10 (singleton output {0} under π_SI), then coherence requires p_{10,1} c_01 ≤ p_{10,0} c_10 and p_{10,1} c_01 ≤ (p_{10,0} + p_{10,1}) c_rej. This also shows why the hard-wired convention "commit on singleton outputs; defer otherwise" is cost-coherent only if, in each singleton region with p_r > 0, the forced commitment has lower expected cost than both rejection and the opposite commitment: a region can be labeled by a singleton set under π_SI yet still favor rejection (or even the opposite commitment) for plausible cost ratios.

More generally, coherence of any fixed (interface, convention) pair can be expressed as explicit inequalities in the region–label table {p_{r,y}} and cost ratios such as (c_rej/c_01, c_rej/c_10); see Appendix F. That appendix also gives an inverse view: for a fixed convention, characterize the set of cost ratios under which it is coherent.

5 Results: validation and deployment-facing planning

This section reports results aligned with the paper's deployment-facing claims for a single deployed scoring model: the model is treated as fixed infrastructure and we vary only calibration settings (hence the induced region geometry). We focus on three questions: (i) SSBC coverage semantics under calibration randomness and finite deployment windows, (ii) predictive envelopes for deployment-facing operational rates beyond coverage, and (iii) scenario planning via trade-off maps with cost-coherence screening.
5.1 Numerical Simulations

This subsection validates two core claims: SSBC preserves the intended finite-window coverage semantics under calibration randomness, and Calibrate–and–Audit provides well-centered predictive envelopes for deployment-facing operational rates. Table 1 summarizes semantic coverage behavior, and Figure 4 compares envelope alignment between two-sample and LOO constructions.

5.1.1 Coverage: numerical realization of SSBC guarantees

This experiment instantiates the exact finite-sample laws underlying SSBC in the finite-window deployment regime, separating calibration randomness from deployment-window noise. Calibration nonconformity scores are drawn i.i.d. from a continuous, heavy-tailed reference distribution, chosen as the absolute Cauchy distribution |t_{ν=1}| to remain agnostic to model specifics. For each calibration set size n_cal, split conformal selects a threshold via a discrete miscoverage index u ∈ {1, ..., n_cal} with

α_grid = u / (n_cal + 1),  k = n_cal + 1 − u,  τ = S_(k).

We compare three index-selection rules at target (α⋆, δ) = (0.10, 0.10): nominal split conformal, a DKWM correction (Massart, 1990; Dvoretzky et al., 1956), and SSBC (Appendix C). In the latter case, we distinguish between the infinite-window and finite-window cases, which follow a Beta and a Beta–Binomial law, respectively.

Table 1 shows representative calibration sizes (full grid in Appendix C). Nominal split conformal under-controls calibration-conditional risk; DKWM enforces validity through strong conservatism. SSBC directly inverts the relevant finite-window law, yielding violation probabilities close to the target δ up to grid effects.

5.1.2 Operational rate envelopes: LOO versus two-sample reference

We next validate predictive envelopes for operational rates beyond coverage, where no distribution-free rank pivot exists.
The goal is to compare the certified two-sample Calibrate–and–Audit construction to a single-sample LOO surrogate intended for feasibility exploration.

Synthetic probability model. We generate predicted probabilities and labels so the score geometry is controlled and interpretable. Each draw produces Y ∈ {0, 1} with P(Y = 1) = p_class, where p_class ∈ {0.10, 0.50}. Conditional on Y, we draw P_1 via

P_1 | Y = 1 ∼ Beta(a, b),  P_1 | Y = 0 ∼ Beta(2, 7),

with (a, b) ∈ {(4, 3), (9, 3)}, form (P_0, P_1) = (1 − P_1, P_1), and use class-conditional scores S_y := 1 − P_y.

Table 1: Representative calibration-conditional coverage violation rates. Target miscoverage α⋆ = 0.10, confidence δ = 0.10, and deployment window size m_infer = 100. We report three representative calibration sizes; the complete grid is in Table 7.

n_cal  Method  u   α_grid  Obs     BetaTheory  BBTheory  α_cont
100    None    10  0.0990  0.4075  0.4513      0.4071    –
100    SSBC     6  0.0594  0.0960  0.0576      0.0956    –
100    DKWM     1  0.0099  0.0004  0.0000      0.0004    −0.0224
200    None    20  0.0995  0.4130  0.4655      0.4124    –
200    SSBC    13  0.0647  0.0974  0.0320      0.0980    –
200    DKWM     2  0.0100  0.0000  0.0000      0.0000    +0.0135
500    None    50  0.0998  0.4153  0.4782      0.4153    –
500    SSBC    34  0.0679  0.0949  0.0049      0.0955    –
500    DKWM    22  0.0439  0.0097  0.0000      0.0096    +0.0453

Two-sample reference and LOO surrogate. For each configuration, we draw independent datasets D_1 and D_2 of size N = 500. We calibrate on D_1 (including SSBC-adjusted thresholds at (α, δ) = (0.10, 0.10) in the finite-window regime), freeze the induced rule, and evaluate operational indicators on D_2, yielding exact Binomial/Beta–Binomial predictive envelopes. As a single-sample surrogate, we compute LOO operational indicators by evaluating each point under thresholds recomputed without that point, pool LOO counts, and map them to envelopes; optionally we widen intervals via controlled pessimization (Appendix D).
Figure 4 shows that LOO-based envelopes closely match the two-sample reference in both location and width across all four geometries, while inflation widens intervals without shifting centers. This supports using LOO as a practical proxy for planning when an explicit audit split is unavailable, while keeping certification tied to the two-sample design.

Figure 4: Operational rate envelopes and underlying score distributions. (A) Singleton rate and (B) singleton error for two class prevalences (p_class = 0.10, 0.50) and two class-1 generating distributions (Beta(4, 3) and Beta(9, 3)), with class 0 generated by Beta(2, 7). Red rectangles denote the two-sample Calibrate–and–Audit Beta–Binomial (BB) predictive envelope, and the dashed vertical line is the corresponding BB point estimate. Blue and orange markers with horizontal intervals show leave-one-out (LOO) envelopes computed from a single calibration dataset under two inflation levels (infl = 1, 2). (C1–C4) Histograms of the score geometry shown as predicted class-1 probability (equivalently, nonconformity to class 0) stratified by true class. These distributions make explicit how class prevalence and calibration geometry shape singleton mass and singleton error, and explain the regime-dependent asymmetry of the finite-sample envelopes.

These simulations support the calibration and envelope semantics used in the two real-data studies below.

5.2 Tox21: predictive envelopes under severe class imbalance

The Tox21 benchmark (Mayr et al., 2016; Huang et al., 2016) stress-tests the framework under severe class imbalance, where minority-class calibration counts can be well below 100 (Table 2). We evaluate (i) SSBC semantic coverage behavior under small class-conditional calibration sizes and (ii) finite-window operational envelopes in a practical setting.
Table 3 summarizes aggregated semantic behavior; endpoint-level operational envelope comparisons appear in Table 4 and Appendix H.

Tasks and modeling infrastructure. Tox21 comprises twelve binary toxicity endpoints (NR and SR pathways). Each endpoint is treated as an independent binary classification problem. Molecules are represented using RDKit (Landrum, 2025) 2D descriptors augmented with Morgan fingerprints (Rogers & Hahn, 2010) (radius 2, 128 bits), and classification is performed using CatBoost (Prokhorenkova et al., 2018) with fixed hyperparameters. For each endpoint, data are split randomly into training (50%), calibration (25%), and test (25%), and results are averaged over 100 splits. Conformal calibration is class-conditional and we compare: (i) standard split conformal, (ii) DKWM correction, and (iii) SSBC, all at (α, δ) = (0.10, 0.10).

Table 2: Tox21 dataset composition and effective class-conditional calibration sizes under the experimental protocol.

Endpoint      Total Samples  Positive Rate  Calib. Positives  Calib. Negatives
NR-AR         7265           4.3%           77                1739
NR-AR-LBD     6758           3.5%           59                1630
NR-AhR        6549           11.7%          192               1445
NR-Aromatase  5821           5.2%           75                1380
NR-ER         6193           12.8%          198               1350
NR-ER-LBD     6955           5.0%           87                1651
NR-PPAR-γ     6450           2.9%           46                1566
SR-ARE        5832           16.2%          235               1223
SR-ATAD5      7072           3.7%           66                1702
SR-HSE        6467           5.8%           93                1523
SR-MMP        5810           15.8%          229               1223
SR-p53        6774           6.2%           105               1587

5.2.1 Coverage semantics under small class-conditional calibration sizes

Table 3 aggregates empirical coverage, violation rates, and prediction-set statistics over all endpoints and splits. Nominal split conformal exhibits elevated violation probability in this conditional small-n regime. DKWM enforces validity through conservatism, inflating set sizes and reducing singleton frequency. SSBC substantially reduces violation probability while retaining more decisiveness than DKWM.

Table 3: Aggregated coverage and prediction-set statistics on Tox21.
Results are averaged over twelve endpoints and 100 random splits. Violation rate is the empirical probability that coverage falls below the target level.

Method                    Coverage  Violation Rate  Avg. Set Size  Singleton Rate
Standard Split Conformal  0.917     0.305           1.41           0.52
DKWM Correction           0.986     0.005           1.78           0.22
SSBC (proposed)           0.951     0.068           1.54           0.40

5.2.2 Operational envelopes as predictive summaries (LOO versus a two-sample reference)

With coverage risk fixed, we evaluate predictive envelopes for operational quantities over a finite deployment window. Operational KPIs are region–label indicators (Section 2.4); in the binary case, singleton mass, doublet mass, and wrong-singleton mass provide a compact KPI set tied directly to decisiveness and decisive error exposure.

Table 4: SR-MMP endpoint. Operational performance summary for the SR-MMP endpoint. All reported quantities are joint probabilities normalized by the total test-set size. Rows report the singleton and doublet joint event rates P(Y = c, |S| = 1) and P(Y = c, |S| = 2), together with the wrong-singleton rate P(Y = c, |S| = 1, ŷ ≠ Y), by true class. "Point estimate D_1" and "LOO 95% PI D_1" denote leave-one-out point estimates and their Beta–Binomial predictive intervals computed on the calibration data. "Observed D_2" reports empirical rates on an independent hold-out set. "BB 95% PI D_2" gives the Beta–Binomial predictive intervals for the corresponding D_2 quantities, accounting for calibration uncertainty and finite test size.

Operational quantity  Class    Point estimate D_1  LOO 95% PI D_1  Observed D_2  BB 95% PI D_2
Singleton rate        Class 0  0.662               [0.604, 0.718]  0.651         [0.615, 0.686]
                      Class 1  0.130               [0.092, 0.173]  0.117         [0.094, 0.142]
Doublet rate          Class 0  0.163               [0.121, 0.211]  0.201         [0.172, 0.231]
                      Class 1  0.045               [0.023, 0.074]  0.031         [0.020, 0.046]
Wrong-singleton rate  Class 0  0.073               [0.044, 0.107]  0.070         [0.052, 0.090]
                      Class 1  0.012               [0.002, 0.030]  0.011         [0.005, 0.020]

We compare the two-sample Calibrate–and–Audit reference to the single-sample LOO surrogate (Appendix D). Table 4 provides a representative endpoint-level breakdown; an additional endpoint is reported in Appendix H (Table 8). LOO point estimates track two-sample observed rates and intervals are comparable (often slightly wider), consistent with LOO primarily affecting predictive dispersion rather than centering.

5.3 R3: scenario planning on aqueous solubility

We now demonstrate the planning use-case: exploring attainable deployment-facing trade-offs across calibration settings without committing to a single scalar objective. We focus on aqueous solubility in drug development using AquaSolDB (Sorkun et al., 2019). The predictive model is treated as fixed infrastructure; variation is entirely through the calibration layer (SSBC), the induced conformal partition, and the downstream interpretation of the resulting finite interface.

Methodological setup (documented in Appendix I). This experiment is designed to separate (i) generalization across chemical space from (ii) exchangeable evaluation for conformal and finite-window forecasting. We therefore use scaffold-isolated training, followed by an exchangeable calibration/test design on the held-out scaffold-disjoint pool (Appendix I, dataset/partitioning/model details). Although the base classifier is trained on a three-class discretization of log S (Insoluble / Moderate / Soluble), deployment-facing planning is posed as a binary campaign objective by merging Moderate and Soluble into a single Soluble class and calibrating prediction sets in {Ins, Sol} (Appendix I, calibration and operationalization).
Operational planning additionally requires a scenario definition: rather than assuming the full calibration pool matches deployment, we restore exchangeability by design for a specified deployment scenario by restricting calibration to a chemically defined subpopulation (Appendix I, Section I.8). In this study the focal scenario is a lipophilic deployment regime {MolLogP > 3.5}, and SSBC thresholds (and all envelopes derived from them) are computed using the restricted calibration sample.

Operating points and audited outcomes. For each SSBC setting, calibration fixes thresholds and therefore a discrete binary conformal interface (singleton-{Sol}, singleton-{Ins}, or doublet {Sol, Ins}). We interpret singletons as decisive commitments and the doublet as deferral, and summarize each setting by the joint outcome allocation P(Y = y, C = c) over y ∈ {Sol, Ins} and c ∈ {{Sol}, {Ins}, {Sol, Ins}} (Table 5).

Table 5: Operational outcome categories induced by binary prediction sets.

True label Y  C(X) = {Sol}                C(X) = {Ins}                C(X) = {Ins, Sol}
Sol           TP (correct Sol singleton)  FN (loss)                   HP (hedged Sol)
Ins           FP (waste)                  TN (correct Ins singleton)  HN (hedged Ins)

Joint rates are the auditable primitive: they form a complete mass allocation over the set–label outcome space and map directly to expected finite-window counts (e.g., m × P(Y = y, C = c) for a window of size m). Because the conformal grid is discrete and behavior is determined by threshold geometry, many nominally distinct (α_0, δ_0, α_1, δ_1) settings collapse to identical realized interfaces and joint allocations; we therefore deduplicate settings and retain the least conservative representative among equivalence classes (Appendix I).

Trade-off map and cost-coherence analysis. Figure 5 embodies the main results of this study.
The left panel plots the attainable set in a KPI plane spanned by two soluble-class joint rates: irreversible exclusion P(Y = Sol, C = {Ins}) and deferral burden P(Y = Sol, C = {Sol, Ins}), with color encoding the decisive-correct soluble mass P(Y = Sol, C = {Sol}). We highlight nondominated regimes using an orientation-only Pareto filter over these three soluble-class quantities, yielding a shortlist of negotiation-relevant operating regimes.

Because a deployed conformal predictor is an information interface, we also screen regimes for rationality of a fixed downstream wiring given only the conformal output (Section 4.3). The right panel summarizes cost-coherence of the action policy {Sol} → Sol, {Ins} → Ins, {Sol, Ins} → rej over relative cost ratios λ = c_01/c_10 and ρ = c_rej/c_10. This analysis indicates that points on the Pareto front are cost-coherent for only a limited range of cost ratios: if the cost of irreversible exclusion is too high relative to the cost of deferral, or vice versa, then the fixed action policy incurs higher expected cost than the alternative actions in the affected regions.

Additional results and documentation (appendix). Appendix I provides the supporting methodological record and supplementary empirical summaries used to interpret Figure 5: (i) dataset and label construction, scaffold-isolated partitioning, features, and model protocol; (ii) scenario-conditional (tribe-based) calibration restriction and its interpretation; and (iii) diagnostics on the functional roles of (α_0, δ_0, α_1, δ_1) along the Pareto front, together with a discussion of how α- and δ-variation act as coarse versus fine adjustments under finite-sample SSBC geometry (Appendix I, Sections I.9–I.10).
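The orientation-only filter used here reduces to a dominance scan over audited rate vectors. The sketch below assumes rate vectors given as tuples and an orientation vector s ∈ {−1, +1}^L; names are illustrative, not the study's implementation.

```python
def pareto_front(points, orient):
    # Keep nondominated rate vectors. orient[l] = +1 if KPI l is
    # better-when-larger, -1 if better-when-smaller.
    def dominates(a, b):
        # a dominates b if a is at least as good in every oriented
        # coordinate and strictly better in at least one.
        diffs = [s * (x - y) for s, x, y in zip(orient, a, b)]
        return all(d >= 0 for d in diffs) and any(d > 0 for d in diffs)
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

With orient = (-1, -1) (lower exclusion and lower deferral both better), any setting that is worse in both coordinates than some other setting is dropped from the negotiation front.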
6 Discussion

This work uses conformal prediction as deployed decision infrastructure, where the object of interest is not a one-off marginal guarantee but the realized behavior of a fixed rule over a finite operational window. In that regime, the key quantities are operational rate vectors (commitment, abstention, and error exposure) that govern downstream workload and harm. The central message is that coverage is special: because it depends only on the rank of the future true-label score among calibration scores (David & Nagaraja, 2003), its calibration-conditional variability admits an exact, distribution-free finite-sample characterization. Most other deployment KPIs do not share this pivot structure; they depend on multi-score geometry and threshold interactions, so validity must be handled by explicit auditing and uncertainty propagation.

The first contribution, Small-Sample Beta Correction (SSBC), addresses a practical mismatch between a user's requested coverage level and what nominal split conformal delivers when the calibration set is small. The simulations show that nominal grid selection can yield large calibration-conditional violation probabilities, driven purely by calibration randomness, even when exchangeability holds. SSBC restores an operational semantics by selecting the calibration index through inversion of the exact Beta or Beta–Binomial laws, depending on whether one targets the long-run or finite-window coverage event. In contrast, DKWM-style corrections (Massart, 1990) achieve control primarily through conservatism, often pushing the system to overly abstain. SSBC therefore offers a useful middle ground: it is still distribution-free, but it allocates conservatism only where the finite-sample semantics demand it.
The second contribution, Calibrate–and–Audit, makes explicit a fact that is often implicit in practice: once calibrated, a conformal rule induces a finite partition of score space with a small set of region labels, and downstream systems inevitably map those region labels to actions. By estimating the region masses and region–label composition, the audit table marginalizes the calibration-induced geometry into a reusable empirical interface. This enables forward-looking predictive envelopes for operational rates over future windows, and supports scenario planning over attainable trade-offs across calibration settings.

Figure 5: Solubility scenario planning via conformal operating regimes and cost-coherence. Left: attainable joint-rate map induced by sweeping SSBC calibration settings (after deduplication). Axes report joint rates on soluble compounds: irreversible exclusion P(Y = Sol, C = {Insol}) (x-axis) and deferral burden P(Y = Sol, C = {Sol, Insol}) (y-axis). Color encodes the joint decisive-correct rate P(Y = Sol, C = {Sol}). Red rings mark Pareto-optimal operating points; thick rings mark unique KPI regimes, labeled by regime_id:multiplicity. Right: cost-coherence landscape for the fixed downstream convention {Sol} → Sol, {Insol} → Insol, {Sol, Insol} → rej, parameterized by cost ratios λ = c_01/c_10 (irreversible loss / downstream waste) and ρ = c_rej/c_10 (deferral / downstream waste). The heatmap shows the fraction of Pareto regimes for which this convention is cost-coherent given only the conformal output (via within-region label composition). Blue and red outlines give the union and intersection of feasible (λ, ρ) regions across the Pareto front; thin outlines show representative wedges for the unique KPI regimes.
The results illustrate that operational fronts can be explored without retraining the base model, and that uncertainty in these fronts is quantifiable when one treats calibration as a one-time random event and deployment as finite-window sampling.

A key conceptual implication is the feasibility coupling induced by a fixed partition. Because the same conformal interface must serve multiple downstream objectives, not all combinations of commit and error exposure are simultaneously achievable by tuning a single scalar calibration parameter. The regime structure observed in trade-off curves is therefore not an artifact of a particular synthetic model; it follows from the constrained geometry of the induced regions. This perspective also clarifies when cost-aware operation is coherent. Action coherence depends on within-region label frequencies; if a region is label-mixed, forced commitment can be cost-incoherent across plausible cost ratios.

Several limitations delineate the scope of these guarantees. First, the distribution-free statements rely on exchangeability; dataset shift or model drift (Tibshirani et al., 2019; Fannjiang et al., 2022) breaks the exact finite-sample laws and must be handled by monitoring, re-calibration (Vovk, 2015; Barber et al., 2021), or domain-adaptive variants. Second, SSBC is tied to the discrete split-conformal grid; unavoidable granularity appears at small n_cal, and very strict (α⋆, δ) targets may be unattainable without larger calibration sets. Third, the audit-based envelopes inherit variance from region sparsity; rare regions may require pooling, hierarchical smoothing, or targeted data collection. Finally, we focused on binary classification for clarity. Extending the action-coherence analysis to multi-class, structured outputs, or complex abstention policies is an important next step.
Overall, the paper argues for separating coverage guarantees from operational planning, i.e., anticipating what workload and harm rates a deployed conformal interface will induce. SSBC supplies the former, while Calibrate-and-Audit supplies the latter. Together, they support a deployment workflow in which coverage is treated as a calibrated contract and the remaining KPIs are treated as audited, forecasted quantities with explicit, distribution-free finite-sample uncertainty.

7 Conclusion

We studied conformal prediction in the setting that matters for deployment: a single calibrated rule reused over finite operational windows. In this regime, coverage alone admits an exact, distribution-free calibration-conditional law, while other operational KPIs require explicit auditing. SSBC provides a deterministic calibration-index selection rule that aligns a user's coverage request with a PAC-style operational semantics under small calibration sets, including the finite-window case via the induced Beta–Binomial law. Calibrate-and-Audit complements this by turning the calibrated conformal predictor into a reusable region–label interface, enabling predictive envelopes for future commitment, abstention, and error-exposure rates and supporting scenario planning over attainable trade-offs. Taken together, these tools move conformal prediction (Angelopoulos & Bates, 2023) from a single coverage guarantee toward a practical planning framework: coverage is treated as a contract with calibrated semantics, and the rest of the deployed behavior is forecasted and stress-tested with finite-sample uncertainty. Future work should focus on robust monitoring under shift, partition refinement to improve decision coherence, and extensions to richer output spaces and downstream policies.

A Appendix Outline & Notation

This appendix serves as a notation guide for the technical appendices that follow.
The remaining material is organized as follows:

• SSBC and coverage pivots: derivations and calibration-conditional guarantees for grid selection, finite-sample behavior, and pivot-based uncertainty control.
• Operational rates and LOO justification: construction of deployment-facing rate envelopes, together with single-sample leave-one-out surrogates and their assumptions.
• Geometry and binary partitions: structural analysis for region definitions and indicator decompositions used in operational auditing.
• Empirical case studies: additional experimental details and results for Tox21 and Solubility.

The notation below is shared across split conformal calibration, the Small-Sample Beta Correction (SSBC), and finite-window prediction of operational rates. The analysis mixes three layers: (i) discrete split-conformal thresholding via order statistics on calibration scores; (ii) SSBC grid selection and calibration-conditional coverage semantics; and (iii) predictive envelopes for deployment-facing operational rates (Calibrate-and-Audit and a single-sample leave-one-out (LOO) surrogate). To avoid collisions, we reserve $(k, u)$ exclusively for split conformal thresholding. Pooled LOO indicator counts use $N_{\mathrm{loo}}$.

Table 6: Key notation used across calibration, SSBC, and operational analysis.

Symbol | Meaning | Scope / Remarks

Data splits and window sizes
$D_{\mathrm{cal}}$ | Calibration dataset | Exchangeable sample used to set conformal thresholds; size $n_{\mathrm{cal}}$.
$D_{\mathrm{audit}}$ | Audit / evaluation dataset | Exchangeable sample used to estimate region–label or event counts for operational envelopes.
$n_{\mathrm{cal}}$ | Calibration sample size | Number of points in $D_{\mathrm{cal}}$.
$m$ | Future (deployment) window size | Number of future cases over which realized rates (and predictive envelopes) are defined.

Split conformal thresholding and SSBC
$s(x, y)$ | Nonconformity score function | Real-valued score for candidate label $y$ at input $x$; larger means "less conforming."
$S_i := s(X_i, Y_i)$ | True-label calibration scores | Scores computed on $D_{\mathrm{cal}}$; exchangeable with the future score $S_{n+1}$.
$S_{(k)}$ | $k$th order statistic of $\{S_i\}_{i=1}^{n_{\mathrm{cal}}}$ | Used to define the deployed split-conformal threshold $\tau$.
$k$ | Order-statistic index | Split conformal threshold index; $k \in \{1, \dots, n_{\mathrm{cal}}\}$.
$u := n_{\mathrm{cal}} + 1 - k$ | Miscoverage grid index | Equivalent parametrization of the conformal grid; $u \in \{1, \dots, n_{\mathrm{cal}}\}$.
$\alpha_{\mathrm{grid}} := \frac{u}{n_{\mathrm{cal}}+1}$ | Grid-aligned miscoverage level | Miscoverage induced by selecting index $u$ on the discrete conformal grid.
$\tau := S_{(k)}$ | Deployed conformal threshold | Fixed after calibration; determines the prediction-set map $C(\cdot)$.
$\alpha^\star$ | Target (requested) miscoverage | User-facing miscoverage level prior to discretization / SSBC selection.
$\delta$ | Confidence (risk) parameter | Tail probability for calibration-conditional under-coverage events (PAC-style semantics).
$p_{\mathrm{cov}}(D_{\mathrm{cal}})$ | Calibration-conditional coverage probability | $p_{\mathrm{cov}} := P(Y \in C(X) \mid D_{\mathrm{cal}})$; random over calibration draws. Under exchangeability, $p_{\mathrm{cov}} \sim \mathrm{Beta}(k, u)$ for split conformal.
$\hat{C}_m$ | Finite-window empirical coverage | $\hat{C}_m := \frac{1}{m}\sum_{j=1}^m \mathbf{1}\{Y'_j \in C(X'_j)\}$ over a future window of size $m$; $m = \infty$ denotes the infinite-window (calibration-conditional) regime.

Operational indicators, rates, and predictive envelopes
$C(X)$ | Conformal prediction set | Set-valued output of the deployed rule at input $X$.
$g_\ell(\cdot)$ | Operational event functional | Indicator map $g_\ell(C(X), Y) \in \{0, 1\}$ defining KPI/event $\ell$ (e.g., singleton, wrong-singleton, hedged-positive).
$I_{\ell,i}$ | Operational indicator (deployed rule) | $I_{\ell,i} := g_\ell(C(X_i), Y_i)$ on an evaluation or deployment sample.
$K^m_\ell$ | Future event count over a window | $K^m_\ell := \sum_{j=1}^m I_{\ell,j}$; the object governed by Binomial/Beta–Binomial predictive laws.
$\hat{R}_\ell := \frac{1}{m}\sum_{j=1}^m I_{\ell,j}$ | Realized operational rate over a window | Empirical rate $\hat{R}_\ell = K^m_\ell / m$ over a window of size $m$.

Single-sample LOO surrogate (one dataset reuse)
$Z_{\ell,i}$ | LOO operational indicator | $Z_{\ell,i} := g_\ell(C^{(-i)}(X_i), Y_i)$, where $C^{(-i)}$ is calibrated on $D_{\mathrm{cal}} \setminus \{(X_i, Y_i)\}$.
$N_{\mathrm{loo},\ell} := \sum_{i=1}^{n_{\mathrm{cal}}} Z_{\ell,i}$ | Pooled LOO indicator count | LOO "success" count used to form LOO-based estimates and envelopes for event $\ell$; distinct from conformal $(k, u)$.
$\hat{r}_{\mathrm{loo},\ell} := N_{\mathrm{loo},\ell} / n_{\mathrm{cal}}$ | Pooled LOO rate estimate | Convenience notation for the pooled LOO event frequency.
infl | Inflation factor (controlled pessimization) | Variance/width inflation used to hedge against dependence from single-sample reuse; implemented via $m_{\mathrm{eff}} = m / \mathrm{infl}$.

B Exact finite-sample law of calibration-conditional coverage

This appendix derives an exact finite-sample characterization of the calibration-conditional coverage of a fixed split conformal predictor under exchangeability. The key point is that coverage is a pure rank event: the future true-label score is compared to a calibration order statistic. This yields a distribution-free pivot and an exact Beta law for realized (calibration-conditional) coverage across calibration draws. This pivot is the input to SSBC (Appendix C).

B.1 Setup

Let $D_{\mathrm{cal}} = \{(X_i, Y_i)\}_{i=1}^n$ be exchangeable with a future test pair $(X_{n+1}, Y_{n+1})$. Let $s(x, y)$ be a nonconformity score and define
$$S_i := s(X_i, Y_i), \quad i = 1, \dots, n, \qquad S_{n+1} := s(X_{n+1}, Y_{n+1}).$$
Fix an index $k \in \{1, \dots, n\}$ and set
$$\tau := S_{(k)}, \qquad u := n + 1 - k, \qquad \alpha_{\mathrm{grid}} = \frac{u}{n+1}.$$
For the predictor calibrated on $D_{\mathrm{cal}}$, the calibration-conditional coverage probability is
$$p_{\mathrm{cov}}(D_{\mathrm{cal}}) := P(S_{n+1} \le \tau \mid D_{\mathrm{cal}}).$$
This random variable varies across calibration draws but is fixed once calibration is completed.
B.2 Rank pivot

Assume for clarity that the score distribution is continuous, so ties occur with probability zero (ties are addressed in Remark 1). Under exchangeability of $\{S_1, \dots, S_n, S_{n+1}\}$, the rank
$$R := \mathrm{rank}\big(S_{n+1} \text{ among } S_1, \dots, S_n, S_{n+1}\big)$$
is uniform on $\{1, \dots, n+1\}$ (David & Nagaraja, 2003). Since $\tau = S_{(k)}$,
$$\{S_{n+1} \le \tau\} \iff \{R \le k\}.$$
Thus coverage is the conditional probability of a rank event.

B.3 Exact distribution: Beta law

A standard order-statistic identity implies that the conditional probability of the rank event equals the $k$th order statistic of $n$ i.i.d. uniforms:
$$p_{\mathrm{cov}}(D_{\mathrm{cal}}) \stackrel{d}{=} U_{(k)}, \qquad U_{(k)} \sim \mathrm{Beta}(k, u), \qquad u = n + 1 - k,$$
where $U_{(k)}$ is the $k$th order statistic of $U_1, \dots, U_n \overset{\mathrm{i.i.d.}}{\sim} \mathrm{Unif}(0, 1)$ (David & Nagaraja, 2003); the Beta$(k, n+1-k)$ law is exactly that of the $k$th of $n$ uniform order statistics. Equivalently, for any $t \in [0, 1]$,
$$P\big(p_{\mathrm{cov}}(D_{\mathrm{cal}}) \le t\big) = I_t(k, u),$$
where $I_t(\cdot, \cdot)$ is the regularized incomplete Beta function. The law is distribution-free and depends only on $(n, k)$ (equivalently $(n, u)$).

Remark 1 (Discrete scores and ties). If scores have atoms, ranks are not almost surely unique. Exact pivots can be recovered by randomized tie-breaking; deterministic left/right-continuous conventions yield conservative bounds. The non-interpolated order-statistic thresholding convention used in the main text preserves finite-sample validity.

Why coverage is special. Coverage depends only on the rank of the future true-label score relative to the calibration scores, hence admits an exact finite-sample distribution-free law. Most other operational KPIs depend on the joint geometry of multiple label scores and threshold interactions and therefore do not admit an analogous pivot (Appendix D).
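The Beta law is easy to check numerically. The sketch below is an illustration (function names such as `simulate_pcov` are ours, not from the paper's codebase): for uniform scores, $p_{\mathrm{cov}}(D_{\mathrm{cal}}) = F(S_{(k)}) = S_{(k)}$, so drawing calibration sets and recording $S_{(k)}$ samples the Beta$(k, u)$ law directly; the exact CDF is evaluated without special functions via the binomial identity $P(p_{\mathrm{cov}} \le t) = P(\mathrm{Bin}(n, t) \ge k)$.

```python
import math
import random

def beta_cdf_int(t, k, u):
    """CDF of Beta(k, u) at t for integer k, u: Beta(k, u) is the k-th order
    statistic of n = k + u - 1 uniforms, so P(Z <= t) = P(Bin(n, t) >= k)."""
    n = k + u - 1
    return sum(math.comb(n, j) * t**j * (1 - t)**(n - j) for j in range(k, n + 1))

def simulate_pcov(n_cal, k, n_rep=20000, seed=0):
    """Uniform calibration scores: p_cov = F(S_(k)) = S_(k) for Unif(0,1)."""
    rng = random.Random(seed)
    return [sorted(rng.random() for _ in range(n_cal))[k - 1] for _ in range(n_rep)]

n_cal, k = 50, 46                 # u = n_cal + 1 - k = 5, i.e. alpha_grid ~ 0.098
u = n_cal + 1 - k
draws = simulate_pcov(n_cal, k)
emp_mean = sum(draws) / len(draws)
emp_tail = sum(d <= 0.9 for d in draws) / len(draws)
print(emp_mean, k / (n_cal + 1))          # Beta(k, u) mean is k / (k + u)
print(emp_tail, beta_cdf_int(0.9, k, u))  # P(p_cov <= 0.9) = I_0.9(k, u)
```

With these settings the exact tail $I_{0.9}(46, 5) \approx 0.431$, matching the BetaTheory entry for the uncorrected $u = 5$ grid point at $n_{\mathrm{cal}} = 50$ in Table 7.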
C Small-Sample Beta Correction (SSBC)

SSBC is a deterministic index-selection rule for split conformal calibration that makes a user request $(\alpha^\star, \delta)$ operationally precise for the single deployed predictor obtained after one calibration draw. SSBC selects a conformal grid index (equivalently an order statistic) so that, with probability at least $1 - \delta$ over calibration randomness, the realized coverage of the deployed rule is at least $1 - \alpha^\star$. In the finite-window variant, the same guarantee is imposed on empirical coverage over a future window of size $m$. The construction relies only on the exact finite-sample law of calibration-conditional coverage from Appendix B.

C.1 Setup and conformal grid

Let $n_{\mathrm{cal}}$ be the calibration size and let $S_i := s(X_i, Y_i)$ denote true-label nonconformity scores for $i = 1, \dots, n_{\mathrm{cal}}$. Split conformal selects a threshold as an order statistic:
$$\tau = S_{(k)}, \qquad k \in \{1, \dots, n_{\mathrm{cal}}\}.$$
It is convenient to re-index the same grid by the miscoverage index
$$u := n_{\mathrm{cal}} + 1 - k \in \{1, \dots, n_{\mathrm{cal}}\}, \qquad \alpha_{\mathrm{grid}} = \frac{u}{n_{\mathrm{cal}} + 1}, \qquad k = n_{\mathrm{cal}} + 1 - u.$$
Here $\alpha^\star \in (0, 1)$ is the requested miscoverage and $\delta \in (0, 1)$ is a confidence/risk level controlling the probability (over calibration draws) that the requested semantics fail.

C.2 Exact law of realized (calibration-conditional) coverage

For the fixed predictor calibrated on $D_{\mathrm{cal}}$, define the calibration-conditional coverage probability
$$p_{\mathrm{cov}}(D_{\mathrm{cal}}) := P(S_{n+1} \le \tau \mid D_{\mathrm{cal}}),$$
where $(X_{n+1}, Y_{n+1})$ is exchangeable with the calibration sample. Under exchangeability, Appendix B shows that
$$p_{\mathrm{cov}}(D_{\mathrm{cal}}) \sim \mathrm{Beta}(k, u), \qquad k = n_{\mathrm{cal}} + 1 - u,$$
exactly and distribution-free. This law describes how the realized coverage of the deployed predictor varies across hypothetical recalibrations, while treating the deployed rule as fixed after the one calibration step.
C.3 SSBC objective: calibration-conditional PAC semantics

SSBC enforces the calibration-conditional PAC-style constraint
$$P_{D_{\mathrm{cal}}}\big(p_{\mathrm{cov}}(D_{\mathrm{cal}}) \ge 1 - \alpha^\star\big) \ge 1 - \delta.$$
Using the Beta law, this is equivalent to
$$P(Z \ge 1 - \alpha^\star) \ge 1 - \delta, \qquad Z \sim \mathrm{Beta}(k, u).$$

Selection rule (least conservative admissible grid point). Among discrete grid indices $u \in \{1, \dots, n_{\mathrm{cal}}\}$, SSBC selects the largest admissible $u$ satisfying the tail constraint above:
$$u^\star := \max\Big\{u \in \{1, \dots, n_{\mathrm{cal}}\} : P(Z \ge 1 - \alpha^\star) \ge 1 - \delta,\ Z \sim \mathrm{Beta}(n_{\mathrm{cal}} + 1 - u,\ u)\Big\}.$$
Equivalently, SSBC deploys the least conservative grid miscoverage level $\alpha_{\mathrm{adj}} = u^\star / (n_{\mathrm{cal}} + 1)$ that certifies the requested semantics. The returned order-statistic index is $k_{\mathrm{adj}} = n_{\mathrm{cal}} + 1 - u^\star$.

C.4 Finite-window deployment semantics

In many deployments, coverage is evaluated over a finite window of size $m$. Define empirical coverage over that window by
$$\hat{C}_m := \frac{1}{m} \sum_{j=1}^m I_{\mathrm{cov},j}, \qquad I_{\mathrm{cov},j} := \mathbf{1}\{Y'_j \in C(X'_j)\}, \qquad S_m := m \hat{C}_m.$$
Conditional on the calibration-conditional coverage probability $p_{\mathrm{cov}}(D_{\mathrm{cal}}) = p$,
$$S_m \mid p \sim \mathrm{Binomial}(m, p).$$
Marginalizing $p \sim \mathrm{Beta}(k, u)$ yields the exact mixture law
$$S_m \sim \mathrm{Beta\text{-}Binomial}(m; k, u), \qquad k = n_{\mathrm{cal}} + 1 - u.$$
The finite-window SSBC criterion selects $u$ such that
$$P\big(\hat{C}_m \ge 1 - \alpha^\star\big) \ge 1 - \delta,$$
where the probability is over both calibration randomness and the future window.

C.4.1 Strict lower-tail convention

Because $\hat{C}_m$ is discrete, we adopt a strict violation convention $\hat{C}_m < 1 - \alpha^\star$. Define the corresponding count threshold
$$x^\star := \lfloor (1 - \alpha^\star) m \rfloor + 1,$$
so that $\{\hat{C}_m \ge 1 - \alpha^\star\} \iff \{S_m \ge x^\star\}$. All Beta–Binomial tail probabilities in SSBC are evaluated for the event $\{S_m \ge x^\star\}$. This avoids boundary ambiguity when $(1 - \alpha^\star) m$ is an integer and preserves monotonicity in $u$.

C.5 Infinite-window limit

Proposition 2 (Infinite-window limit of SSBC). Fix $n_{\mathrm{cal}}$, $\alpha^\star$, and $\delta$.
Let $u_m$ be the SSBC-selected index obtained by inverting the Beta–Binomial tail for window size $m$, and let $u_\infty$ be the SSBC-selected index obtained by inverting the Beta tail. Then $u_m \to u_\infty$ as $m \to \infty$.

Proof sketch. Conditional on $p_{\mathrm{cov}} = p$, $\hat{C}_m \to p$ almost surely as $m \to \infty$. Therefore the Beta–Binomial mixture converges weakly to the Beta law for $p_{\mathrm{cov}}$, and the corresponding admissibility conditions for grid points converge. □

C.6 Feasibility and saturation

Not every pair $(\alpha^\star, \delta)$ is feasible at fixed $n_{\mathrm{cal}}$ because calibration choices lie on the conformal grid. Under the most conservative grid point $u = 1$ (equivalently $k = n_{\mathrm{cal}}$),
$$p_{\mathrm{cov}} \sim \mathrm{Beta}(n_{\mathrm{cal}}, 1), \qquad P(p_{\mathrm{cov}} \ge 1 - \alpha) = 1 - (1 - \alpha)^{n_{\mathrm{cal}}}.$$
Thus any infinite-window PAC requirement at confidence $1 - \delta$ must satisfy
$$\alpha \ge 1 - \delta^{1/n_{\mathrm{cal}}}.$$
If this condition is violated (and likewise in the finite-window analogue), no grid point can satisfy the tail constraint and SSBC returns Infeasible.

Figure 6: Semantic interpretation of nominal coverage requests under finite calibration. Each panel visualizes the effective calibration level $\alpha_{\mathrm{adj}}$ selected by SSBC as a function of the user-specified miscoverage level $\alpha$ and confidence level $\delta$, for fixed $n_{\mathrm{cal}}$. Color encodes the deployed level $\alpha_{\mathrm{adj}}$, while contours indicate iso-semantic sets: distinct nominal requests that induce the same deployed calibration grid point. The feasibility boundary reflects the finite-sample constraint $\alpha \gtrsim 1 - \delta^{1/n_{\mathrm{cal}}}$.

C.7 Coverage semantics under finite calibration

To visualize how nominal requests map to deployed semantics under finite calibration, Figure 6 plots the effective calibration level $\alpha_{\mathrm{adj}}$ selected by SSBC as a function of the user inputs $(\alpha, \delta)$ at fixed $n_{\mathrm{cal}}$. Distinct nominal requests can induce the same deployed grid index and therefore the same realized coverage semantics.
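For integer grid parameters the Beta tail reduces to a binomial sum, so the infinite-window selection rule and the feasibility condition can be sketched with the standard library alone (function names are ours; this mirrors the selection rule of Appendix C.3, not a reference implementation):

```python
import math

def beta_upper_tail(t, a, b):
    """P(Z >= t) for Z ~ Beta(a, b) with integer a, b: Z is the a-th order
    statistic of n = a + b - 1 uniforms, so P(Z >= t) = P(Bin(n, t) <= a - 1)."""
    n = a + b - 1
    return sum(math.comb(n, j) * t**j * (1 - t)**(n - j) for j in range(a))

def ssbc_infinite(alpha_star, n_cal, delta):
    """Largest admissible miscoverage grid index u (least conservative point)
    with P(p_cov >= 1 - alpha_star) >= 1 - delta; None means Infeasible."""
    t = 1.0 - alpha_star
    u_star = None
    for u in range(1, n_cal + 1):  # admissibility is monotone decreasing in u
        if beta_upper_tail(t, n_cal + 1 - u, u) >= 1.0 - delta:
            u_star = u
    if u_star is None:
        return None
    return u_star / (n_cal + 1), n_cal + 1 - u_star  # (alpha_adj, k_adj)

print(ssbc_infinite(0.10, 50, 0.10))  # selects u* = 2, i.e. alpha_adj = 2/51
print(ssbc_infinite(0.04, 50, 0.10))  # below 1 - delta**(1/50) ~ 0.045: None
```

The first call reproduces the SSBC row for $n_{\mathrm{cal}} = 50$ in Table 7 ($u = 2$, $\alpha_{\mathrm{grid}} \approx 0.0392$); the second exercises the feasibility boundary of Appendix C.6.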
C.8 SSBC algorithm (deterministic specification)

This subsection provides a reproducible implementation-level specification. The algorithm returns the largest admissible $u$ (least conservative grid point) satisfying the relevant tail constraint.

Relation to DKWM-style calibration. DKWM-style calibration modifies the nominal grid choice to enforce conservative, worst-case guarantees uniformly over calibration draws and distributions. SSBC addresses a different question: it assigns a calibration-conditional PAC meaning to a user request $(\alpha^\star, \delta)$ for the single deployed predictor produced after one calibration. DKWM targets uniform validity across hypothetical recalibrations; SSBC targets admissibility of the realized rule via exact Beta (or Beta–Binomial) tails.

C.9 Extended simulation table for calibration-conditional violations

Table 7 reports the full calibration-size grid for the simulation summarized in Section 5.1.1.

D Single-sample structural coupling, leave-one-out decoupling, and envelope inflation

The two-stage predictive reference used throughout this paper separates calibration (choose a conformal threshold on $D_{\mathrm{cal}}$) from operational evaluation (estimate rates on an independent window). This separation is what makes window indicators behave as i.i.d. Bernoulli draws under a fixed deployed rule.
Table 7: Calibration-conditional coverage violation rates with theory. Target miscoverage $\alpha^\star = 0.10$, confidence $\delta = 0.10$. $\alpha_{\mathrm{grid}}$ is the grid point selected on the conformal grid, $\alpha_{\mathrm{cont}}$ is the requested value under the DKWM correction, and $m_{\mathrm{cal}}$ is the calibration-window size. Results are based on $10^6$ calibration draws and a finite deployment window of size $m_{\mathrm{infer}} = 100$. The Obs column reports $P(\hat{C}_m < 1 - \alpha^\star)$. BetaTheory reports $P(p_{\mathrm{cov}} < 1 - \alpha^\star)$ under $p_{\mathrm{cov}} \sim \mathrm{Beta}(k, u)$. BBTheory reports $P(\hat{C}_m < 1 - \alpha^\star)$ under the induced Beta–Binomial law.

n_cal  Method  u   alpha_grid  Obs     BetaTheory  BBTheory  alpha_cont
50     None    5   0.0980      0.3963  0.4312      0.3964    –
50     SSBC    2   0.0392      0.0476  0.0338      0.0472    –
50     DKWM    1   0.0196      0.0096  0.0052      0.0095    −0.0731
75     None    7   0.0921      0.3454  0.3673      0.3464    –
75     SSBC    4   0.0526      0.0769  0.0504      0.0768    –
75     DKWM    1   0.0132      0.0016  0.0004      0.0017    −0.0413
100    None    10  0.0990      0.4075  0.4513      0.4071    –
100    SSBC    6   0.0594      0.0960  0.0576      0.0956    –
100    DKWM    1   0.0099      0.0004  0.0000      0.0004    −0.0224
150    None    15  0.0993      0.4119  0.4602      0.4107    –
150    SSBC    9   0.0596      0.0804  0.0307      0.0801    –
150    DKWM    1   0.0066      0.0000  0.0000      0.0000    +0.0001
200    None    20  0.0995      0.4130  0.4655      0.4124    –
200    SSBC    13  0.0647      0.0974  0.0320      0.0980    –
200    DKWM    2   0.0100      0.0000  0.0000      0.0000    +0.0135
250    None    25  0.0996      0.4126  0.4692      0.4134    –
250    SSBC    16  0.0637      0.0863  0.0175      0.0858    –
250    DKWM    5   0.0199      0.0003  0.0000      0.0003    +0.0226
300    None    30  0.0997      0.4139  0.4719      0.4141    –
300    SSBC    20  0.0664      0.0969  0.0171      0.0971    –
300    DKWM    8   0.0266      0.0009  0.0000      0.0009    +0.0293
500    None    50  0.0998      0.4153  0.4782      0.4153    –
500    SSBC    34  0.0679      0.0949  0.0049      0.0955    –
500    DKWM    22  0.0439      0.0097  0.0000      0.0096    +0.0453

Algorithm 1 Small-Sample Beta Correction (SSBC)
Require: target miscoverage $\alpha^\star \in (0, 1)$; calibration size $n_{\mathrm{cal}} \in \mathbb{N}$; confidence $\delta \in (0, 1)$; deployment regime $\in \{\infty, m\}$ (window size $m$ if finite)
Ensure: adjusted grid level $\alpha_{\mathrm{adj}}$ and index $k_{\mathrm{adj}}$, or Infeasible
1: $t \leftarrow 1 - \alpha^\star$
2: $u^\star \leftarrow -1$
3: if regime $= m$ then
4:   $x^\star \leftarrow \lfloor t m \rfloor + 1$
5: end if
6: for $u = 1, \dots, n_{\mathrm{cal}}$ do
7:   $a \leftarrow n_{\mathrm{cal}} + 1 - u$, $b \leftarrow u$
8:   if regime $= \infty$ then
9:     $p_{\mathrm{tail}} \leftarrow \Pr[Z \ge t]$, $Z \sim \mathrm{Beta}(a, b)$
10:    else
11:     $p_{\mathrm{tail}} \leftarrow \Pr[X \ge x^\star]$, $X \sim \mathrm{Beta\text{-}Binomial}(m; a, b)$
12:    end if
13:    if $p_{\mathrm{tail}} \ge 1 - \delta$ then
14:     $u^\star \leftarrow u$
15:    end if
16: end for
17: if $u^\star < 0$ then
18:   return Infeasible
19: end if
20: $\alpha_{\mathrm{adj}} \leftarrow u^\star / (n_{\mathrm{cal}} + 1)$
21: $k_{\mathrm{adj}} \leftarrow n_{\mathrm{cal}} + 1 - u^\star$
22: return $\alpha_{\mathrm{adj}}, k_{\mathrm{adj}}$

In practice, an independent audit window is often unavailable.
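The finite-window branch of Algorithm 1 needs a Beta–Binomial upper tail. For the integer parameters produced by the conformal grid this is available in closed (Pólya) form from log-gamma functions; the sketch below (our helper names, standard library only, illustration rather than reference code) evaluates $P(S_m \ge x^\star)$ under $S_m \sim \mathrm{Beta\text{-}Binomial}(m; a, b)$:

```python
import math

def log_comb(n, k):
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def betabinom_pmf(s, m, a, b):
    """P(S = s) for S ~ Beta-Binomial(m; a, b), integer a, b >= 1 (Polya form):
    C(a+s-1, s) * C(b+m-s-1, m-s) / C(a+b+m-1, m)."""
    return math.exp(log_comb(a + s - 1, s) + log_comb(b + m - s - 1, m - s)
                    - log_comb(a + b + m - 1, m))

def ssbc_finite(alpha_star, n_cal, delta, m):
    """Finite-window SSBC: largest u with P(S_m >= x_star) >= 1 - delta,
    where S_m ~ Beta-Binomial(m; n_cal+1-u, u); None means Infeasible."""
    x_star = math.floor((1.0 - alpha_star) * m) + 1
    u_star = None
    for u in range(1, n_cal + 1):  # tail is monotone decreasing in u
        a = n_cal + 1 - u
        tail = sum(betabinom_pmf(s, m, a, u) for s in range(x_star, m + 1))
        if tail >= 1.0 - delta:
            u_star = u
    return u_star

# Reproduces the SSBC row of Table 7 for n_cal = 50, m = 100: u* = 2.
print(ssbc_finite(0.10, 50, 0.10, 100))
```

As a cross-check, the induced violation probability $P(\hat{C}_{100} < 0.9)$ at $u = 2$ evaluates to about 0.047, matching the BBTheory entry 0.0472 in Table 7.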
When the same sample is reused both to select the threshold and to estimate operational rates, threshold selection and evaluation become coupled. This appendix records (i) a minimal structural reason for reuse-induced dependence, (ii) a data-efficient remedy, leave-one-out (LOO) recalibration, that provides effective practical decoupling in our regimes, and (iii) an inflation parameter infl that pessimizes predictive envelopes when residual dependence or regime instability remains.

D.1 Why single-sample reuse introduces dependence

Let $D_{\mathrm{cal}} = \{(X_i, Y_i)\}_{i=1}^n$ be an exchangeable calibration sample, let $S_i := s(X_i, Y_i)$ be true-label nonconformity scores, and let the split conformal threshold be the $k$th order statistic
$$\hat{\tau} := S_{(k)}, \qquad k \in \{1, \dots, n\},$$
assuming continuity so ties occur with probability zero. For any threshold $t$, define the crossing indicator $I_i(t) := \mathbf{1}\{S_i \le t\}$.

Lemma 3 (Order-statistic coupling under reuse). Assume $S_1, \dots, S_n$ are exchangeable and continuous, and let $\hat{\tau} = S_{(k)}$. Then, conditional on $\hat{\tau} = t$,
$$\sum_{i=1}^n I_i(t) = k \quad \text{almost surely}.$$
Consequently, $\{I_i(t)\}_{i=1}^n$ are not conditionally independent given $\hat{\tau} = t$, and for any $i \ne j$,
$$\mathrm{Cov}\big(I_i(t), I_j(t) \mid \hat{\tau} = t\big) = \frac{k(k - n)}{n^2 (n - 1)} < 0 \qquad (k < n).$$

Proof. Conditioning on $\hat{\tau} = t$ fixes exactly $k - 1$ scores strictly below $t$ and one score equal to $t$, hence $\sum_{i=1}^n I_i(t) = k$ deterministically. The covariance identity follows from exchangeability and counting.

Implication for operational envelopes. Operational indicators are functions of the deployed prediction set $C_{\hat{\tau}}(X) = \{y : s(X, y) \le \hat{\tau}\}$ and therefore inherit reuse-induced dependence. If one naïvely treats reuse-based indicators as i.i.d. Bernoulli under a fixed rule, predictive envelopes can become under-dispersed.
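Lemma 3's covariance follows from direct counting: conditional on $\hat{\tau} = S_{(k)}$, exactly $k$ of the $n$ exchangeable indicators equal one, so $E[I_i] = k/n$ and $E[I_i I_j] = k(k-1)/(n(n-1))$ for $i \ne j$. A minimal sketch checking this against the closed form (function name is ours):

```python
def crossing_cov(n, k):
    """Exact Cov(I_i, I_j | tau = S_(k)) for i != j: with exactly k of n
    exchangeable indicators equal to 1, E[I_i] = k/n and
    E[I_i I_j] = k(k-1)/(n(n-1))."""
    p1 = k / n
    p11 = k * (k - 1) / (n * (n - 1))
    return p11 - p1 * p1

# Matches the closed form k(k - n) / (n^2 (n - 1)); negative whenever k < n.
n, k = 50, 45
print(crossing_cov(n, k), k * (k - n) / (n**2 * (n - 1)))
```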
Empirically in our studies, the dominant distortion is in dispersion: point estimates often remain close to a two-stage reference, while intervals can be too narrow without decoupling or pessimization.

D.2 Approximate decoupling via leave-one-out (LOO)

When an independent audit sample is unavailable, we use LOO recalibration (Vovk, 2015; Barber et al., 2021) to reduce self-influence. Let $\hat{\tau}_{-i}$ be the split conformal threshold computed on $D_{\mathrm{cal}} \setminus \{(X_i, Y_i)\}$, and let $C_{-i}(\cdot)$ be the corresponding prediction set map. For an operational event functional $g_j(C(X), Y) \in \{0, 1\}$ (e.g., singleton, doublet, wrong-singleton), define LOO indicators
$$Z_{i,j} := g_j\big(C_{-i}(X_i), Y_i\big), \qquad i = 1, \dots, n,$$
and pooled LOO summaries
$$k_{\mathrm{pool},j} := \sum_{i=1}^n Z_{i,j}, \qquad \hat{r}^{\mathrm{LOO}}_j := \frac{k_{\mathrm{pool},j}}{n}.$$

Proxy two-stage interpretation. Each $Z_{i,j}$ is evaluated under a rule that does not use point $i$, restoring a localized separation between rule construction and evaluation. Although the rule varies across folds, each fold differs only slightly from the full calibrated rule, so $\{Z_{i,j}\}$ can be read as indicators from nearby operating regimes. Pooling provides a direct empirical proxy for operational behavior under finite calibration.

Empirical decoupling. In the regimes studied (Section 5.1.2), LOO envelopes track the two-stage Calibrate-and-Audit reference closely in center and often slightly pessimistically in width. We therefore use pooled LOO indicators for planning when a separate audit set is unavailable.

D.3 Envelope inflation as controlled pessimization

LOO is local and does not remove all dependence, especially near regime boundaries where small threshold shifts can change region support. We therefore introduce an explicit inflation parameter $\mathrm{infl} \ge 1$ that widens predictive envelopes by shrinking an effective sample size used in the predictive model.

Operational effect.
In constructions below we replace a nominal proxy sample size $n$ by $n_{\mathrm{eff}} = n / \mathrm{infl}$ (and analogously for any proxy count used to parameterize predictive dispersion). Larger infl yields wider envelopes without changing the pooled mean, providing a monotone knob for conservatism.

Diagnostic guidance (optional). As a simple diagnostic of calibration-induced variability, one can compute fold rates $\hat{r}^{(-i)}$ under each LOO-calibrated rule and their empirical variance:
$$\bar{r}^{\mathrm{LOO}} = \frac{1}{n} \sum_{i=1}^n \hat{r}^{(-i)}, \qquad \widehat{\mathrm{Var}}^{\mathrm{LOO}} = \frac{1}{n - 1} \sum_{i=1}^n \big(\hat{r}^{(-i)} - \bar{r}^{\mathrm{LOO}}\big)^2.$$
In our experiments this variance aligns qualitatively with inflation levels needed for conservative envelope coverage relative to the two-stage reference.

D.4 Predictive envelope constructions from LOO proxies

We describe two complementary envelope constructions built from LOO proxy indicators. The first mirrors the two-stage reference and is our primary approximation; the second is a conservative guardrail.

Predictive Beta–Binomial envelopes (primary approximation). Let $Z_i$ denote a pooled proxy indicator for a fixed KPI (suppressing $j$), and let $k_{\mathrm{pool}} = \sum_{i=1}^n Z_i$ with pooled proxy rate $\hat{p} = k_{\mathrm{pool}} / n$. Apply inflation via
$$n_{\mathrm{eff}} = \frac{n}{\mathrm{infl}}, \qquad k_{\mathrm{eff}} = \hat{p}\, n_{\mathrm{eff}} = \frac{k_{\mathrm{pool}}}{\mathrm{infl}}.$$
With a small prior offset $\mathrm{offset} \in \{1, 1/2\}$, define
$$\alpha := k_{\mathrm{eff}} + \mathrm{offset}, \qquad \beta := (n_{\mathrm{eff}} - k_{\mathrm{eff}}) + \mathrm{offset}.$$
For a future deployment window of size $m$, the predictive count $S_m$ is modeled as
$$S_m \sim \mathrm{BetaBinomial}(m, \alpha, \beta),$$
and equal-tailed prediction intervals follow from the Beta–Binomial CDF. Increasing infl shrinks $n_{\mathrm{eff}}$ and widens intervals monotonically while preserving the pooled mean.

Hoeffding-type dominance bound (guardrail).
As a conserv ative alternativ e, Ho effding’s inequality giv es a distribution-free b ound for a window rate b r m around a proxy mean b r LOO : Pr( | b r m − b r LOO | ≥ ϵ ) ≤ 2 exp( − 2 mϵ 2 ) , yielding simple symmetric env elop es that can b e used as a worst-case planning chec k. Summary . Single-sample reuse induces structural dep endence through order-statistic thresholding (Lemma 3), primarily distorting predictive disp ersion . LOO recalibration reduces self-influence and em- pirically reco v ers audit-st yle b eha vior in our regimes. Predictive env elop es are then constructed from p o oled LOO pro xies using a Beta–Binomial mo del, with infl providing a monotone knob for con trolled p essimiza- tion and a Ho effding b ound serving as a conserv ative guardrail. E Bina ry confo rmal partitions: regimes, coupling, and rate p rimitives This app endix records the geometric facts used in Section 2, Section 3, and Section 4. W e work in the c alibr ation-c onditional viewp oint: thresholds are treated as fixed (conditional on the realized calibration dra w), and all probabilities are taken with resp ect to the deploymen t distribution conditional on D cal . E.1 Mondrian split conformal as a four-region partition Let Y = { 0 , 1 } and let s ( x, y ) be a nonconformit y score. Mondrian (class-conditional) split conformal pro duces thresholds τ 0 , τ 1 , and the set-v alued output is C ( x ) = { 0 : s ( x, 0) ≤ τ 0 } ∪ { 1 : s ( x, 1) ≤ τ 1 } , (2) equiv alently represented by the region lab el R τ ( x ) :=  1 { s ( x, 0) ≤ τ 0 } , 1 { s ( x, 1) ≤ τ 1 }  ∈ { 0 , 1 } 2 , τ = ( τ 0 , τ 1 ) . W riting ( s 0 , s 1 ) = ( s ( x, 0) , s ( x, 1)) , the thresholds partition score space into R τ ( x ) =          11 , s 0 ≤ τ 0 , s 1 ≤ τ 1 (doublet) , 10 , s 0 ≤ τ 0 , s 1 > τ 1 (singleton { 0 } ) , 01 , s 0 > τ 0 , s 1 ≤ τ 1 (singleton { 1 } ) , 00 , s 0 > τ 0 , s 1 > τ 1 (absten tion) . 
For any fixed $\tau$,
$$\sum_{r \in \{00, 01, 10, 11\}} \mu_r(\tau) = 1, \qquad \mu_r(\tau) := \Pr(R_\tau(X) = r \mid D_{\mathrm{cal}}).$$

E.2 Probability-normalized scores and a sharp regime boundary

Many probabilistic classifiers induce probability-normalized scores, e.g. $s(x, y) = 1 - P(y \mid x)$. For $\mathcal{Y} = \{0, 1\}$ this implies $s(x, 0) + s(x, 1) = 1$, so feasible score pairs lie on the diagonal manifold $\mathcal{M} = \{(u, 1 - u) : u \in [0, 1]\}$. Intersecting $\mathcal{M}$ with the threshold rectangles yields a sharp boundary that determines which region types can occur with nonzero mass.

Proposition 4 (Regime boundary under probability normalization). Assume $(s_0, s_1) \in \mathcal{M}$ almost surely. Then:
1. $R_{11}$ has nonempty intersection with $\mathcal{M}$ iff $\tau_0 + \tau_1 \ge 1$, and has positive-length intersection iff $\tau_0 + \tau_1 > 1$.
2. $R_{00}$ has nonempty intersection with $\mathcal{M}$ iff $\tau_0 + \tau_1 \le 1$, and has positive-length intersection iff $\tau_0 + \tau_1 < 1$.
3. On the boundary $\tau_0 + \tau_1 = 1$, both $R_{11}$ and $R_{00}$ intersect $\mathcal{M}$ at a single point; hence under any continuous distribution on $\mathcal{M}$ they have probability zero and only the singleton regions carry mass.

Proof. Parameterize $\mathcal{M}$ by $u = s_0 \in [0, 1]$, so $s_1 = 1 - u$. $R_{11}$ requires $u \le \tau_0$ and $1 - u \le \tau_1$, i.e. $u \in [1 - \tau_1, \tau_0]$. This interval has positive length iff $1 - \tau_1 < \tau_0$, i.e. $\tau_0 + \tau_1 > 1$, and degenerates to a point at equality. $R_{00}$ requires $u > \tau_0$ and $1 - u > \tau_1$, i.e. $u \in (\tau_0, 1 - \tau_1)$. This interval has positive length iff $\tau_0 < 1 - \tau_1$, i.e. $\tau_0 + \tau_1 < 1$, and degenerates to a point at equality.

Crossing the affine boundary $\tau_0 + \tau_1 = 1$ therefore removes an entire region label (doublet or abstention) from the support of $R_\tau(X)$ under probability normalization, explaining sharp regime changes in attainable operating behavior.
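Proposition 4 can be exercised numerically: sweeping $u = s_0$ over a grid on the manifold $\mathcal{M}$ shows that the doublet label has support only when $\tau_0 + \tau_1 > 1$ and the abstention label only when $\tau_0 + \tau_1 < 1$ (a sketch with our own helper name, reusing the two-bit labeling of Appendix E.1):

```python
def regions_on_manifold(tau0, tau1, n_grid=2001):
    """Region labels that occur for probability-normalized scores
    (s0, s1) = (u, 1 - u) over a fine grid of u in [0, 1]."""
    labels = set()
    for i in range(n_grid):
        u = i / (n_grid - 1)
        labels.add((int(u <= tau0), int(1 - u <= tau1)))
    return labels

# Hedging regime (tau0 + tau1 > 1): doublet (1,1) occurs, abstention (0,0) does not.
print(regions_on_manifold(0.7, 0.7))
# Abstention regime (tau0 + tau1 < 1): (0,0) occurs, (1,1) does not.
print(regions_on_manifold(0.3, 0.3))
```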
E.3 Cross-threshold dominance in the hedging regime

Within the hedging regime $\tau_0 + \tau_1 > 1$, the manifold $\mathcal{M}$ is partitioned into three contiguous intervals corresponding to $\{10, 11, 01\}$. With $u = s(x, 0)$:
$$u \in [0, 1 - \tau_1) \Rightarrow R_\tau(x) = 10, \qquad u \in [1 - \tau_1, \tau_0] \Rightarrow R_\tau(x) = 11, \qquad u \in (\tau_0, 1] \Rightarrow R_\tau(x) = 01.$$
Hence each singleton region is controlled primarily by the opposing threshold: increasing $\tau_1$ expands $11$ at the expense of $10$, while increasing $\tau_0$ expands $11$ at the expense of $01$. Operationally, $(\tau_0, \tau_1)$ act as mass-reallocation boundaries, not independent per-class knobs.

E.4 Region–label primitives and rate factorizations

For auditing and planning, the primitive object is the region–label table
$$p_{r,y}(\tau) := \Pr(R_\tau(X) = r,\ Y = y \mid D_{\mathrm{cal}}), \qquad r \in \{00, 01, 10, 11\},\ y \in \{0, 1\}.$$
Two derived summaries are
$$\mu_r(\tau) := \Pr(R_\tau(X) = r \mid D_{\mathrm{cal}}) = \sum_{y \in \{0, 1\}} p_{r,y}(\tau), \qquad \eta_r(\tau) := \Pr(Y = 1 \mid R_\tau(X) = r, D_{\mathrm{cal}}) = \frac{p_{r,1}(\tau)}{\mu_r(\tau)} \quad (\mu_r(\tau) > 0).$$
Any region-measurable KPI is a projection of $\{p_{r,y}(\tau)\}$; see Section 2.4.

Example: decisive error masses under "commit-on-singletons". Consider the convention $10 \mapsto 0$, $01 \mapsto 1$, $11, 00 \mapsto \mathrm{defer}$. Then the decisive false-negative and false-positive masses are
$$\mathrm{FN}_{\mathrm{dec}}(\tau) = \Pr(R_\tau(X) = 10,\ Y = 1 \mid D_{\mathrm{cal}}) = p_{10,1}(\tau) = \mu_{10}(\tau)\, \eta_{10}(\tau),$$
$$\mathrm{FP}_{\mathrm{dec}}(\tau) = \Pr(R_\tau(X) = 01,\ Y = 0 \mid D_{\mathrm{cal}}) = p_{01,0}(\tau) = \mu_{01}(\tau)\, \big[1 - \eta_{01}(\tau)\big].$$
The decisive mass is $\mu_{10}(\tau) + \mu_{01}(\tau)$, while the defer mass is $\mu_{11}(\tau) + \mu_{00}(\tau)$ (with feasibility of $11$ versus $00$ governed by Proposition 4 under probability normalization).

F Cost-coherence and inverse pricing envelopes induced by a fixed conformal interface

This appendix formalizes the operational point that distribution-free calibration is not automatically cost-agnostic once conformal outputs are wired into actions.
Fix thresholds $\tau$ (calibration-conditional viewpoint) and treat the induced finite region label $R_\tau(X) \in \mathcal{R}$ as the deployed observable. Any downstream rule that uses only this interface can depend on the data only through $r \in \mathcal{R}$, so whether the convention is coherent with a cost model is evaluated region-wise using within-region label frequencies (we formalize this as cost-coherence below). We cast this as an inverse pricing problem: given a fixed (interface, convention) pair, characterize the set of consequence prices under which the convention is coherent.

F.1 Interface primitives: masses and within-region label frequencies

Specialize to $\mathcal{Y} = \{0, 1\}$ and $\mathcal{R} = \{00, 01, 10, 11\}$ as in Appendix E. Adopt the calibration-conditional joint table $p_{r,y} := \Pr(R_\tau(X) = r,\ Y = y \mid D_{\mathrm{cal}})$. Define region mass and within-region label frequency
$$\mu_r := p_{r,0} + p_{r,1}, \qquad \eta_r := \Pr(Y = 1 \mid R_\tau(X) = r, D_{\mathrm{cal}}) = \frac{p_{r,1}}{\mu_r} \quad (\mu_r > 0).$$
Because $\mathcal{R}$ is finite, conditional uncertainty about $Y$ is piecewise constant: within region $r$, all decision comparisons reduce to $\eta_r$.

F.2 Region-measurable action conventions and coherence

Let $\mathcal{A}$ be a finite action set and let $L_\theta(a, y)$ be a priced consequence (cost or negative utility), parameterized by $\theta \in \Theta$. A deployed convention is a region-measurable policy $\tilde{\pi} : \mathcal{R} \to \mathcal{A}$. For region $r$ with $\mu_r > 0$, the interface-relative conditional risk of action $a$ is
$$C_\theta(a \mid r) := \mathbb{E}\big[L_\theta(a, Y) \mid R_\tau(X) = r, D_{\mathrm{cal}}\big] = (1 - \eta_r)\, L_\theta(a, 0) + \eta_r\, L_\theta(a, 1).$$

Cost-coherence (interface-relative optimality). We say $\tilde{\pi}$ is cost-coherent on the deployed interface under pricing $\theta$ if for every region with $\mu_r > 0$,
$$C_\theta\big(\tilde{\pi}(r) \mid r\big) \le C_\theta(a \mid r) \qquad \forall a \in \mathcal{A}.$$
This is the dominance condition induced by acting only on $R_\tau(X)$.
F.3 Inverse pricing envelope

For each region with $\mu_r > 0$, define the local feasibility set
$$\Theta_r(\tilde{\pi}) := \left\{ \theta \in \Theta : C_\theta(\tilde{\pi}(r) \mid r) \le C_\theta(a \mid r) \;\; \forall a \in \mathcal{A} \right\},$$
and the global pricing envelope
$$\Theta(\tilde{\pi}) := \bigcap_{r :\, \mu_r > 0} \Theta_r(\tilde{\pi}).$$
Equivalently, $\Theta(\tilde{\pi})$ is the set of price parameters for which the forced action in each region is cost-coherent given only the conformal output. Since only action comparisons matter, envelopes are naturally reported in ratio (projective) coordinates: global positive scaling of $L_\theta$ is irrelevant, and adding offsets independent of $a$ preserves comparisons.

F.4 Worked case: Chow-style reject option and "commit on singletons"

Consider $\mathcal{A} = \{0, 1, \mathrm{rej}\}$ with Chow-style costs (false negative $c_{01} > 0$, false positive $c_{10} > 0$, rejection $c_{\mathrm{rej}} \ge 0$):
$$L(0, 1) = c_{01}, \quad L(1, 0) = c_{10}, \quad L(\mathrm{rej}, y) = c_{\mathrm{rej}}, \quad L(0, 0) = L(1, 1) = 0.$$
In region $r$ the conditional risks are
$$C(0 \mid r) = \eta_r c_{01}, \qquad C(1 \mid r) = (1 - \eta_r)\, c_{10}, \qquad C(\mathrm{rej} \mid r) = c_{\mathrm{rej}}.$$
Work in ratios
$$\lambda := \frac{c_{01}}{c_{10}}, \qquad \rho := \frac{c_{\mathrm{rej}}}{c_{10}},$$
so only $(\lambda, \rho)$ matters for comparisons.

Convention. $\tilde{\pi}(10) = 0$, $\tilde{\pi}(01) = 1$, $\tilde{\pi}(11) = \mathrm{rej}$, $\tilde{\pi}(00) = \mathrm{rej}$. Coherence holds iff the region-wise dominance inequalities below hold for all regions with $\mu_r > 0$.

Singleton region 01 (output $\{1\}$). Choosing 1 must beat both 0 and rej:
$$(1 - \eta_{01}) \le \eta_{01} \lambda \iff \eta_{01} \ge \frac{1}{1 + \lambda}, \qquad (1 - \eta_{01}) \le \rho \iff \eta_{01} \ge 1 - \rho.$$
Thus $\eta_{01} \ge \max\!\left\{ \frac{1}{1 + \lambda},\, 1 - \rho \right\}$.

Singleton region 10 (output $\{0\}$). Choosing 0 must beat both 1 and rej:
$$\eta_{10} \lambda \le (1 - \eta_{10}) \iff \eta_{10} \le \frac{1}{1 + \lambda}, \qquad \eta_{10} \lambda \le \rho \iff \eta_{10} \le \frac{\rho}{\lambda}.$$
Thus $\eta_{10} \le \min\!\left\{ \frac{1}{1 + \lambda},\, \frac{\rho}{\lambda} \right\}$.

Rejection regions $r \in \{11, 00\}$. Rejecting must beat both commitments:
$$\rho \le \eta_r \lambda \iff \eta_r \ge \frac{\rho}{\lambda}, \qquad \rho \le (1 - \eta_r) \iff \eta_r \le 1 - \rho.$$
Hence rejection is optimal on region $r$ only if
$$\frac{\rho}{\lambda} \le \eta_r \le 1 - \rho.$$
The rejection band is nonempty only if
$$\frac{\rho}{\lambda} \le 1 - \rho \iff \rho \le \frac{\lambda}{1 + \lambda} \iff c_{\mathrm{rej}} \le \frac{c_{01} c_{10}}{c_{01} + c_{10}}.$$
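These closed-form bounds are easy to evaluate numerically. A sketch in ratio coordinates, using hypothetical values of $(\lambda, \rho)$ and of the within-region frequencies $\eta_r$ (not values from the paper):

```python
# Ratio coordinates: lambda = c01/c10, rho = c_rej/c10 (illustrative values).
lam, rho = 5.0, 0.4
# Hypothetical within-region positive-class frequencies eta_r.
eta = {"10": 0.05, "01": 0.92, "11": 0.45, "00": 0.50}

# Region-wise dominance inequalities from the worked case:
ok_01 = eta["01"] >= max(1.0 / (1.0 + lam), 1.0 - rho)   # commit to 1 on {1}
ok_10 = eta["10"] <= min(1.0 / (1.0 + lam), rho / lam)   # commit to 0 on {0}
ok_rej = all(rho / lam <= eta[r] <= 1.0 - rho for r in ("11", "00"))

# The rejection band [rho/lam, 1 - rho] is nonempty iff rho <= lam/(1 + lam).
band_nonempty = rho <= lam / (1.0 + lam)

coherent = ok_01 and ok_10 and ok_rej  # (lam, rho) lies in the envelope
```

Sweeping `(lam, rho)` over a grid and recording where `coherent` holds traces out the pricing envelope $\Theta(\tilde{\pi})$ for a fixed set of interface frequencies.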
The inequalities above depend only on the interface primitives $\{\eta_r\}$ (and on which regions have $\mu_r > 0$). Thus, for a fixed calibrated conformal predictor and a fixed wiring convention $\tilde{\pi}$, coherence restricts the admissible cost ratios $(\lambda, \rho)$. Equivalently, $(\lambda, \rho)$ defines half-space constraints in the plane whose intersection is the pricing envelope $\Theta(\tilde{\pi})$ (possibly empty). This is the operational content of (C3): calibration is distribution-free, but coherent downstream use is constrained by the coarse information carried by the conformal partition.

G Explicit region indicators and projection masks

This appendix instantiates the linear "sum selected cells" formalism from Section 2.4 in the binary toy geometry used in Figure 2 and Figure 3 ($R_\tau(x) \xrightarrow{\pi} C(x)$). Region feasibility and regime facts are recorded in Appendix E; here we keep only the explicit projection masks used in audit computations.

G.1 Region labels and the region–label table

Assume $\mathcal{Y} = \{0, 1\}$ and thresholds $\tau = (\tau_0, \tau_1)$. Define the deployed region label
$$R_\tau(x) = (r_0(x), r_1(x)) \in \{0, 1\}^2, \qquad r_y(x) := \mathbf{1}\{s(x, y) \le \tau_y\},$$
with region names $\mathcal{R} = \{r_{10}, r_{11}, r_{01}, r_{00}\}$ as in Appendix E.1. For a calibration setting $\theta$ (indexing deployed thresholds $\tau(\theta)$), define the calibration-conditional region–label table
$$P(\theta) = \big(p_{r,y}(\theta)\big)_{r \in \mathcal{R},\, y \in \mathcal{Y}}, \qquad p_{r,y}(\theta) = \Pr\big(R_{\tau(\theta)}(X) = r,\, Y = y \mid D_{\mathrm{cal}}\big).$$
A linear operational rate is obtained by summing the subset of cells $(r, y)$ that define the event.

G.2 Policies used in Figure 3

A policy $\pi$ maps region labels to prediction sets $C(x) \subseteq \{0, 1\}$: $R_\tau(x) \xrightarrow{\pi} C(x)$. The three policies used in Figure 3 are:

Set inclusion $\pi_{SI}$. $\pi_{SI}(r_{10}) = \{0\}$, $\pi_{SI}(r_{11}) = \{0, 1\}$, $\pi_{SI}(r_{01}) = \{1\}$, $\pi_{SI}(r_{00}) = \emptyset$.

Commit–reject $\pi_{CR}$. $\pi_{CR}(r_{10}) = \{0\}$, $\pi_{CR}(r_{01}) = \{1\}$, $\pi_{CR}(r_{11}) = \emptyset$, $\pi_{CR}(r_{00}) = \emptyset$.
Set exclusion $\pi_{SE}$ (complement of set inclusion). $\pi_{SE}(r_{10}) = \{1\}$, $\pi_{SE}(r_{11}) = \emptyset$, $\pi_{SE}(r_{01}) = \{0\}$, $\pi_{SE}(r_{00}) = \{0, 1\}$.

G.3 Binary projection masks

Fix a policy $\pi$. For any event/quantity $\ell$, define a $4 \times 2$ binary mask
$$G_\ell(\pi) = \big(g_\ell(r, y; \pi)\big)_{r \in \mathcal{R},\, y \in \mathcal{Y}} \in \{0, 1\}^{4 \times 2},$$
where $g_\ell(r, y; \pi) = 1$ indicates that the cell $(r, y)$ is included in the sum. Then the corresponding rate is
$$r_\ell(\theta) = \sum_{r \in \mathcal{R}} \sum_{y \in \mathcal{Y}} g_\ell(r, y; \pi)\, p_{r,y}(\theta).$$
We list four masks used repeatedly in the paper.

(1) Coverage under set inclusion. Under $\pi_{SI}$, coverage is the event $\{Y \in C(X)\}$. Cell by cell, coverage holds for (i) $y = 0$ in regions $r_{10}$ or $r_{11}$, and (ii) $y = 1$ in regions $r_{01}$ or $r_{11}$. Hence
$$G_{\mathrm{cov}}(\pi_{SI}) = \begin{array}{c|cc} & y=0 & y=1 \\ \hline r_{10} & 1 & 0 \\ r_{11} & 1 & 1 \\ r_{01} & 0 & 1 \\ r_{00} & 0 & 0 \end{array}$$
and therefore
$$r_{\mathrm{cov}}(\theta) = p_{10,0}(\theta) + p_{11,0}(\theta) + p_{11,1}(\theta) + p_{01,1}(\theta).$$
Equivalently, coverage fails on the two singleton mistakes $(r_{10}, 1)$ and $(r_{01}, 0)$ and on all abstentions $r_{00}$.

(2) Missed positive mass (Figure 1: $q_{10}$ under $\pi_{SI}$). In Figure 1, $q_{10}$ is the positive-class mass in region $r_{10}$ (output $\{0\}$ under $\pi_{SI}$):
$$q_{10}(\theta) = \Pr(Y = 1,\, C(X) = \{0\}) = \Pr(Y = 1,\, R_\tau(X) = r_{10}) = p_{10,1}(\theta),$$
with single-cell mask
$$G_{10}(\pi_{SI}) = \begin{array}{c|cc} & y=0 & y=1 \\ \hline r_{10} & 0 & 1 \\ r_{11} & 0 & 0 \\ r_{01} & 0 & 0 \\ r_{00} & 0 & 0 \end{array}$$

(3) Hedged positive mass (Figure 1: $q_{11}$ under $\pi_{SI}$). Similarly,
$$q_{11}(\theta) = \Pr(Y = 1,\, C(X) = \{0, 1\}) = \Pr(Y = 1,\, R_\tau(X) = r_{11}) = p_{11,1}(\theta),$$
with mask
$$G_{11}(\pi_{SI}) = \begin{array}{c|cc} & y=0 & y=1 \\ \hline r_{10} & 0 & 0 \\ r_{11} & 0 & 1 \\ r_{01} & 0 & 0 \\ r_{00} & 0 & 0 \end{array}$$

(4) Abstention mass under the commit–reject policy. Under $\pi_{CR}$, abstention occurs in regions $r_{11}$ and $r_{00}$, regardless of label. Thus
$$G_{\mathrm{abs}}(\pi_{CR}) = \begin{array}{c|cc} & y=0 & y=1 \\ \hline r_{10} & 0 & 0 \\ r_{11} & 1 & 1 \\ r_{01} & 0 & 0 \\ r_{00} & 1 & 1 \end{array}$$
and
$$r_{\mathrm{abs}}(\theta; \pi_{CR}) = \big[p_{11,0}(\theta) + p_{11,1}(\theta)\big] + \big[p_{00,0}(\theta) + p_{00,1}(\theta)\big].$$
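In code, the mask formalism reduces every linear rate to an elementwise sum over the region–label table. A minimal sketch with a hypothetical table $p_{r,y}$ standing in for audit-set estimates:

```python
# Regions of the binary interface and a hypothetical region-label table
# p[r][y] (in practice estimated from region-label counts on the audit set).
REGIONS = ("r10", "r11", "r01", "r00")
p = {
    "r10": {0: 0.30, 1: 0.02},
    "r11": {0: 0.20, 1: 0.25},
    "r01": {0: 0.03, 1: 0.18},
    "r00": {0: 0.01, 1: 0.01},
}

def rate(mask, table):
    """Linear rate r_ell(theta) = sum of selected cells g(r, y) * p_{r,y}."""
    return sum(mask[r][y] * table[r][y] for r in REGIONS for y in (0, 1))

# Mask (1): coverage under set inclusion pi_SI.
G_cov = {"r10": {0: 1, 1: 0}, "r11": {0: 1, 1: 1},
         "r01": {0: 0, 1: 1}, "r00": {0: 0, 1: 0}}
# Mask (4): abstention under the commit-reject policy pi_CR.
G_abs = {"r10": {0: 0, 1: 0}, "r11": {0: 1, 1: 1},
         "r01": {0: 0, 1: 0}, "r00": {0: 1, 1: 1}}

coverage = rate(G_cov, p)   # p_{10,0} + p_{11,0} + p_{11,1} + p_{01,1}
abstain = rate(G_abs, p)    # total mass of regions r11 and r00
```

Conditional diagnostics such as singleton purity (Appendix G.4) are then ratios of two such sums.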
G.4 Ratios of projections (conditional diagnostics)

Some diagnostics are conditional probabilities and therefore ratios of linear sums. For example, the purity of singleton-1 outputs under $\pi_{SI}$ is
$$\mathrm{Purity}_1(\theta) := \Pr\big(Y = 1 \mid C(X) = \{1\}\big) = \frac{\Pr(Y = 1,\, C(X) = \{1\})}{\Pr(C(X) = \{1\})} = \frac{p_{01,1}(\theta)}{p_{01,0}(\theta) + p_{01,1}(\theta)}.$$

Audit computation. All entries of $P(\theta)$ are estimated from region–label counts on the audit set. Linear KPIs are computed by summing selected cells; conditional diagnostics are computed as ratios of such sums.

H Tox21 supplementary details

This appendix provides supplementary documentation for the Tox21 experiments reported in Section 5.2. The purpose of this appendix is reproducibility and contextualization rather than extension of results or additional claims.

H.1 Tox21 dataset

The Tox21 benchmark consists of binary toxicity outcomes for twelve biological assays spanning nuclear receptor (NR) signaling and cellular stress response (SR) pathways. Each compound is labeled as active or inactive per assay. Labels are sparse and highly imbalanced, particularly for nuclear receptor targets. The full dataset composition used in this study is reported in the main text (Table 2). We do not duplicate that table here.

H.2 Aggregated coverage summary

The aggregated coverage and prediction-set summary is reported in the main text (Table 3); we do not duplicate that table here.

H.3 Representative endpoint-level operational summaries

The SR-MMP endpoint summary is reported in the main text (Table 4). We keep one additional representative endpoint (NR-AR) here to document a complementary low-prevalence regime. Because conformal calibration is performed in a class-conditional (Mondrian) manner, the effective calibration size for the positive class governs feasibility and discretization effects.
Several assays operate with fewer than 100 positive calibration samples, placing them in the small-sample regime targeted by SSBC.

H.4 Representation and model protocol

All Tox21 experiments use a fixed, task-agnostic molecular representation (descriptors plus Morgan fingerprints) and a fixed CatBoost training protocol (1000 iterations, depth 6, learning rate 0.1, log-loss), with no assay-specific feature engineering, class reweighting, or resampling. Molecules are parsed and sanitized using standard cheminformatics tooling, and compounds with invalid features are excluded.

Table 8: NR-AR endpoint. Operational performance summary for the NR-AR endpoint. All reported quantities are joint probabilities normalized by the total test-set size. Rows report the singleton rate, doublet rate, and wrong-singleton rate ($P(Y = c,\, |S| = 1,\, \hat{y} \ne Y)$) by true class. "Point estimate D1" and "LOO 95% PI D1" denote leave-one-out point estimates and their exact beta–binomial predictive intervals computed on the calibration data. "Observed D2" reports empirical rates on an independent hold-out set. "BB 95% PI D2" gives beta–binomial predictive intervals for the corresponding D2 quantities, accounting for calibration uncertainty and finite test size.

| Operational quantity | Class | Point estimate D1 | LOO 95% PI D1 | Observed D2 | BB 95% PI D2 |
|---|---|---|---|---|---|
| Singleton rate | Class 0 | 0.163 | [0.125, 0.205] | 0.133 | [0.112, 0.157] |
| | Class 1 | 0.026 | [0.011, 0.047] | 0.029 | [0.019, 0.041] |
| Doublet rate | Class 0 | 0.796 | [0.751, 0.838] | 0.818 | [0.792, 0.843] |
| | Class 1 | 0.015 | [0.005, 0.031] | 0.020 | [0.012, 0.030] |
| Wrong-singleton rate | Class 0 | 0.086 | [0.058, 0.119] | 0.060 | [0.045, 0.077] |
| | Class 1 | 0.002 | [0.000, 0.010] | 0.001 | [0.000, 0.004] |
This intentionally non-optimized setup isolates conformal calibration and operational-envelope behavior from model-architecture tuning; observed AUROC values (approximately 0.80–0.85) are consistent with baseline Tox21 reports.

H.5 Conformal calibration and evaluation

All conformal predictors are calibrated in a class-conditional (Mondrian) fashion, with separate calibration sets for the positive and negative classes. For each run, one of three calibration strategies is applied: standard split conformal, DKWM-based correction, or SSBC, targeting $(\alpha, \delta) = (0.10, 0.10)$. Calibration thresholds are computed once per run and evaluated on a held-out test set. Operational rates (singleton and doublet frequencies) are estimated using leave-one-out cross-validation on the calibration set. These LOO estimates are used to construct prediction intervals for future deployment batches of fixed size, as described in the main text.

I Solubility supplementary details

This appendix records methodological details for the solubility scenario-planning experiments in Section 5.3. The purpose is reproducibility and documentation.

I.1 Operational outcome definitions

Operational outcomes are defined in the main text via the binary prediction-set interface (Table 5). All reported quantities in the scenario-planning analysis are joint rates over the set–label outcomes $P(Y = y, C = c)$, scaled to expected counts over a window of $m = 1000$ molecules.

I.2 Pareto-front artifact and representative regimes

The full deduplicated Pareto front used in Section 5.3 is provided as a CSV artifact (pareto_full.csv). Each row corresponds to a distinct Pareto-optimal operating point after deduplication in KPI space.
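The predictive-envelope endpoints attached to each joint rate can be sketched with an exact Beta–Binomial computation. The helper below is a simplified stand-in, assuming a uniform Beta(1, 1) prior and a plain count of $k$ event occurrences among $n$ audit samples; the paper's construction layers LOO estimation on top of this, which is omitted here:

```python
from math import exp, lgamma

def log_beta(a, b):
    """log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(j, m, a, b):
    """P(K = j) for K ~ BetaBinomial(m, a, b), computed in log space."""
    log_pmf = (lgamma(m + 1) - lgamma(j + 1) - lgamma(m - j + 1)
               + log_beta(j + a, m - j + b) - log_beta(a, b))
    return exp(log_pmf)

def predictive_envelope(k, n, m, level=0.95):
    """Central predictive interval for the count of an outcome in a future
    window of m draws, given k occurrences among n audit samples, under a
    uniform Beta(1, 1) prior (an assumption of this sketch)."""
    a, b = k + 1, n - k + 1
    tail = (1.0 - level) / 2.0
    cdf, lo, hi = 0.0, None, None
    for j in range(m + 1):
        cdf += beta_binomial_pmf(j, m, a, b)
        if lo is None and cdf >= tail:
            lo = j
        if hi is None and cdf >= 1.0 - tail:
            hi = j
            break
    return lo, hi
```

For example, `predictive_envelope(5, 100, 1000)` yields an envelope for the event count per 1000 future draws that is substantially wider than a binomial interval at the plug-in rate, reflecting calibration uncertainty at small $n$.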
The CSV records (i) the SSBC effective levels $(\alpha_{\mathrm{SSBC},0}, \alpha_{\mathrm{SSBC},1})$, (ii) the nominal request parameters $(\alpha_0, \delta_0, \alpha_1, \delta_1)$ that induced the operating point, (iii) the six joint outcome rates $P(Y = y, C = c)$, and (iv) predictive-envelope endpoints for each joint rate (LB/UB), reported on the same probability scale and then converted below to counts per 1000.

Table 9 shows two representative Pareto regimes drawn from the provided Pareto front to illustrate contrasting planning postures: a loss-minimizing regime (low irreversible exclusion of soluble compounds) and a decisiveness-maximizing regime (low total hedging). Counts are expected events per 1000 molecules; brackets denote the corresponding predictive envelopes.

Table 9: Two representative Pareto-optimal operating regimes for the solubility scenario, illustrating contrasting planning postures. Rates are expected counts per 1000 molecules, with predictive envelopes in brackets (derived from pareto_full.csv).

| | Loss-minimizing regime | High-decisiveness regime |
|---|---|---|
| $\alpha_0$ | 0.125 | 0.150 |
| $\delta_0$ | 0.150 | 0.150 |
| $\alpha_1$ | 0.050 | 0.150 |
| $\delta_1$ | 0.075 | 0.100 |
| Loss rate | 3 [0, 35] | 12 [0, 58] |
| Waste rate | 70 [27, 130] | 13 [0, 49] |
| Total hedge rate | 785 [663, 896] | 514 [376, 675] |
| Correct soluble singleton rate | 7 [0, 38] | 88 [40, 155] |
| Correct insoluble singleton rate | 135 [68, 215] | 282 [196, 375] |
| Decisiveness ($1 -$ hedge) | 215 | 486 |

I.3 Dataset and label construction

We use AqSolDB as the source of aqueous solubility measurements ($\log S$), following the curation and quality-control procedures described in the dataset reference (Sorkun et al., 2019). For model training and scoring we discretize $\log S$ into three regimes with fixed thresholds: Insoluble ($\log S < -4$), Moderate ($-4 \le \log S < -2$), and Soluble ($\log S \ge -2$), defining a three-class label space $\mathcal{Y} = \{0, 1, 2\}$ (Kalepu & Nekkanti, 2015).
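The fixed thresholds above translate directly into a labeling function. A minimal sketch, assuming the index assignment 0 = Insoluble, 1 = Moderate, 2 = Soluble, together with the binary merge used by the planning analysis in Section 5.3:

```python
def solubility_class(log_s: float) -> int:
    """Map a log S value to Y = {0, 1, 2} using the fixed thresholds:
    Insoluble (log S < -4), Moderate (-4 <= log S < -2), Soluble (log S >= -2).
    The index order 0/1/2 = Insoluble/Moderate/Soluble is an assumption."""
    if log_s < -4.0:
        return 0
    if log_s < -2.0:
        return 1
    return 2

def binary_label(log_s: float) -> str:
    """Binary planning interface: Moderate and Soluble merge into 'Sol'."""
    return "Insol" if solubility_class(log_s) == 0 else "Sol"
```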
I.4 Partitioning: scaffold-isolated training and exchangeable evaluation

We use a two-stage split to separate generalization across chemical space from the exchangeable design required for conformal calibration and finite-window operational analysis:

• Scaffold grouping. Molecules are grouped by Bemis–Murcko scaffolds (Bemis & Murcko, 1996) computed from RDKit (Landrum, 2025) SMILES.

• Scaffold-isolated training set. Scaffold groups are randomly shuffled and assigned to the training set until approximately 70% of molecules are allocated, ensuring no scaffold in training appears in the remaining pool.

• Exchangeable calibration/test sets. The remaining scaffold-disjoint molecules are split into calibration and test sets by a stratified random split (50/50 of the remaining pool), preserving class proportions. These two sets are treated as exchangeable with respect to each other and to future draws under the evaluation regime in Section 5.3.

I.5 Molecular representation

Molecules are represented using a fixed descriptor set supplemented with Morgan fingerprints (Rogers & Hahn, 2010). Representation is kept task-agnostic; no assay- or regime-specific feature engineering is performed. Molecules that fail parsing or yield invalid descriptor values are excluded.

I.6 Model and training protocol

We train a CatBoost (Prokhorenkova et al., 2018) gradient-boosted decision tree classifier on the scaffold-isolated training set with fixed hyperparameters (depth 6, learning rate 0.05, multi-class loss), up to 1000 boosting iterations with early stopping (patience 50) using calibration-set performance. After training, the predictor is frozen and treated as infrastructure; the experiments study the conformal and operational layers rather than optimizing base accuracy.
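The two-stage split of Appendix I.4 can be sketched with a generic group-based allocator. Here scaffolds are precomputed strings keyed by molecule id (in the paper they are Bemis–Murcko scaffolds computed with RDKit), and `scaffold_isolated_split` is a hypothetical helper, not code from the paper:

```python
import random

def scaffold_isolated_split(scaffold_by_mol, frac_train=0.70, seed=0):
    """Assign whole scaffold groups to training until ~frac_train of the
    molecules are allocated; everything else stays in the remaining pool
    (later split 50/50 into calibration and test)."""
    groups = {}
    for mol, scaf in scaffold_by_mol.items():
        groups.setdefault(scaf, []).append(mol)
    order = list(groups)
    random.Random(seed).shuffle(order)  # randomize scaffold-group order
    target = frac_train * len(scaffold_by_mol)
    train, rest, n_train = [], [], 0
    for scaf in order:
        if n_train < target:
            train.extend(groups[scaf])
            n_train += len(groups[scaf])
        else:
            rest.extend(groups[scaf])
    return train, rest
```

Because allocation is by whole groups, no scaffold in the training set appears in the remaining pool, which is what licenses treating the calibration and test halves as exchangeable with each other.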
I.7 Calibration and operationalization

Uncertainty quantification uses split conformal prediction with SSBC calibration (Section 5.1.1 and Appendix C), which selects a grid index to stabilize realized coverage semantics at user-specified confidence (Vovk, 2012a; Yang & Kuchibhotla, 2021). This calibration layer supports the finite-window predictive-envelope constructions used for operational planning.

Although the classifier is trained on three classes (Insoluble, Moderate, Soluble), the planning analysis in Section 5.3 uses a binary operational objective obtained by merging Moderate and Soluble into a single Soluble class. Conformal prediction sets are computed in the binary label space $\{\mathrm{Insol}, \mathrm{Sol}\}$ and operational quantities (loss, waste, hedging, decisiveness) are computed with respect to this binary interface.

I.8 Deployment-matched calibration via chemical tribes

Scenario planning conditions exchangeability on a deployment-defining event by restricting calibration to a chemically defined subpopulation.

Generic scaffold attribution. For interpretability we compute a generic scaffold representation by extracting the Murcko scaffold and converting it to a generic form (all atoms carbon, all bonds single). This increases scaffold group size and supports qualitative attribution across broader structural families.

Tribe construction. We define coarse chemical "tribes" using RDKit MolLogP:
$$\mathrm{Tribe\_Lipophilic}: \mathrm{MolLogP} > 3.5, \qquad \mathrm{Tribe\_Hydrophilic}: \mathrm{MolLogP} < 1.0, \qquad \mathrm{Tribe\_Neutral}: 1.0 \le \mathrm{MolLogP} \le 3.5.$$
We compute summary statistics of nonconformity and class probabilities stratified by tribe on both the calibration pool and the held-out test window, and use these diagnostics to select a focal deployment regime.

Focal deployment scenario. Section 5.3 focuses on a lipophilic deployment regime.
SSBC calibration is therefore performed on the restricted sample
$$D^{\mathrm{lip}}_{\mathrm{cal}} = \{(X_i, Y_i) \in D_{\mathrm{cal}} : \mathrm{MolLogP}(X_i) > 3.5\},$$
treated as exchangeable with respect to the intended deployment distribution.

Scenario-conditional exchangeability assumption. This restriction is scenario conditioning, not a general covariate-shift correction. We assume deployment draws satisfy the same scenario event $E = \{\mathrm{MolLogP}(X) > 3.5\}$, so calibration and deployment are exchangeable conditional on $E$. If deployment violates $E$, or if exchangeability fails within $E$, conformal validity is not guaranteed. Shift-robust validity methods are complementary and outside scope; see, e.g., Fannjiang et al. (2022).

Interpretation. Conditioning trades calibration sample size for improved deployment match; the resulting loss of statistical efficiency appears directly as stronger SSBC corrections and wider predictive envelopes, consistent with transparent planning under finite-sample uncertainty.

I.9 Functional roles of calibration parameters

The Pareto sweep in Section 5.3 varies the four nominal parameters $(\alpha_0, \delta_0, \alpha_1, \delta_1)$ (with SSBC mapping them to effective deployed grid levels). Empirically the knobs are not interchangeable: changes reallocate mass among outcome categories (loss, waste, hedging, decisiveness) along channels constrained by the underlying threshold geometry. Throughout, class 0 denotes insoluble and class 1 denotes soluble. The qualitative roles below summarize matched Pareto-optimal solutions where one parameter varies while the others are held fixed; rates are computed from the provided Pareto-front CSV and reported as events per 1000 molecules.

Role of $\alpha_0$: global conservatism against irreversible loss. Decreasing $\alpha_0$ suppresses loss by expanding hedging; increasing $\alpha_0$ collapses ambiguity into more decisive singleton predictions.
In many matched segments, the dominant effect is redistribution between hedged and singleton outcomes, with loss already near saturation.

Role of $\delta_0$: fine-scale sharpening on the insoluble side. For fixed $(\alpha_0, \alpha_1, \delta_1)$, $\delta_0$ tends to act locally, shifting mass between hedged and singleton insoluble predictions with limited impact on loss.

Role of $\alpha_1$: boundary-controlled tolerance with nonlocal effects. Although $\alpha_1$ is nominally associated with the soluble side, changing $\alpha_1$ can move the operating point across geometric boundaries, inducing nonlocal reallocations that may appear most strongly in insoluble outcomes.

Role of $\delta_1$: systematic hedge-to-insoluble reassignment. Across stable regions of the front, increasing $\delta_1$ often converts a portion of hedged mass into insoluble singletons, reflecting how the normalized score constraints and threshold geometry position the hedging interval.

Geometric interpretation. Overall, the Pareto front is shaped by geometric coupling rather than independent tuning: each knob reallocates probability mass along constrained channels determined by the threshold partition and the finite-sample SSBC adjustment.

I.10 Interpreting $\alpha$ versus $\delta$ on the Pareto front

Within SSBC the deployed threshold corresponds to an adjusted effective level $\tilde{\alpha}$ determined jointly by $(\alpha, \delta)$. Thus $\alpha$ and $\delta$ are best viewed as two parameterizations of a single underlying control (the effective threshold), with different practical resolution under finite-sample constraints. The approximate feasibility relation $\delta \approx (1 - \alpha)^N$ implies $\log \delta = N \log(1 - \alpha)$, so changes in $\delta$ correspond to fine-grained (approximately logarithmic) adjustments in $\tilde{\alpha}$, whereas changes in $\alpha$ on a coarse grid can move the operating point across qualitatively different regions of the feasible manifold.
This helps explain why $\alpha$ sweeps can trigger large reallocations among loss/waste/hedging outcomes, while $\delta$ sweeps more often act as local "sharpening" of decisions within a geometric regime.

References

Anastasios N. Angelopoulos and Stephen Bates. Conformal prediction: A gentle introduction. Foundations and Trends in Machine Learning, 16(4):494–591, 2023. doi: 10.1561/2200000101.

Anastasios N. Angelopoulos, Stephen Bates, Adam Fisch, Lihua Lei, and Tal Schuster. Conformal risk control. In International Conference on Learning Representations, 2024.

Rina Foygel Barber, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani. Predictive inference with the jackknife+. Annals of Statistics, 49(1):486–507, 2021.

Stephen Bates, Anastasios Angelopoulos, Lihua Lei, Jitendra Malik, and Michael I. Jordan. Distribution-free, risk-controlling prediction sets. Journal of the ACM, 68(6), September 2021. ISSN 0004-5411. doi: 10.1145/3478535. URL https://doi.org/10.1145/3478535.

Guy W. Bemis and Mark A. Murcko. The properties of known drugs. 1. Molecular frameworks. Journal of Medicinal Chemistry, 39(15):2887–2893, 1996. doi: 10.1021/jm9602928.

Michelle Bian and Rina Foygel Barber. Training-conditional coverage for distribution-free predictive inference. Electronic Journal of Statistics, 17(2):2044–2066, 2023. doi: 10.1214/23-EJS2145.

George Casella and Roger L. Berger. Statistical Inference. Duxbury, 2nd edition, 2002.

C. K. Chow. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16(1):41–46, 1970. doi: 10.1109/TIT.1970.1054406.

H. A. David and H. N. Nagaraja. Order Statistics. Wiley, 2003.

Aryeh Dvoretzky, Jack Kiefer, and Jacob Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. Annals of Mathematical Statistics, 27(3):642–669, 1956.
Clara Fannjiang, Stephen Bates, Anastasios N. Angelopoulos, Jennifer Listgarten, and Michael I. Jordan. Conformal prediction under feedback covariate shift for biomolecular design. Proceedings of the National Academy of Sciences, 119(43):e2204569119, 2022. doi: 10.1073/pnas.2204569119. URL https://arxiv.org/abs/2202.03613.

Seymour Geisser. Predictive Inference: An Introduction. Chapman and Hall, London, 1993.

Isaac Gibbs, John J. Cherian, and Emmanuel J. Candès. Conformal prediction with conditional guarantees. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 87(4):1100–1126, 2025. doi: 10.1093/jrsssb/qkaf008. To appear.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, Cambridge, MA, 2016. ISBN 9780262035613.

Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963. doi: 10.1080/01621459.1963.10500830.

Ruili Huang, Menghang Xia, Dac-Trung Nguyen, Jianying Zhao, Srilatha Sakamuru, Tingjun Zhao, Maxine Na, Swati A. Shahane, Anastasia Rossoshek, and Anton Simeonov. The Tox21 data challenge to build predictive models of nuclear receptor and stress response pathways. Nature Biotechnology, 34(8):828–837, 2016. doi: 10.1038/nbt.3659.

Norman L. Johnson, Samuel Kotz, and Narayanaswamy Balakrishnan. Discrete Multivariate Distributions. Wiley, New York, 1997. ISBN 978-0-471-31250-3.

Sandeep Kalepu and Vijaykumar Nekkanti. Insoluble drug delivery strategies: review of recent advances and business prospects. Acta Pharmaceutica Sinica B, 5(5):442–453, 2015. doi: 10.1016/j.apsb.2015.07.003.

Shayan Kiyani, George J. Pappas, Aaron Roth, and Hamed Hassani. Decision theoretic foundations for conformal prediction: Optimal uncertainty quantification for risk-averse agents. In Forty-second International Conference on Machine Learning, 2025.
URL https://openreview.net/forum?id=Ukjl86EsIk.

Greg Landrum. RDKit: Open-source cheminformatics. https://www.rdkit.org, 2025. [Online; accessed 2026-02-13].

Jing Lei, Max G'Sell, Alessandro Rinaldo, Ryan J. Tibshirani, and Larry Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111, 2018. doi: 10.1080/01621459.2017.1307116.

Jordan Lekeufack, Anastasios N. Angelopoulos, Andrea Bajcsy, Michael I. Jordan, and Jitendra Malik. Conformal decision theory: Safe autonomous decisions from imperfect predictions. arXiv preprint arXiv:2310.05921, 2023.

Paulo C. F. Marques. Universal distribution of the empirical coverage in split conformal prediction. Statistics & Probability Letters, 219:110350, 2025. ISSN 0167-7152. doi: 10.1016/j.spl.2024.110350. URL https://www.sciencedirect.com/science/article/pii/S0167715224003195.

Pascal Massart. The tight constant in the Dvoretzky–Kiefer–Wolfowitz inequality. Annals of Probability, 18(3):1269–1283, 1990. doi: 10.1214/aop/1176990746.

Andreas Mayr, Günter Klambauer, Thomas Unterthiner, and Sepp Hochreiter. DeepTox: Toxicity prediction using deep learning. Frontiers in Environmental Science, 3:80, 2016. doi: 10.3389/fenvs.2015.00080.

Kaisa Miettinen. Nonlinear Multiobjective Optimization. Kluwer Academic Publishers, Boston, 1999.

Harris Papadopoulos, Kostas Proedrou, Volodya Vovk, and Alex Gammerman. Inductive confidence machines for regression. In Tapio Elomaa, Heikki Mannila, and Hannu Toivonen (eds.), Machine Learning: ECML 2002, volume 2430 of Lecture Notes in Computer Science, pp. 345–356. Springer, 2002. doi: 10.1007/3-540-36755-1_29.

Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. CatBoost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, volume 31, pp.
6638–6648, 2018.

David Rogers and Mathew Hahn. Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 50(5):742–754, 2010. doi: 10.1021/ci100050t.

Yaniv Romano, Evan Patterson, and Emmanuel J. Candès. Conformalized quantile regression, pp. 1–11. Curran Associates Inc., Red Hook, NY, USA, 2019.

Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9:371–421, 2008.

Murat Cihan Sorkun, Abhishek Khetan, and Süleyman Er. AqSolDB: a curated reference set of aqueous solubility and 2D descriptors for a diverse set of compounds. Scientific Data, 6:143, 2019. doi: 10.1038/s41597-019-0151-1. URL https://www.nature.com/articles/s41597-019-0151-1.

Ryan J. Tibshirani, Rina Foygel Barber, Emmanuel J. Candès, and Aaditya Ramdas. Conformal prediction under covariate shift. In Advances in Neural Information Processing Systems, volume 32, 2019. URL https://arxiv.org/abs/1904.06019.

Vladimir Vovk. Conditional validity of inductive conformal predictors. arXiv, 2012a. doi: 10.48550/arXiv.1209.2673. Extended version of ACML 2012 paper.

Vladimir Vovk. Conditional validity of inductive conformal predictors. Journal of Machine Learning Research, 13:955–997, 2012b. URL https://www.jmlr.org/papers/v13/vovk12a.html.

Vladimir Vovk. Cross-conformal predictors. Annals of Mathematics and Artificial Intelligence, 74:9–28, 2015. doi: 10.1007/s10472-014-9423-0.

Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, New York, 2005. doi: 10.1007/b106715.

Fan Yang and Arun K. Kuchibhotla. Finite-sample efficient conformal prediction. Annals of Statistics, 49(5):2921–2947, 2021.

Wenbin Zhou and Shixiang Zhu. Calibrating decision robustness via inverse conformal risk control. 2025.
