Learning Decision-Sufficient Representations for Linear Optimization

Authors: Yuhan Ye (MIT, yyh03@mit.edu), Saurabh Amin (MIT, amins@mit.edu), Asuman Ozdaglar (MIT, asuman@mit.edu)

Abstract

We study how to construct compressed datasets that suffice to recover optimal decisions in linear programs with an unknown cost vector $c$ lying in a prior set $\mathcal{C}$. Recent work by Bennouna et al. (2025a) provides an exact geometric characterization of sufficient decision datasets (SDDs) via an intrinsic decision-relevant dimension $d^\star$. However, their algorithm for constructing minimum-size SDDs requires solving mixed-integer programs. In this paper, we establish hardness results showing that computing $d^\star$ is NP-hard and that deciding whether a dataset is globally sufficient is coNP-hard, thereby resolving an open problem posed in Bennouna et al. (2026). To address this worst-case intractability, we introduce pointwise sufficiency, a relaxation that requires sufficiency only for an individual cost vector. Under nondegeneracy, we provide a polynomial-time cutting-plane algorithm for constructing pointwise-sufficient decision datasets. In a data-driven regime with i.i.d. costs, we further propose a cumulative algorithm that aggregates decision-relevant directions across samples, yielding a stable compression scheme of size at most $d^\star$. This leads to a distribution-free PAC guarantee: with high probability over the training sample, the pointwise-sufficiency failure probability on a fresh draw is at most $\tilde{O}(d^\star/n)$, and this rate is tight up to logarithmic factors. Finally, we apply decision-sufficient representations to contextual linear optimization, obtaining compressed predictors with generalization bounds scaling as $\tilde{O}(\sqrt{d^\star/n})$ rather than $\tilde{O}(\sqrt{d/n})$, where $d$ is the ambient cost dimension.
Keywords: Linear programming, Sample compression, PAC learning, Computational complexity, Decision-focused learning, Contextual optimization, Polyhedral geometry

1 Introduction

In many real-world decision problems, the optimization objective depends on parameters that are not directly observable and must be inferred from data. With the surge in available data, it has become common to use empirical evidence alongside contextual knowledge to support decision-making. The main challenge is to characterize the smallest set of information needed to identify an optimal decision and to recover it efficiently from finite samples via computationally tractable algorithms. Motivated by this, this paper studies the following fundamental question:

Which and how many objective measurements are sufficient to identify an optimal decision, and can we learn such measurements with provable guarantees in polynomial time?

We formulate the decision-making problem as a linear optimization with an unknown cost vector and a known feasible region. The decision-maker solves
$$\min_{x \in \mathcal{X}} c^\top x, \qquad (1)$$
where $\mathcal{X} := \{x \in \mathbb{R}^d : Ax = b,\ x \ge 0\}$ is a nonempty bounded polytope for some $A \in \mathbb{R}^{m \times d}$ with full row rank and $b \in \mathbb{R}^m$. The objective vector $c \in \mathbb{R}^d$ is unknown, but it is known a priori to lie in an uncertainty set $\mathcal{C} \subset \mathbb{R}^d$. Rather than observing $c$ directly, the decision-maker may adaptively choose query directions $\mathcal{D} = \{q_1, \ldots, q_l\} \subset \mathbb{R}^d$ sequentially and observe the inner products $s(c; \mathcal{D}) := (q_1^\top c, \ldots, q_l^\top c) \in \mathbb{R}^l$ as linear measurements of the objective $c$. Given observations $s \in \mathbb{R}^l$, we restrict the plausible cost vectors to the fiber $\mathcal{C}(\mathcal{D}, s) := \{c' \in \mathcal{C} : q_i^\top c' = s_i,\ i = 1, \ldots, l\}$. Acquiring objective information is often the bottleneck: observing or predicting the full $d$-dimensional vector $c$ can be unnecessary when optimal decisions depend only on a few decision-relevant directions.
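As a toy numerical illustration of the measurement map $s(c;\mathcal{D})$ and the fiber $\mathcal{C}(\mathcal{D}, s)$, the following sketch (our own example, with an instance and prior chosen purely for illustration) shows that every cost vector consistent with a single well-chosen measurement induces the same optimal decision:

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance: X = [0,1]^2, so the LP is min c^T x over the unit square.
def solve_lp(c):
    res = linprog(c, bounds=[(0, 1), (0, 1)], method="highs")
    return np.round(res.x, 8)

def measurements(c, D):
    # s(c; D) = (q_1^T c, ..., q_l^T c)
    return np.array([q @ c for q in D])

c_true = np.array([1.0, 2.0])
D = [np.array([1.0, -1.0])]          # a single query direction
s = measurements(c_true, D)           # observed linear measurements

# Sample cost vectors from the fiber C(D, s): all c' in the (illustrative)
# prior box with c'_1 - c'_2 = s_1.  Every such c' has both coordinates
# positive, so the optimizer (0, 0) is unchanged: one query already pins
# down the decision even though c itself is not identified.
rng = np.random.default_rng(0)
for _ in range(20):
    c2 = 1.5 + 1.5 * rng.random()     # pick c'_2, then force the measurement
    c_prime = np.array([c2 + s[0], c2])
    assert np.allclose(solve_lp(c_prime), solve_lp(c_true))
```

The prior box here is chosen so that the fiber stays inside one optimality cone; the paper's algorithms construct such queries without this hand-tuning.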
A central geometric message is that decision-making depends on $c$ only through a low-dimensional set of decision-relevant directions. In particular, for open convex $\mathcal{C}$, Bennouna et al. (2025a) provide an exact geometric characterization: a dataset is globally sufficient on $\mathcal{C}$ if and only if its span contains the decision-relevant subspace $W^\star$ (formally introduced in Equation (3)), whose intrinsic dimension $d^\star$ is often much smaller than the ambient dimension $d$. However, turning this characterization into an efficient procedure appears difficult; their Algorithm 2 constructs a minimum-size global sufficient dataset by solving a mixed-integer program at each iteration. In recent work (Bennouna et al., 2026), whether a basis of $W^\star$ can be constructed by a polynomial-time algorithm is stated as an open problem. In Section 4, we show that computing $d^\star$ is NP-hard (Theorem 5) and that even the weakest global sufficiency test—deciding whether the empty dataset $\mathcal{D} = \emptyset$ is globally sufficient—is coNP-hard already when $\mathcal{C}$ is an open polyhedron specified in H-representation[1] (Theorem 10). Assuming P ≠ NP, this gives a negative answer to the open problem in Bennouna et al. (2026): it is computationally hard in general to construct a basis of $W^\star$ and hence a minimum-size global sufficient dataset.

The hardness results establish a computational barrier for worst-case global sufficiency but leave open two important questions: (i) Can we efficiently construct sufficient datasets for individual cost instances? (ii) In a data-driven regime with typical costs, can we learn datasets with good average-case guarantees? We answer both affirmatively, demonstrating a computational-statistical gap between worst-case complexity and average-case learnability.
Following the data-driven algorithm design paradigm (Gupta and Roughgarden, 2017, 2020; Balcan, 2021), we relax from global (uniform) sufficiency over all $c \in \mathcal{C}$ to a distributional notion: we assume costs are drawn i.i.d. from an unknown distribution $P_c$ supported on $\mathcal{C}$, and aim to learn a dataset $\hat{\mathcal{D}}$ that is sufficient for a fresh draw $c \sim P_c$ with high probability. We address this via two intermediate steps: (i) a tractable pointwise relaxation for individual instances (answering question (i) above), and (ii) a cumulative learning algorithm with PAC guarantees (answering question (ii)). This leads to a sample-complexity question:

How many training instances are needed to learn such a $\hat{\mathcal{D}}$ in polynomial time?

To obtain a polynomial-time learning algorithm at the level of realized cost-vector samples, we first introduce pointwise sufficiency: a dataset $\mathcal{D}$ is pointwise sufficient at $c$ if every $c' \in \mathcal{C}(\mathcal{D}, s(c; \mathcal{D}))$ induces the same set of optimal decisions. This relaxation is tractable because it only requires the fiber to lie within one optimality cone, rather than distinguishing decisions across all $c \in \mathcal{C}$. Under a standard nondegeneracy assumption on $\mathcal{X}$, checking pointwise sufficiency reduces to a geometric containment test that solves at most $d - m$ instances of the face-intersection (FI) subproblem for a fixed optimal basis. When $\mathcal{C}$ is a polytope given in explicit H-representation or an ellipsoid, each FI call can be solved in polynomial time, yielding an overall polynomial-time routine; see Property 21. In Section 5, we develop an adaptive cutting-plane algorithm (Algorithm 1) that discovers pointwise-sufficient datasets.
The key algorithmic innovation is our facet-hit cutting-plane rule: when the containment test fails, we identify the first facet of the optimality cone encountered along the segment joining an interior point to the witness of failure. This ensures the queried facet normal is genuinely decision-relevant. A naive approach that queries arbitrary violated constraints can fail, as we demonstrate via a counterexample in Example 15. Each query produces a direction linearly independent of previous ones, and the procedure terminates after at most $d^\star$ queries.

In Section 6, in a data-driven regime with i.i.d. costs, we run the pointwise routine cumulatively with warm starts (Algorithm 2) and expand the measurement set only on a small subset of "hard" instances. This yields a realizable stable compression scheme (Hanneke and Kontorovich, 2021) with compression size at most $d^\star$. Consequently, with probability at least $1 - \delta$ over $n$ training samples, the pointwise-sufficiency failure probability on a fresh draw is at most $\tilde{O}(d^\star/n)$ (Theorem 25). This fast rate exploits realizability: our algorithm achieves zero empirical sufficiency loss via the compression framework, contrasting with data-driven projection approaches (Balcan et al., 2024a; Sakaue and Oki, 2024) that control objective-value gaps and obtain characteristic $\tilde{O}(1/\sqrt{n})$ rates via uniform convergence. The frameworks are complementary: ours certifies when projections preserve decisions; theirs bound how close projected values are to optimal.

In practice, many decision systems repeatedly solve linear programs over a fixed feasible polytope $\mathcal{X}$ while the cost vector $c \in \mathcal{C}$ varies.

[1] An H-polyhedron is specified by finitely many linear inequalities, e.g., $P = \{x \in \mathbb{R}^d : Hx \le h\}$. A polytope is a bounded polyhedron; we use H-polytope when emphasizing an inequality representation of a polytope. An open polyhedron is obtained by replacing non-strict inequalities by strict ones.
Examples include (i) repeated LP with time-varying costs (routing, resource allocation), (ii) contextual linear optimization, where one learns a predictor for $c$ and plugs it into the LP, and (iii) preference-based decision making, where feedback is limited to comparisons or other aggregate signals. This motivates the two-stage pipeline below:

• Stage I (Representation Discovery). Learn $W^\star$ from i.i.d. training instances via adaptive linear queries, together with a distribution-free certificate on the probability of sufficiency failure.

• Stage II (Task-Specific Deployment). Use the learned subspace for dimension reduction in repeated LPs, contextual LPs, and preference-based decision making.

In Section 7, as an illustration, we integrate this framework into SPO+ training for contextual linear optimization by restricting to the discovered decision-relevant subspace. This reduces the dimension parameter appearing in the generalization bound (see Theorem 27).

We summarize our main contributions as follows.

• Hardness results. Building on the SDD geometry of Bennouna et al. (2025a), we show that computing the intrinsic decision-relevant dimension $d^\star$ is NP-hard (Theorem 5). We also prove that deciding whether the empty dataset $\mathcal{D} = \emptyset$ is globally sufficient (in the decision sense) is coNP-hard (Theorem 10); for open convex $\mathcal{C}$, this yields that constructing a minimum-size global SDD is NP- and coNP-hard (Corollary 11). Finally, verifying pointwise sufficiency is coNP-hard in general (Theorem 9).

• Per-instance level: finding pointwise SDDs in polynomial time. Under nondegeneracy, we design a facet-hit cutting-plane algorithm (Algorithm 1) that adaptively queries decision-relevant directions. The facet-hit rule ensures each query is linearly independent and lies in $W^\star$, returning a pointwise-sufficient dataset (Theorem 20) in polynomial time.
• Distributional level: learning decision-sufficient representations with fast rates. Warm-starting the pointwise routine over i.i.d. costs (Algorithm 2) yields a realizable stable compression scheme (Hanneke and Kontorovich, 2021) with compression size at most $d^\star$. This leads to a distribution-free PAC certificate scaling as $\tilde{O}(d^\star/n)$ (Theorem 25), matching our lower bound (Theorem 26). The fast rate exploits zero empirical loss and the linear independence of decision-relevant queries.

• Application to contextual linear optimization. We integrate decision-sufficient representations into predictor training for contextual linear optimization, yielding a compressed prediction model with improved generalization guarantees. In particular, the dimension term in the bound is reduced from $d$ to $d^\star$ (Theorem 27).

2 Related literature

Dimension reduction and model compression for LP. Dimension reduction is a classical tool for coping with high-dimensional optimization and learning. Random projections and sketching preserve geometry with high probability (e.g., Johnson–Lindenstrauss (Johnson and Lindenstrauss, 1984); see (Woodruff, 2014)) and are ubiquitous in numerical linear algebra as well as learning-theoretic analyses (Bartlett et al., 2022). For linear programs, random projections can reduce problem size while approximately preserving feasibility and objective values (Vu et al., 2018; Poirion et al., 2023), and Sakaue and Oki (2024) propose data-driven projections with generalization guarantees. In contrast, we seek exact optimizer recovery over a known polytope $\mathcal{X}$ by identifying the directions that can change the optimal solution; we quantify the intrinsic decision dimension by $d^\star$. Our work is most closely related to Bennouna et al. (2025a, 2026), which characterizes global sufficient decision datasets for LP under convex open priors and gives an iterative construction with a mixed-integer program at each step.
Data-driven algorithm design. Beyond-worst-case analysis argues that worst-case complexity can be overly pessimistic and instead advocates structured or distributional models of "relevant" instances (Roughgarden, 2019, 2021). Representative examples include perturbation-resilient instances (Bilu and Linial, 2012), smoothed analysis (Spielman and Teng, 2004), and planted or semi-random models (Blum and Spencer, 1995). Data-driven algorithm design is a principled ML instantiation of this viewpoint, learning an algorithmic object (e.g., an algorithm family or configuration) from i.i.d. samples while retaining provable performance guarantees (Gupta and Roughgarden, 2020; Balcan, 2021; Gupta and Roughgarden, 2017; Balcan et al., 2017, 2024a,b). Under bounded loss, typical results yield uniform-convergence generalization gaps on the order of $\tilde{O}(\sqrt{\mathrm{Pdim}(\mathcal{A})/n})$, and sharper bounds can be obtained via refined complexity notions such as dispersion and "knife-edge" structure (Balcan et al., 2018, 2020). Our setting fits this paradigm by treating the queried directions as the learned object. For our specific problem, we exploit LP geometry to derive a stable compression scheme, yielding fast-rate certificates scaling as $\tilde{O}(d^\star/n)$. At a technical level, our geometry-aware cutting-plane viewpoint is also reminiscent of oracle-based frameworks that access constraints through separation (Mhammedi, 2025).

Compression-based generalization. Compression-based analyses provide distribution-free generalization and scenario-type certificates, often yielding fast $k/n$-type rates when the learned object admits a small compression of size $k$ (Valiant, 1984; Littlestone and Warmuth, 1986; Floyd and Warmuth, 1995; Moran and Yehudayoff, 2016; Graepel et al., 2005). In the realizable regime, stable compression can further sharpen logarithmic factors (Bousquet et al., 2020; Hanneke and Kontorovich, 2021).
In fully agnostic settings, however, $1/n$ rates are impossible in general: sharp lower bounds show worst-case rates of order $\Theta(\sqrt{k \log(n/k)/n})$ (Hanneke and Kontorovich, 2019). Our cumulative algorithm in Section 6 fits naturally into this view: each newly queried direction is triggered by a "hard" sample revealing a genuinely new decision-relevant facet, producing a compression set of size at most $d^\star$ and a fast-rate certificate.

Polyhedral containment. Our hardness results rely on classical complexity phenomena in polyhedral computation. In particular, deciding containment of an H-polytope in a V-polytope is coNP-complete (Freund and Orlin, 1985; Gritzmann and Klee, 1993). Sum-of-squares certificates provide powerful relaxations for containment questions and yield refined results for structured instances (Kellner and Theobald, 2016). Since pointwise sufficiency can be phrased as a containment statement that a data-consistent slice lies within an optimality region, these results naturally connect to the verification problem studied in our paper.

Contextual optimization. Our paper is also motivated by decision-focused (predict-then-optimize) learning, where one learns predictive models to support downstream optimization (Elmachtoub and Grigas, 2022; El Balghiti et al., 2023); see also the survey (Sadana et al., 2025). Beyond the batch setting, recent work studies online and active learning variants of contextual linear optimization, including margin-based active learning (Liu et al., 2023) and online contextual decision-making with SPO-type surrogates (Liu and Grigas, 2022); for background on active learning, see Dasgupta (2011); Hanneke (2014). Bandit/partial-feedback formulations are studied in Hu et al. (2024), and related safe-exploration objectives appear in safe linear bandits over unknown polytopes (Gangrade et al., 2024). On the statistical side, El Balghiti et al.
(2023) derive uniform generalization bounds for the SPO loss via the Natarajan dimension of the induced decision class; in the polyhedral case, their bounds depend only logarithmically on the number of extreme points, yielding rates on the order of $\tilde{O}(\sqrt{p\,d \log(n|\mathcal{X}^\angle|)/n})$ for linear predictors. Recent work addresses model misspecification in contextual optimization (Bennouna et al., 2025b) and benign generalization behavior in stochastic linear optimization under quadratically bounded losses (Telgarsky, 2022); see also Schutte et al. (2025) on sufficient proxy representations.

Active learning and adaptive measurement. Active learning studies how to adaptively acquire information—labels, features, or more general tests—to reduce query cost relative to passive sampling. A central idea is to query only where candidate hypotheses disagree, as formalized by disagreement-based active learning (Hanneke, 2014); see also Dasgupta (2011). Classical work distinguishes sample complexity from label (query) complexity in agnostic active learning (Balcan et al., 2006, 2009). Our per-instance model is closer in spirit to arbitrary-query and experimental-design views of active learning (Kulkarni et al., 1993; Cohn et al., 1996), since we design linear measurements $q^\top c$ of a possibly unknown $c$. Frameworks accounting for heterogeneous query costs are also related to our measurement budget (Guillory and Bilmes, 2009, 2010). At the intersection with decision-focused learning, Liu et al. (2023) study margin-based active learning for contextual linear optimization. In contrast, we leverage polyhedral LP geometry: our facet-hit rule queries a violated optimality-cone facet normal, guaranteeing exact decision identification after at most $d^\star$ measurements and enabling a stable-compression view across i.i.d. instances.
3 Preliminaries: Characterizing sufficient datasets via LP geometry

Recall that the fundamental question we address is: which datasets $\mathcal{D}$ are sufficient to solve the LP? More precisely, when do the observations $s(c; \mathcal{D})$, together with the prior restriction $c \in \mathcal{C}$, contain enough information to recover an optimal decision? Following Bennouna et al. (2025a), we adopt a global notion of informativeness in which a single fixed dataset $\mathcal{D}$ must work uniformly for all $c \in \mathcal{C}$. We now formalize this notion.

Definition 1 (Global sufficient decision dataset). A dataset $\mathcal{D} := \{q_1, \ldots, q_l\} \subseteq \mathbb{R}^d$ is a sufficient decision dataset (SDD) for the uncertainty set $\mathcal{C} \subseteq \mathbb{R}^d$ and decision set $\mathcal{X}$ if there exists a decision rule $\hat{X} : \mathbb{R}^l \to \mathcal{P}(\mathcal{X})$, where $\mathcal{P}(\mathcal{X})$ denotes the collection of subsets of $\mathcal{X}$, such that
$$\forall c \in \mathcal{C}, \quad \hat{X}\big(c^\top q_1, \ldots, c^\top q_l\big) = \arg\min_{x \in \mathcal{X}} c^\top x.$$

To characterize sufficient datasets, we recall a few notions from LP geometry. Let $\mathcal{X}^\angle$ denote the set of extreme points of $\mathcal{X}$. For $x^\star \in \mathcal{X}^\angle$, define the feasible direction cone
$$FD(x^\star) := \{\delta \in \mathbb{R}^d : \exists \varepsilon > 0,\ x^\star + \varepsilon\delta \in \mathcal{X}\}$$
and the optimality cone
$$\Lambda(x^\star) := \{c \in \mathbb{R}^d : x^\star \in \arg\min_{x \in \mathcal{X}} c^\top x\}.$$
Let $D(x^\star)$ be the set of extreme directions of the polyhedral cone $FD(x^\star)$. For $\delta \in D(x^\star)$, define the corresponding boundary face
$$F(x^\star, \delta) := \Lambda(x^\star) \cap \{\delta\}^\perp = \{c \in \Lambda(x^\star) : c^\top \delta = 0\},$$
and the set of relevant extreme directions
$$\Delta(\mathcal{X}, \mathcal{C}) := \big\{\delta \in \mathbb{R}^d : \exists x^\star \in \mathcal{X}^\angle,\ \delta \in D(x^\star),\ \text{and } F(x^\star, \delta) \cap \mathcal{C} \neq \emptyset\big\}. \qquad (2)$$
Equivalently, $\delta$ is relevant if the corresponding boundary face between optimality regions is attainable by some cost in $\mathcal{C}$. For any cost vector $c$, let $X^\star(c) := \arg\min_{x \in \mathcal{X}} c^\top x$ denote the (possibly set-valued) set of optimal solutions. For a set $C \subseteq \mathbb{R}^d$, define
$$X^\star(C) := \bigcup_{c \in C} X^\star(c) \cap \mathcal{X}^\angle, \qquad \mathrm{dir}(X^\star(C)) := \mathrm{span}\{x - x' : x, x' \in X^\star(C)\}. \qquad (3)$$

We refer to $W^\star := \mathrm{dir}(X^\star(\mathcal{C}))$ as the decision-relevant subspace, and set $d^\star := \dim(W^\star)$. The following two results are due to Bennouna et al. (2025a). The first theorem states that global sufficiency is equivalent to spanning all directions along which optimality can change, as formalized by $\Delta(\mathcal{X}, \mathcal{C})$.

Theorem 2 (Bennouna et al. (2025a), Theorem 1). Let $\mathcal{C}$ be open and convex. A dataset $\mathcal{D}$ is an SDD for $(\mathcal{X}, \mathcal{C})$ if and only if $\Delta(\mathcal{X}, \mathcal{C}) \subseteq \mathrm{span}(\mathcal{D})$.

While $\Delta(\mathcal{X}, \mathcal{C})$ is defined via faces of optimality cones, the next theorem gives an equivalent characterization directly in decision space: the span of these relevant directions matches the span of differences of reachable optima.

Theorem 3 (Bennouna et al. (2025a), Theorem 2). For any convex set $\mathcal{C} \subset \mathbb{R}^d$, $\mathrm{span}(\Delta(\mathcal{X}, \mathcal{C})) = \mathrm{dir}(X^\star(\mathcal{C}))$.

Combining Theorems 2 and 3 yields the following subspace characterization, which we will use throughout.

Corollary 4 (Subspace characterization, Bennouna et al. (2025a), Corollary 1). Let $\mathcal{C}$ be open and convex. A dataset $\mathcal{D}$ is an SDD for $(\mathcal{X}, \mathcal{C})$ if and only if $\mathrm{dir}(X^\star(\mathcal{C})) \subseteq \mathrm{span}(\mathcal{D})$; hence the minimal size of a global SDD is $d^\star$.

In particular, the necessity direction of Theorem 2 uses that $\mathcal{C}$ is open. Accordingly, any argument or algorithm that invokes this direction requires an openness assumption on $\mathcal{C}$. In contrast, Theorem 3 requires only convexity and applies to both open and closed convex sets.[2]

4 Hardness and Relaxation

4.1 Computational hardness

The geometric characterizations in Section 3 are information-theoretic in nature: they identify the subspace that any global sufficient dataset must capture. Algorithmically, a natural goal is to compute the minimum size of a global SDD and to construct one. By Corollary 4, when $\mathcal{C}$ is open and convex, this minimum size equals the intrinsic decision-relevant dimension $d^\star = \dim\big(\mathrm{dir}(X^\star(\mathcal{C}))\big)$.
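The definition of $\mathrm{dir}(X^\star(\mathcal{C}))$ can be made concrete by brute force on a toy instance (our own illustration with arbitrarily chosen data, not the paper's construction): sample costs from the prior, collect the optimizers they reach, and take the rank of the span of their differences.

```python
import numpy as np
from scipy.optimize import linprog

# Brute-force sketch of dir(X*(C)) and d* for X = [0,1]^2.
def optimizer(c):
    res = linprog(c, bounds=[(0, 1), (0, 1)], method="highs")
    return tuple(float(v) for v in np.round(res.x, 8))

# Illustrative prior C = (-1,1) x (0.5,2): the sign of c_1 flips the optimal
# x_1, while c_2 > 0 pins x_2 = 0.  The reachable optima are (0,0) and (1,0),
# so d* = 1 even though the ambient cost dimension is d = 2.
rng = np.random.default_rng(1)
optima = {optimizer(np.array([rng.uniform(-1, 1), rng.uniform(0.5, 2)]))
          for _ in range(200)}
diffs = np.array([np.subtract(a, b) for a in optima for b in optima])
d_star = np.linalg.matrix_rank(diffs)
print(sorted(optima), d_star)   # [(0.0, 0.0), (1.0, 0.0)] 1
```

This sampling estimate can only undercount $d^\star$ (directions whose boundary faces are hit with probability zero are missed), which is exactly why the paper's exact characterization and query algorithms are needed.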
We state informal versions of our hardness results below and provide brief proof sketches to highlight the structure of the reductions; full formal statements and complete proofs are deferred to Appendix A. Our reductions construct highly structured instances: shortest-path flow polytopes with budgeted arc-length perturbations.

Theorem 5 (Informal). It is NP-hard to compute the intrinsic decision-relevant dimension $d^\star$, given as input a polytope $\mathcal{X} \subseteq \mathbb{R}^d$ and a polyhedral uncertainty set $\mathcal{C}$ specified in H-representation. This hardness persists whether $\mathcal{C}$ is given as a closed polyhedron or as an open polyhedron.

[2] This distinction is important: our pointwise-sufficiency algorithms (Section 5) leverage Theorem 3 under a closed convex prior, while the global SDD size characterization requires $\mathcal{C}$ to be open and convex.

Proof sketch for Theorem 5. We start from 3-SAT and use the PISPP-W+ construction of (Ley and Merkert, 2025, Theorem 3.1). Given a formula $\varphi$, the reduction outputs a directed acyclic graph (DAG) together with baseline arc-lengths $d$, a budget vector $\kappa$, a budget $B$, and a required arc $r$, defining a budgeted uncertainty set of admissible length vectors $c = d + w$ (with $w \in \mathbb{Q}^A_{\ge 0}$ and $\kappa^\top w \le B$). The formula $\varphi$ is satisfiable if and only if there exists such a modification $w$ for which some shortest $s$–$t$ path with respect to $d + w$ contains the required arc $r$. We then show that deciding the feasibility of the resulting PISPP-W+ instance reduces to checking whether $\dim(\mathrm{dir}(X^\star(\mathcal{C}))) > \dim(\mathrm{dir}(X^\star_r(\mathcal{C})))$, where $\mathcal{X}_r := \{x \in \mathcal{X} : x_r = 0\}$; see (16). Geometrically, restricting to the face $x_r = 0$ removes a decision-relevant extreme direction if and only if there exists a feasible modification that makes some shortest $s$–$t$ path use $r$. For the open set $\mathcal{C}^{op}$, Lemma 32 "opens" the budget via a small topological perturbation without changing the shortest-path structure, thus preserving the answer. □

When $\mathcal{C}$ is open and convex (in particular, for the open polyhedral instances in Theorem 5), combining Corollary 4 and Theorem 5 implies that both computing the size of a minimum global SDD and constructing a minimum global SDD are NP-hard.

4.2 A relaxation: pointwise sufficient datasets

The computational hardness of finding minimum global SDDs motivates us to relax the concept to an instance-wise notion:

Definition 6 (Pointwise sufficient decision dataset). Fix a (possibly unknown) $c \in \mathcal{C}$. A finite query dataset $\mathcal{D}$ is pointwise sufficient at $c$ if there exists a decision $x^\star \in \mathcal{X}$ such that
$$x^\star \in X^\star(c') \quad \forall c' \in \mathcal{C}\big(\mathcal{D}, s(c; \mathcal{D})\big).$$
Equivalently, the data-consistent fiber $\mathcal{C}(\mathcal{D}, s(c; \mathcal{D}))$ is contained in a single optimality region of $\mathcal{X}$.

Remark 7. Since $c \in \mathcal{C}(\mathcal{D}, s(c; \mathcal{D}))$, Definition 6 can equivalently be stated as: $\mathcal{D}$ is pointwise sufficient at $c$ if there exists $x^\star \in X^\star(c)$ such that $x^\star$ remains optimal for every $c' \in \mathcal{C}$ that produces the same measurements, i.e., $s(c'; \mathcal{D}) = s(c; \mathcal{D})$. This parallels the structure of Definition 1 but applies to a single cost vector rather than all $c \in \mathcal{C}$.

The following basic properties of pointwise sufficiency follow immediately from the definition.

Property 8. (i) Monotonicity. If $\mathcal{D}$ is pointwise sufficient at $c$ and $\mathcal{D}' \supseteq \mathcal{D}$, then $\mathcal{D}'$ is also pointwise sufficient at $c$, since $\mathcal{C}(\mathcal{D}', s(c; \mathcal{D}')) \subseteq \mathcal{C}(\mathcal{D}, s(c; \mathcal{D}))$. (ii) Global ⇒ pointwise. If $\mathcal{D}$ is an SDD for $(\mathcal{X}, \mathcal{C})$, then $\mathcal{D}$ is pointwise sufficient at every $c \in \mathcal{C}$.

Proof of Property 8 (i). Let $\mathcal{D} \subseteq \mathcal{D}'$ and fix $c \in \mathcal{C}$. Write $s := s(c; \mathcal{D})$ and $s' := s(c; \mathcal{D}')$. By the definition of the fiber, $\mathcal{C}(\mathcal{D}', s') = \{c' \in \mathcal{C} : q^\top c' = q^\top c\ \forall q \in \mathcal{D}'\}$. Since $\mathcal{D} \subseteq \mathcal{D}'$, any $c' \in \mathcal{C}(\mathcal{D}', s')$ satisfies $q^\top c' = q^\top c$ for all $q \in \mathcal{D}$, and hence $c' \in \mathcal{C}(\mathcal{D}, s)$. Therefore $\mathcal{C}(\mathcal{D}', s') \subseteq \mathcal{C}(\mathcal{D}, s)$. Now assume $\mathcal{D}$ is pointwise sufficient at $c$.
Then there exists $x^\star \in \mathcal{X}$ such that $x^\star \in X^\star(c'')$ for all $c'' \in \mathcal{C}(\mathcal{D}, s)$. By the containment above, the same $x^\star$ is optimal for all $c'' \in \mathcal{C}(\mathcal{D}', s')$, so $\mathcal{D}'$ is also pointwise sufficient at $c$.

Proof of Property 8 (ii). Let $\mathcal{D}$ be a global SDD for $(\mathcal{X}, \mathcal{C})$ in the sense of Definition 1. By definition, there exists a mapping $\hat{X} : \mathbb{R}^{|\mathcal{D}|} \to \mathcal{P}(\mathcal{X})$ such that for every $c \in \mathcal{C}$, $\hat{X}(s(c; \mathcal{D})) = X^\star(c)$. Fix any $c \in \mathcal{C}$ and write $s := s(c; \mathcal{D})$. For any $c' \in \mathcal{C}(\mathcal{D}, s)$ we have $s(c'; \mathcal{D}) = s$ by definition of the fiber, and therefore $\hat{X}(s) = \hat{X}(s(c'; \mathcal{D})) = X^\star(c')$. In particular, $\hat{X}(s)$ is nonempty, and any choice of $x^\star \in \hat{X}(s)$ satisfies $x^\star \in X^\star(c')$ for all $c' \in \mathcal{C}(\mathcal{D}, s)$. Thus the single decision $x^\star$ is optimal for all costs in the fiber, meaning that $\mathcal{D}$ is pointwise sufficient at $c$ in the sense of Definition 6.

A natural verification problem is: given $(\mathcal{X}, \mathcal{C})$, a dataset $\mathcal{D}$, and a cost vector $c \in \mathcal{C}$, decide whether $s(c; \mathcal{D})$ already suffices to determine an optimal decision. The next theorem shows that even the verification problem is intractable in full generality.

Theorem 9 (Informal). It is coNP-hard to decide, given a bounded polytope $\mathcal{X} \subseteq \mathbb{R}^d$, a polyhedral uncertainty set $\mathcal{C} \subseteq \mathbb{R}^d$ specified in H-representation, a dataset $\mathcal{D}$, and a cost vector $c \in \mathcal{C}$, whether $\mathcal{D}$ is pointwise sufficient at $c$.

Proof sketch for Theorem 9. We reduce from the classical H-in-V polytope containment problem.[3] This problem is coNP-complete (Freund and Orlin, 1985; Gritzmann and Klee, 1993) even for the restricted family $P = [-1, 1]^d \subseteq Q$ with $0 \in \mathrm{int}(Q)$. Given such an instance, we construct a bounded standard-form polytope $\mathcal{X} = \{z : Az = b,\ z \ge 0\}$ with a distinguished vertex $x_0$ whose optimality cone encodes $Q$ via a lifting:
$$((y, 1), 0, 0) \in \Lambda(x_0) \iff (y, 1) \in \mathrm{cone}\{(v_i, 1)\}_{i=1}^M \iff y \in Q.$$
We then set $\mathcal{C} := \{((y, 1), 0, 0) : y \in P\}$ and take $\mathcal{D} = \emptyset$, so the data-consistent fiber equals all of $\mathcal{C}$. Since $x_0$ is the unique minimizer for a reference cost $c_0 = ((0, 1), 0, 0)$, pointwise sufficiency at $c_0$ reduces to the containment $\mathcal{C} \subseteq \Lambda(x_0)$, which holds if and only if $P \subseteq Q$. See Appendix A.2. □

The same containment-based construction can be strengthened to show that even the weakest global problem—deciding whether no data ($\mathcal{D} = \emptyset$) already suffices—is coNP-hard under an open, full-dimensional polyhedral prior, as stated next.

Theorem 10 (Informal). It is coNP-hard to decide whether the empty dataset $\mathcal{D} = \emptyset$ is globally sufficient for $(\mathcal{X}, \mathcal{C})$, when $\mathcal{X} \subseteq \mathbb{R}^d$ is a bounded polytope and $\mathcal{C}$ is an open polyhedron specified in H-representation. Moreover, computing the intrinsic decision-relevant dimension $d^\star$ is coNP-hard.

Proof sketch for Theorem 10. Starting from the same containment instance, we replace the low-dimensional slice prior in the pointwise reduction by an open, full-dimensional polyhedron $\mathcal{C}^{op}$ constructed through an effective-cost linear map $T$. The LP over $\mathcal{X}$ is designed so that its objective depends on $c$ only through $T(c)$. We choose an open polyhedron $B^{op}$ of effective costs with strictly positive last coordinate and set $\mathcal{C}^{op} := \{c : T(c) \in B^{op}\}$. In the YES case $P \subseteq Q$, openness implies $B^{op} \subseteq \mathrm{int}(\mathrm{cone}\{(v_i, 1)\}_{i=1}^M)$, making $x_0$ the unique optimizer for every $c \in \mathcal{C}^{op}$ and hence allowing a decoder with $|\mathcal{D}| = 0$. In the NO case, $B^{op}$ contains an effective cost outside $\mathrm{cone}\{(v_i, 1)\}_{i=1}^M$, producing two costs in $\mathcal{C}^{op}$ with different optimizers and ruling out zero-query global sufficiency. See Appendix A.3 for details. □

Consequently, computing the minimum size of global SDDs and outputting such a dataset are coNP-hard. Combining Theorems 5 and 10 with Corollary 4 yields the following final statement.

Corollary 11. For convex and open $\mathcal{C}$, constructing a minimum-size global SDD and computing its minimum size are both NP-hard and coNP-hard.

[3] A V-polytope is a polytope specified by its vertices (a vertex representation), e.g., $Q = \mathrm{conv}\{v_1, \ldots, v_M\}$. The H-in-V containment problem asks whether $P \subseteq Q$ given $(H, h)$ and $\{v_j\}_{j=1}^M$.

The coNP-hardness in Theorem 9 arises from degenerate LP geometry, where a single optimal extreme point may correspond to many bases and its optimality region can have a complicated representation. Going forward, we adopt a standard nondegeneracy assumption to recover tractability.

Assumption 12. The polytope $\mathcal{X} = \{x \in \mathbb{R}^d : Ax = b,\ x \ge 0\}$ is nondegenerate: every extreme point $x^\star \in \mathcal{X}^\angle$ has exactly $m$ strictly positive components.

Under Assumption 12, each extreme point has a single associated optimality cone that admits a simple linear-inequality description. This makes it possible to test whether a fiber $\mathcal{C}(\mathcal{D}, s)$ is contained in a candidate cone by solving only a polynomial number of LPs (as we do in Algorithm 1). In contrast, without Assumption 12, even deciding pointwise sufficiency is coNP-hard by Theorem 9.

5 A Tractable Algorithm that Finds Pointwise SDDs

In this section, we present an adaptive cutting-plane routine that, for a fixed (and possibly unknown) cost vector $c \in \mathcal{C}$, sequentially queries inner products $q^\top c$ and outputs a dataset that is pointwise sufficient at $c$. The routine maintains the data-consistent fiber $\mathcal{C}_k = \mathcal{C}(\mathcal{D}, s(c; \mathcal{D}))$ and repeatedly checks whether $\mathcal{C}_k$ lies inside the optimality cone of a single LP basis; when containment fails, it queries the normal of a reachable violated facet to shrink the fiber. Under Assumption 12, each iteration solves one LP over $\mathcal{X}$ and at most $d - m$ convex minimization subproblems over the current fiber.
Moreover, every newly queried direction is decision-relevant (it lies in dir(X⋆(C))) and is linearly independent of the previous queries; consequently, the procedure makes at most d⋆ queries. Unlike the global SDD characterization in Section 3, pointwise sufficiency is a containment statement about a single fiber C(D, s(c; D)) and does not require C to be open. Accordingly, the only structural assumption we place on the prior set in this section is convexity; we take it closed only for algorithmic convenience (Remark 14). Remaining proofs for this section appear in Appendix B.

Assumption 13. The uncertainty set C ⊆ R^d is convex.

5.1 Pointwise sufficiency as optimality-cone containment

Fix a basis B ⊆ {1, . . . , d} with |B| = m and let N denote its complement. Write A = [A_B A_N], and let A_j be the j-th column of A. The corresponding basic feasible solution is x(B) with x_N(B) = 0 and x_B(B) = A_B^{-1} b. For each nonbasic index j ∈ N, let δ(B, j) ∈ R^d denote the standard edge direction obtained by increasing x_j from 0, i.e., δ_N(B, j) = e_j and δ_B(B, j) = −A_B^{-1} A_j. Under Assumption 12, feasible bases are in one-to-one correspondence with vertices of X. Moreover, for any feasible basis B, the corresponding x(B) is optimal for a cost vector c if and only if all reduced costs (with respect to B) are nonnegative. Equivalently, the optimality region of x(B) is the polyhedral cone

Λ(B) := {c ∈ R^d : c^⊤ δ(B, j) ≥ 0 for all j ∈ N}.   (4)

After querying directions q_1, . . . , q_k, let Q_k = [q_1 · · · q_k] ∈ R^{d×k} and s_k = Q_k^⊤ c. The current fiber is C_k := {c′ ∈ C : Q_k^⊤ c′ = s_k}. By Definition 6, the dataset is pointwise sufficient at c once there exists a basis B such that C_k ⊆ Λ(B). Using (4), this containment reduces to checking that min_{c′ ∈ C_k} (c′)^⊤ δ(B, j) ≥ 0 for all j ∈ N, which motivates the face-intersection subproblem below.
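The reduced-cost test behind the cone (4) is easy to make concrete. The sketch below (an illustrative toy instance; the helper names `edge_directions` and `in_optimality_cone` are ours, not from the paper) computes the edge directions δ(B, j) for a basis B and checks whether a cost vector lies in Λ(B):

```python
import numpy as np

def edge_directions(A, basis):
    """Edge directions delta(B, j) for each nonbasic j:
    delta_N = e_j and delta_B = -A_B^{-1} A_j."""
    d = A.shape[1]
    nonbasis = [j for j in range(d) if j not in basis]
    A_B_inv = np.linalg.inv(A[:, basis])
    deltas = {}
    for j in nonbasis:
        delta = np.zeros(d)
        delta[j] = 1.0
        delta[basis] = -A_B_inv @ A[:, j]
        deltas[j] = delta
    return deltas

def in_optimality_cone(c, A, basis, tol=1e-9):
    """c lies in Lambda(B) iff every reduced cost c^T delta(B, j) is >= 0."""
    return all(c @ delta >= -tol for delta in edge_directions(A, basis).values())

# Toy instance: X = standard simplex {x >= 0 : x1 + x2 + x3 = 1}, so m = 1.
A = np.array([[1.0, 1.0, 1.0]])
basis = [0]                      # vertex x(B) = (1, 0, 0)
print(in_optimality_cone(np.array([0.0, 1.0, 2.0]), A, basis))  # True
print(in_optimality_cone(np.array([2.0, 1.0, 0.0]), A, basis))  # False
```

On the simplex, x(B) = (1, 0, 0) is optimal exactly when c_1 is a smallest coordinate, which is what the two reduced-cost checks above encode.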
5.2 The face-intersection (FI) subproblem

For a direction δ ∈ R^d and a fiber C_k, define the face-intersection subproblem FI(δ; C_k) as

m_k(δ) := min_{c′ ∈ C_k} (c′)^⊤ δ,   c_k^out(δ) ∈ arg min_{c′ ∈ C_k} (c′)^⊤ δ.   (5)

Geometrically, FI(δ; C_k) checks whether C_k intersects the open halfspace {c : c^⊤ δ < 0}. If m_k(δ) < 0, the minimizer c_k^out(δ) certifies that C_k ⊈ {c : c^⊤ δ ≥ 0}. Under Assumption 13, each call to FI(δ; C_k) is a convex optimization problem. When C is polyhedral it is an LP; when C is an ellipsoid, FI(δ; C_k) admits a closed form (Proposition 37), which can speed up implementations.

Remark 14 (Closedness is not necessary). Pointwise sufficiency and Algorithm 1 interact with the prior only through containment tests of the form C_k ⊆ Λ(B). Since Λ(B) is closed, C_k ⊆ Λ(B) holds iff cl(C_k) ⊆ Λ(B). Thus, for these containment tests, one may replace a (possibly nonclosed) fiber by its closure without changing any certification outcome. Moreover, any violating witness c_j^out ∈ C_k \ Λ(B) (whenever it exists) remains valid after this replacement.

5.3 A facet-hit cutting-plane algorithm

Algorithm 1 A cutting-plane algorithm that finds a pointwise SDD
Input: LP data (A, b), prior set C, a cost vector c ∈ C, and an initial dataset D_init ⊂ R^d.
1: Initialize D ← D_init.
2: Let k ← |D|; form Q_k = [q_1 · · · q_k] from D (any fixed order) and set s_k ← Q_k^⊤ c.
3: while true do
4:   Set C_k := {c′ ∈ C : Q_k^⊤ c′ = s_k} and set c^in ← c.
5:   Solve min{(c^in)^⊤ x : Ax = b, x ≥ 0} and obtain an optimal basis B with nonbasis N.
6:   (Containment test via face-intersection subproblem) For each j ∈ N, form δ_j := δ(B, j) and compute
       m_j := min_{c′ ∈ C_k} (c′)^⊤ δ_j,   c_j^out ∈ arg min_{c′ ∈ C_k} (c′)^⊤ δ_j.   (FI(δ_j; C_k))
7:   Let j_0 ∈ arg min_{j ∈ N} m_j; set m_min := m_{j_0} and c^out := c_{j_0}^out.
8:   if m_min ≥ 0 then
9:     Return dataset D and certificate basis B (decision x(B)); break.
10:  else
11:    (Facet-hit rule) For each j ∈ N with (c^out)^⊤ δ_j < 0, set
         α_j := (c^in)^⊤ δ_j / ((c^in)^⊤ δ_j − (c^out)^⊤ δ_j) ∈ [0, 1).
12:    Let α⋆ := min_j α_j and pick any j⋆ ∈ arg min_j α_j.
13:    Set q_{k+1} := δ_{j⋆} and set σ_{k+1} ← q_{k+1}^⊤ c.
14:    Update: D ← D ∪ {q_{k+1}}, Q_{k+1} ← [Q_k q_{k+1}], s_{k+1} ← (s_k^⊤, σ_{k+1})^⊤, k ← k + 1.
15:  end if
16: end while

Algorithm 1 gives the full procedure. At each iteration it anchors at the realized cost vector c (i.e., we take c^in := c ∈ C_k) and solves the LP under c^in to obtain an optimal vertex solution x(B); by Assumption 12, this vertex uniquely determines the corresponding feasible basis B. The algorithm then tests whether every cost vector in the fiber is contained in the corresponding cone Λ(B), by evaluating each facet inequality via the face-intersection subproblem. If the minimum violation is nonnegative, the fiber lies fully inside Λ(B) and pointwise sufficiency holds. Otherwise, a witness point c^out ∈ C_k may violate several facet inequalities of Λ(B). The facet-hit rule identifies the first facet of Λ(B) encountered along the segment [c^in, c^out], and thus guarantees the existence of a boundary point c^hit ∈ C_k ∩ Λ(B) with (c^hit)^⊤ δ(B, j⋆) = 0 (Lemma 18), which is what makes the new query direction decision-relevant. A concrete counterexample showing that an arbitrary violated facet can fail is given below.

Example 15 (A counterexample motivating the facet-hit rule). Let X = [0, 1]^2 and consider the vertex x = (0, 0), whose optimality cone is Λ = {c ∈ R^2 : c_1 ≥ 0, c_2 ≥ 0}, with facet hyperplanes c_1 = 0 and c_2 = 0. Let ε ∈ (0, 1) and define c^in = (1, ε) and c^out = (−1, −1). Consider any convex fiber C_k that contains the segment conv{c^in, c^out} (e.g., take C_k to be exactly this segment).
Then c^out violates both inequalities c_1 ≥ 0 and c_2 ≥ 0. Along the segment c_α = (1 − α) c^in + α c^out, the coordinate c_{2,α} hits 0 at a very small α while c_{1,α} is still strictly positive; thus C_k ∩ Λ reaches the boundary only through the facet c_2 = 0. In contrast, c_α intersects c_1 = 0 only after c_{2,α} has already become negative, i.e., outside Λ. Therefore, querying the normal of the "wrong" violated facet (c_1 = 0) does not correspond to a boundary that the fiber can reach while keeping x optimal. The facet-hit rule avoids this issue by selecting the first facet reached from an interior anchor point.

Remark 16 (Picking an arbitrary c^in ∈ C_k). Algorithm 1 is stated in a fixed-anchor form (we set c^in := c at each iteration). The same facet-hit cutting-plane idea and all results in this section also apply when the cost vector c is unknown and one only has oracle access to inner products q^⊤c: in that case, line 4 may pick any anchor c^in ∈ C_k and then proceed identically.

Remark 17 (A deterministic tie-breaking convention). Throughout Algorithm 1, whenever a choice is non-unique, we further assume that ties are resolved according to a fixed deterministic rule.

5.4 Correctness and basic properties

We now record a few basic facts about Algorithm 1. The next two lemmas are technical but important for our later analysis in Section 6: Lemma 18 shows that every newly queried direction is genuinely decision-relevant, and Lemma 19 guarantees that the dataset grows with linearly independent directions. Theorem 20 shows the algorithm terminates and indeed certifies pointwise sufficiency.

Lemma 18. In Algorithm 1, any newly added direction q_{k+1} lies in ∆(X, C) ⊆ dir(X⋆(C)).

Proof. Because B is an optimal basis for c^in ∈ C_k, we have c^in ∈ Λ(B), i.e., (c^in)^⊤ δ(B, j) ≥ 0 for all j ∈ N.
Since we are in the else branch, there exists c^out ∈ C_k with (c^out)^⊤ δ(B, j_0) = m_min < 0 for at least one j_0. Define the segment c_α := (1 − α) c^in + α c^out, α ∈ [0, 1]. Because C_k is convex, c_α ∈ C_k for all α ∈ [0, 1]. By construction of α⋆ and j⋆, we have (c_{α⋆})^⊤ δ(B, j⋆) = 0 and (c_{α⋆})^⊤ δ(B, j) ≥ 0 for all j ∈ N (first-hit property). Thus c^hit := c_{α⋆} ∈ C_k ∩ Λ(B) and lies on the face Λ(B) ∩ {δ(B, j⋆)}^⊥. By Equation (2) (with x⋆ = x(B) and C = C_k), this implies δ(B, j⋆) ∈ ∆(X, C_k). Since C_k ⊆ C, we also have ∆(X, C_k) ⊆ ∆(X, C). Finally, by Theorem 3, δ(B, j⋆) ∈ span ∆(X, C) = dir(X⋆(C)), and in particular δ(B, j⋆) ∈ dir(X⋆(C)).

Lemma 19. In Algorithm 1, the queried directions are linearly independent. In particular, the algorithm makes at most d⋆ queries.

Proof. Assume for contradiction that q_{k+1} ∈ span(Q_k). Then (q_{k+1})^⊤ c′ is constant over C_k = {c′ ∈ C : Q_k^⊤ c′ = s_k}. In particular, (q_{k+1})^⊤ c^in = (q_{k+1})^⊤ c^out. But c^in ∈ Λ(B) implies (q_{k+1})^⊤ c^in ≥ 0, while the facet-hit rule guarantees (q_{k+1})^⊤ c^out < 0. This contradiction shows q_{k+1} ∉ span(Q_k), and therefore rank(Q_{k+1}) = rank(Q_k) + 1. Moreover, since d⋆ = dim(dir(X⋆(C))) and q_{k+1} ∈ dir(X⋆(C)) by Lemma 18, the algorithm makes at most d⋆ queries.

Theorem 20 (Correctness). Algorithm 1 terminates after at most d⋆ + 1 iterations and returns a dataset D that is pointwise sufficient at the (possibly unknown) c.

Proof. At termination we have m_min ≥ 0, hence for every j ∈ N,

min_{c′ ∈ C_k} (c′)^⊤ δ(B, j) ≥ 0  ⟹  (c′)^⊤ δ(B, j) ≥ 0 for all c′ ∈ C_k.

By (4), this implies C_k ⊆ Λ(B). Therefore the fixed decision x(B) is optimal for every c′ ∈ C_k = C(D, s(c; D)), so D is pointwise sufficient at c. By Lemma 19, the algorithm makes at most d⋆ new queries.
Since each non-terminating iteration adds exactly one new query direction, the while loop executes at most d⋆ + 1 iterations.

Property 21. Assume that C is either (i) a polytope given in H-representation, or (ii) an ellipsoid. Then Algorithm 1 can be implemented to run in time polynomial in the input size. In particular, it makes at most d⋆ oracle queries of the form q^⊤c and solves at most (d⋆ + 1) LPs over X and (d⋆ + 1)(d − m) face-intersection subproblems (LPs in case (i), closed-form evaluations in case (ii)).

6 Learning from Distributional Data

Section 5 constructs a pointwise-sufficient query set for a single cost vector c. We now move to a data-driven regime in which the LP (1) is solved repeatedly with random c ∼ P_c supported on C, and we observe i.i.d. samples c_1, . . . , c_n. Our goal is to learn a compressed, decision-sufficient representation: a small set of query directions that is pointwise sufficient for a fresh draw from P_c with high probability, together with a distribution-free certificate on its failure probability. For a given dataset D of query directions, define the 0–1 sufficiency loss

ℓ(D, c) := 1{D is not pointwise sufficient at c}.   (6)

We aim to output a dataset D with small risk R(D) := P_{c∼P_c}[ℓ(D, c) = 1].

6.1 A cumulative algorithm

We now run the pointwise cutting-plane routine sequentially on each training sample c_i and accumulate the queried directions. Algorithm 2 initializes with an empty dataset and, for i = 1, . . . , n, invokes Algorithm 1 on c_i using the current dataset as a warm start. If c_i is already certified by the current dataset, nothing changes; otherwise, new directions are added until c_i becomes pointwise sufficient. We call an index i hard if processing c_i adds at least one new direction, i.e., D_i ≠ D_{i−1}.

Algorithm 2 Learning sufficient decision datasets over samples
Input: Prior C, LP data (A, b), i.i.d. samples c_1, . . . , c_n (via oracle access to q^⊤ c_i).
1: Initialize dataset D_0 ← ∅, hard index set T ← ∅.
2: for i = 1 to n do
3:   Run Algorithm 1 on c_i with initialization D_init = D_{i−1}. Let the returned dataset be D_i.
4:   if D_i ≠ D_{i−1} then
5:     Mark i as hard: T ← T ∪ {i}.
6:   end if
7: end for
8: Return final dataset D_n and hard set T.

The next three lemmas formalize the learning-theoretic structure underlying this cumulative procedure. Together, they show that Algorithm 2 induces a stable, realizable sample compression scheme (Hanneke and Kontorovich, 2021, Definitions 7–8) of compression size at most d⋆. This is the key mechanism we use in the next subsection to obtain a distribution-free fast-rate certificate on the true failure probability R(D_n) = P_{c∼P_c}[ℓ(D_n, c) = 1].

Lemma 22 (Realizability). Algorithm 2 returns D_n with ℓ(D_n, c_i) = 0 for all i = 1, . . . , n.

Proof. When processing c_i, the inner run of Algorithm 1 certifies pointwise sufficiency for c_i, so ℓ(D_i, c_i) = 0. For t > i, the cumulative procedure only enlarges the dataset, D_t ⊇ D_i, and Property 8(i) implies ℓ(D_t, c_i) = 0 for all later t, hence also at t = n.

Lemma 23 (Stability). The final dataset D_n returned by Algorithm 2 is fully determined by the compressed subsequence (c_i)_{i∈T}, independent of the remaining samples.

Proof. By the deterministic tie-breaking convention (Remark 17), Algorithm 1 defines a deterministic update map, and hence Algorithm 2 is a deterministic function of the sample sequence. In Algorithm 2, the dataset only changes at indices in T, by definition. Removing an index t ∉ T removes an iteration that would have run Algorithm 1 with initialization D_{t−1} and returned the same dataset D_t = D_{t−1}. Therefore, deleting all non-hard iterations leaves the dataset entering each hard iteration unchanged, and thus reproduces the same sequence of dataset updates and the same final dataset.
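The warm-start accumulation and hard-index bookkeeping of Algorithm 2 can be sketched in a few lines. Since a full implementation of Algorithm 1 requires an LP solver and FI oracle, the function `pointwise_sdd_stub` below is a hypothetical stand-in (our own toy rule: it demands the coordinate direction e_j for every nonzero coordinate of the sample); only the outer accumulation loop is the point of this sketch:

```python
import numpy as np

def pointwise_sdd_stub(c, D_init):
    """Stand-in for Algorithm 1 (hypothetical toy rule): warm-starts from
    D_init and ensures e_j is present for every nonzero coordinate of c."""
    D = list(D_init)
    present = {tuple(q) for q in D}
    for j in np.flatnonzero(c):
        e = np.zeros(len(c))
        e[j] = 1.0
        if tuple(e) not in present:
            D.append(e)
            present.add(tuple(e))
    return D

def cumulative_sdd(samples, pointwise_routine):
    """Algorithm 2 skeleton: accumulate directions, record hard indices."""
    D, hard = [], []
    for i, c in enumerate(samples):
        D_new = pointwise_routine(c, D)
        if len(D_new) > len(D):   # at least one new direction was queried
            hard.append(i)
        D = D_new
    return D, hard

samples = [np.array([1.0, 0.0, 0.0]),
           np.array([2.0, 0.0, 0.0]),   # certified by the warm start: not hard
           np.array([0.0, 3.0, 0.0])]
D_n, T = cumulative_sdd(samples, pointwise_sdd_stub)
print(len(D_n), T)   # 2 [0, 2]
```

Note how the second sample adds nothing: its required direction was already discovered, which is exactly the stability property exploited in Lemma 23.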
Lemma 24 (Compression size bound). Under Assumption 12, |T| ≤ |D_n| ≤ d⋆.

Proof. By Lemma 19, each time Algorithm 1 appends a new query, that direction is linearly independent of the previously queried directions in that run. Since Algorithm 2 warm-starts each run at D_{i−1}, this implies by induction over i = 1, . . . , n that the cumulative datasets D_i remain linearly independent. By definition, i ∈ T implies D_i ≠ D_{i−1}, hence at least one new independent direction was added while processing c_i, so |T| ≤ |D_n|. Finally, by Lemma 18, every query direction that Algorithm 1 can append lies in dir(X⋆(C)). Since D_n is the union of all appended directions across the cumulative run, we have D_n ⊆ dir(X⋆(C)). Therefore, linear independence yields |D_n| = dim span(D_n) ≤ dim dir(X⋆(C)) = d⋆.

6.2 A distribution-free certificate via stable compression

We can now invoke the stable-compression generalization bound of Hanneke and Kontorovich (2021, Corollary 11) to certify the true failure probability R(D_n) = P_{c∼P_c}[ℓ(D_n, c) = 1]. For an alternative viewpoint, see Campi and Garatti (2023) and Paccagnan et al. (2025).

Theorem 25 (Certificate via stable compression). For any δ ∈ (0, 1), with probability at least 1 − δ over the draw of c_1, . . . , c_n, the output (D_n, T) of Algorithm 2 satisfies

R(D_n) ≤ (4/n)(6|T| + ln(e/δ)) ≤ (4/n)(6 d⋆ + ln(e/δ)),   (7)

where the second inequality uses |T| ≤ d⋆ from Lemma 24.

Proof. We apply the stable sample compression bound of Hanneke and Kontorovich (2021, Corollary 11). To be specific, we view any queried dataset D as inducing a binary prediction rule h_D : C → {0, 1} defined by h_D(c) := ℓ(D, c) ∈ {0, 1}. Thus R(D) = Pr_{C∼P_c}[h_D(C) = 1] = Pr_{C∼P_c}[ℓ(D, C) = 1] is exactly the associated 0–1 risk.
Equivalently, one may regard this as a supervised distribution over (C, Y) with C ∼ P_c and Y = 0 almost surely. Let S = (c_1, . . . , c_n) be the training sequence and let (D_n, T) be the output of Algorithm 2 on S. Define a compression function κ by letting κ(S) be the subsequence of hard samples (c_i)_{i∈T} (in their original order). Define a reconstruction function ρ that maps any subsequence S′ to the prediction rule h_{D′} induced by the final dataset D′ returned by running Algorithm 2 on S′. By Lemma 23, running Algorithm 2 on κ(S) reproduces the same final dataset D_n, so ρ(κ(S)) = h_{D_n}. Moreover, Lemma 23 also implies the stability property of Hanneke and Kontorovich (2021, Definition 8): removing any subset of the non-hard samples (i.e., elements of S \ κ(S)) does not affect the reconstructed output. Lemma 22 gives ℓ(D_n, c_i) = 0 for all i = 1, . . . , n, i.e., the empirical 0–1 risk of ρ(κ(S)) = h_{D_n} on S is zero. Applying Corollary 11 of Hanneke and Kontorovich (2021), with probability at least 1 − δ over the draw of S ∼ P_c^n,

R(D_n) = R(h_{D_n}) = R(ρ(κ(S))) ≤ (4/n)(6|κ(S)| + ln(e/δ)) = (4/n)(6|T| + ln(e/δ)).

Finally, Lemma 24 implies |T| ≤ d⋆, which gives the second inequality of Theorem 25.

Equation (7) gives the fast-rate certificate R(D_n) ≤ O((d⋆ + ln(1/δ))/n). Recent data-driven projection approaches for LPs learn a low-dimensional embedding that preserves feasibility and objective values approximately, and then control a downstream error via uniform-convergence-style analyses; this leads to the characteristic 1/√n dependence on sample size and to complexity terms tied to the pseudo-dimension of the learned projection family (Balcan et al., 2024a; Sakaue and Oki, 2024). Our guarantee is complementary.
We target decision sufficiency (a binary property) and exploit polyhedral geometry to ensure that only decision-relevant directions can ever be queried. This yields a stable compression scheme and hence Õ(d⋆/n) certificates that depend on the intrinsic dimension d⋆. This fast rate is also characteristic of realizability; in agnostic settings where zero empirical loss is unattainable, compression-based methods are subject to Ω(√(k log(n/k)/n)) lower bounds (Hanneke and Kontorovich, 2019), where k is the compression size.

Finally, it is natural to ask whether the dependence on d⋆ and n can be improved. The next theorem shows that, at least for the concrete cumulative procedure in Algorithm 2, the fast-rate certificate is tight up to constants: one needs n = Ω(d⋆/ε) samples to drive the pointwise-sufficiency failure probability below ε with constant confidence.

Theorem 26. Fix any integer d⋆ ≥ 2 and any ε ∈ (0, 1/4). There exist an ambient dimension d ≥ d⋆, a nondegenerate LP polytope X ⊆ R^d, a convex uncertainty set C ⊆ R^d with dim(dir(X⋆(C))) = d⋆, and a distribution P_c supported on C such that the following holds. If D_n is the output of Algorithm 2 on n i.i.d. samples from P_c and n ≤ (d⋆ − 1)/(8ε), then

P(R(D_n) > ε) ≥ 1/2.

The proof of Theorem 26 and additional technical details appear in Appendix C. At a high level, the proof follows the classical "rare-types" construction used to obtain lower bounds for realizable sample-compression schemes (Littlestone and Warmuth, 1986). However, we cannot simply import a generic compression lower bound as a black box, because here the compression map is not arbitrary: it must arise from LP optimality geometry through Algorithm 1.
Accordingly, we explicitly construct a linear program, a convex prior set C with dim(dir(X⋆(C))) = d⋆, and a distribution P_c over c such that each rare event forces the discovery of a distinct decision-relevant direction.

7 Application: Model Compression for Contextual Linear Optimization

In this section, we illustrate how decision-sufficient representations yield a principled model-compression guarantee for contextual linear optimization (CLO). Let (ξ, c) ∼ P, where ξ ∈ Ξ ⊆ R^p is a context and c ∈ C ⊆ R^d (a.s.) is the cost vector of a downstream linear program over a known bounded polytope X ⊆ R^d. We assume that C is convex. Let P_c denote the marginal distribution of c. The goal in CLO is to train a predictor f : Ξ → R^d from contextual data so that the induced plug-in decisions x⋆(f(ξ)) achieve low out-of-sample loss. Throughout, we fix a deterministic oracle x⋆ : R^d → X∠ such that x⋆(v) ∈ arg min_{x∈X} v^⊤ x for all v (e.g., with a lexicographic tie-breaking rule).

Ellipsoidal prior with a canonical lifting map. Throughout this section, we specialize the prior set C to an ellipsoid

C := {c ∈ R^d : (c − c_0)^⊤ Σ^{-1} (c − c_0) ≤ 1},  for some Σ ≻ 0 and c_0 ∈ R^d.

Let W ⊆ R^d be any t-dimensional subspace with orthonormal basis U ∈ R^{d×t}. We define the lifting matrix L_U := Σ U (U^⊤ Σ U)^{-1} and the corresponding canonical lifting map (Appendix D.1)

lift_U(s) := c_0 + L_U s,  s ∈ R^t.   (8)

The lifting map is introduced for a simple but important reason: to use a low-dimensional prediction in the original optimization problem, we must map it back to a full cost vector in R^d. When C is not centered at the origin, a purely linear compressed predictor would typically have range contained in a linear subspace through the origin and hence may not lie in C. The canonical lifting map provides a principled way to "complete" a low-dimensional coordinate into a cost vector that is feasible for the prior set C.
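A useful sanity check on the lifting map (8) is the identity U^⊤ L_U = U^⊤ Σ U (U^⊤ Σ U)^{-1} = I_t, so projecting a lifted coordinate back onto W recovers it exactly. A minimal numerical sketch (Σ, c_0, and U below are randomly generated and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, t = 6, 2

# Illustrative ellipsoid prior: Sigma positive definite, center c0.
M = rng.standard_normal((d, d))
Sigma = M @ M.T + d * np.eye(d)
c0 = rng.standard_normal(d)

# Orthonormal basis U of a t-dimensional subspace W.
U, _ = np.linalg.qr(rng.standard_normal((d, t)))

# Canonical lifting map, Eq. (8): lift_U(s) = c0 + Sigma U (U^T Sigma U)^{-1} s.
L_U = Sigma @ U @ np.linalg.inv(U.T @ Sigma @ U)
lift_U = lambda s: c0 + L_U @ s

# Key property: U^T (lift_U(s) - c0) = s for every s in R^t.
s = rng.standard_normal(t)
print(np.allclose(U.T @ (lift_U(s) - c0), s))  # True
```

This right-inverse property is what lets the pipeline predict only a t-dimensional coordinate and still recover a well-defined full cost vector anchored at c_0.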
Two-stage pipeline. Our deployment follows a two-stage pipeline that separates decision-sufficient representations from predictor training. In Stage I, we construct a dataset D̂ that is decision-sufficient on the prior set C, and then form the induced subspace Ŵ := span(D̂) with orthonormal basis Û. In Stage II, we draw independent contextual samples S := {(ξ_i, c_i)}_{i=1}^n ∼ P and train a predictor that first predicts a coordinate in the learned subspace Ŵ.

To highlight the role of the decision-relevant subspace, we first analyze Stage II under an idealized assumption that Stage I has recovered the global decision-relevant subspace. Concretely, let D be a minimal-size global sufficient dataset on C and define W⋆ := span(D) with intrinsic dimension d⋆ := dim(W⋆). Let U⋆ ∈ R^{d×d⋆} be an orthonormal basis for W⋆. In Section 7.3, we then show how to implement Stage I from contextual samples (ξ, c) via conditional-mean regression and Algorithm 2, yielding an additional representation-estimation term.

7.1 SPO training in a decision-sufficient subspace

Given a predictor ĉ = f(ξ), the plug-in decision is x⋆(ĉ) and the SPO loss is

ℓ_SPO(ĉ, c) := c^⊤ x⋆(ĉ) − c^⊤ x⋆(c) ≥ 0.

The corresponding SPO risk of a predictor f : Ξ → R^d is R_SPO(f) := E_{(ξ,c)∼P}[ℓ_SPO(f(ξ), c)] = E_{(ξ,c)∼P}[c^⊤ x⋆(f(ξ)) − c^⊤ x⋆(c)]. In practice, given i.i.d. samples S = {(ξ_i, c_i)}_{i=1}^n in Stage II, we train θ by minimizing either the empirical SPO risk R̂_SPO(f) := (1/n) Σ_{i=1}^n ℓ_SPO(f(ξ_i), c_i) or its convex surrogate R̂_SPO+(f) := (1/n) Σ_{i=1}^n ℓ_SPO+(f(ξ_i), c_i). Following Elmachtoub and Grigas (2022), the convex surrogate is

ℓ_SPO+(ĉ, c) := max_{x∈X} (c − 2ĉ)^⊤ x + 2 ĉ^⊤ x⋆(c) − c^⊤ x⋆(c).

Let x_0 = x⋆(c) and let x_1 ∈ arg max_{x∈X} (c − 2ĉ)^⊤ x, equivalently x_1 = x⋆(2ĉ − c).
By Danskin's theorem,

2(x_0 − x_1) ∈ ∂_ĉ ℓ_SPO+(ĉ, c),   (9)

so each stochastic subgradient step requires at most two oracle calls, at c and at 2ĉ − c. Focusing on the decision-sufficient subspace, one can parametrize a cost predictor by first predicting a d⋆-dimensional centered coordinate g_θ : Ξ → R^{d⋆} and then lifting it back to R^d via

ĉ_θ(ξ) = lift_{U⋆}(g_θ(ξ)) = c_0 + L_{U⋆} g_θ(ξ),   L_{U⋆} := Σ U⋆ (U⋆^⊤ Σ U⋆)^{-1}.

For linear coordinate models of the form g_θ(ξ) = B_θ ξ with B_θ ∈ R^{d⋆×p}, this reduces the number of trainable parameters from dp to d⋆p. Composing (9) with the affine lifting map and applying the subgradient chain rule yields a valid update in the compressed coordinates: for any choice of v ∈ ∂_ĉ ℓ_SPO+(ĉ_θ(ξ), c), the linearity of the lifting map implies

L_{U⋆}^⊤ v ∈ ∂_g ℓ_SPO+(c_0 + L_{U⋆} g, c) |_{g = g_θ(ξ)}.

If g_θ is differentiable in θ, then

(∇_θ g_θ(ξ))^⊤ L_{U⋆}^⊤ v ∈ ∂_θ ℓ_SPO+(ĉ_θ(ξ), c),   v := 2( x⋆(c) − x⋆(2(c_0 + L_{U⋆} g_θ(ξ)) − c) ).

For the linear coordinate model g_θ(ξ) = B_θ ξ, a valid stochastic subgradient for B_θ at sample (ξ, c) is (L_{U⋆}^⊤ v) ξ^⊤. Algorithm 3 summarizes the resulting Stage II training routine.

Algorithm 3 Compressed SPO+ training (Stage II)
Input: Compressed dataset D̂, basis U ∈ R^{d×t} for Ŵ = span(D̂), stepsizes {η_k}_{k≥0}.
1: Compute L_U ← Σ U (U^⊤ Σ U)^{-1} and define lift_U(s) = c_0 + L_U s.
2: for k = 0, 1, 2, . . . do
3:   Sample (ξ_k, c_k) uniformly from S.
4:   Predict a centered coordinate in R^t: ŝ_k ← g_θ(ξ_k).
5:   Lift to R^d: ĉ_k ← lift_U(ŝ_k) = c_0 + L_U ŝ_k.
6:   x_0 ← x⋆(c_k), x_1 ← x⋆(2ĉ_k − c_k).
7:   v_k ← 2(x_0 − x_1).
8:   θ ← θ − η_k (∇_θ g_θ(ξ_k))^⊤ L_U^⊤ v_k.
9: end for

7.2 Improved generalization bound

Consider the compressed affine-linear hypothesis class H_{U⋆,d⋆} := {f_B(ξ) = c_0 + L_{U⋆} B ξ : B ∈ R^{d⋆×p}}.
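For concreteness, one iteration of the compressed SPO+ update (Algorithm 3, with the linear coordinate model g_θ(ξ) = Bξ) can be sketched as follows. The LP oracle x⋆(·) is implemented here with scipy.optimize.linprog, and the instance data (a simplex feasible set, Σ = I, a one-dimensional U) is purely illustrative, not from the paper:

```python
import numpy as np
from scipy.optimize import linprog

def lp_oracle(v, A_eq, b_eq):
    """x*(v) = argmin_{Ax=b, x>=0} v^T x (tie-breaking left to the solver)."""
    res = linprog(v, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.x

# Toy instance: simplex in R^3, p = 2 contexts, Sigma = I, subspace U = e1,
# prior centered at the origin (so L_U = U).
A_eq, b_eq = np.array([[1.0, 1.0, 1.0]]), np.array([1.0])
c0 = np.zeros(3)
L_U = np.array([[1.0], [0.0], [0.0]])
B = np.zeros((1, 2))                     # compressed linear coordinate model
xi = np.array([1.0, -1.0])               # context
c = np.array([0.5, 1.0, 2.0])           # realized cost

c_hat = c0 + L_U @ (B @ xi)              # lifted prediction
x0 = lp_oracle(c, A_eq, b_eq)            # x*(c)
x1 = lp_oracle(2 * c_hat - c, A_eq, b_eq)  # x*(2 c_hat - c)
v = 2 * (x0 - x1)                        # SPO+ subgradient in c_hat, Eq. (9)
grad_B = np.outer(L_U.T @ v, xi)         # chain rule through the lifting map
eta = 0.1
B = B - eta * grad_B                     # one SGD step on B
print(grad_B.shape)                      # (1, 2)
```

Note that only the t-by-p matrix B is updated; the two LP oracle calls per step match the count stated after Equation (9).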
We define the SPO range constant ω_X(C) := sup_{c∈C} ( max_{x∈X} c^⊤ x − min_{x∈X} c^⊤ x ). With these notations in place, we can state the following guarantee for Stage II training.

Theorem 27. Let H_{U⋆,d⋆} be as above and define the C-valued affine-linear classes

H(C) := {f_A(ξ) = c_0 + Aξ : A ∈ R^{d×p}, f_A(ξ) ∈ C a.s.},   H_{U⋆,d⋆}(C) := {f ∈ H_{U⋆,d⋆} : f(ξ) ∈ C a.s.}.

(1) No misspecification loss. Let f⋆ ∈ arg min_{f∈H(C)} R_SPO(f) and define its compressed version

f̂⋆(ξ) := lift_{U⋆}( U⋆^⊤ (f⋆(ξ) − c_0) ) = c_0 + L_{U⋆} U⋆^⊤ (f⋆(ξ) − c_0) ∈ H_{U⋆,d⋆}(C).

Then f̂⋆ ∈ arg min_{f∈H_{U⋆,d⋆}(C)} R_SPO(f) and R_SPO(f⋆) = R_SPO(f̂⋆).

(2) Improved generalization bound in the compressed class. For any δ ∈ (0, 1), with probability at least 1 − δ over S, the following bound holds simultaneously for all f ∈ H_{U⋆,d⋆}:

R_SPO(f) ≤ R̂_SPO(f) + 2 ω_X(C) √( 2(d⋆p + 1) log(n |X∠|²) / n ) + ω_X(C) √( log(1/δ) / (2n) ).   (10)

Proof sketch. The proof has three ingredients: (i) we bound the complexity of the induced decision class x⋆ ∘ H_{U⋆,d⋆} via its Natarajan dimension; (ii) we plug this bound into the Natarajan-dimension generalization theorem for the SPO loss due to El Balghiti et al. (2023); and (iii) we show that under a global sufficient dataset, projecting to W⋆ and lifting back is decision-preserving, so restricting to the compressed class incurs no approximation error. A complete proof is in Appendix D.2.

Compared to El Balghiti et al. (2023), the bound in (10) replaces the ambient dimension d in the dominant Õ(1/√n) generalization error term by the decision-relevant intrinsic dimension d⋆.

7.3 Learning decision-sufficient representations from contextual samples

So far, our Stage II analysis assumed access to the global decision-relevant subspace W⋆.
We now explain how Stage I can be implemented from labeled contextual samples {(ξ_i, c_i)}_{i=1}^N and quantify the resulting representation-estimation error.

Recall that for contextual linear optimization under the SPO loss, the Bayes-optimal decision rule is x⋆(μ(ξ)), where μ(ξ) := E[c | ξ]. Since C is convex and c ∈ C a.s., we have μ(ξ) ∈ C a.s. If we could draw i.i.d. samples from the (unobserved) distribution of μ(ξ), then we could run Algorithm 2 on those samples and, by our certificate guarantee (Theorem 25), obtain a dataset that is pointwise sufficient with high probability under the distribution of μ(ξ). In practice, we only observe noisy costs c, so we proceed in two steps: (i) estimate μ from contextual samples via a C-valued regression model μ̂, and (ii) treat the predictions μ̂(ξ) on fresh contexts as pseudo-cost samples and run Algorithm 2. The process is summarized in Algorithm 4.

Algorithm 4 Learning a decision-sufficient representation from contextual samples (Stage I)
Input: Regression sample {(ξ_i, c_i)}_{i=1}^{n_μ}, discovery contexts {ξ_j^disc}_{j=1}^{n_disc}.
1: Fit a regression model μ̂ for μ(ξ) := E[c | ξ] (e.g., the centered linear model discussed below).
2: Form pseudo-costs ĉ_j ← μ̂(ξ_j^disc) for j = 1, . . . , n_disc.
3: Run Algorithm 2 on {ĉ_j}_{j=1}^{n_disc} and return (D̂, T).

A centered linear conditional-mean model. A convenient choice for μ̂ is multi-response ordinary least squares (OLS). Because the prior set is the shifted ellipsoid C = {c : (c − c_0)^⊤ Σ^{-1} (c − c_0) ≤ 1}, it is natural to write the conditional-mean model in centered form around c_0:

c − c_0 = A_μ ξ + ϵ,   E[ϵ | ξ] = 0,   μ(ξ) = E[c | ξ] = c_0 + A_μ ξ.   (11)

Equivalently, we regress the centered response y := c − c_0 onto ξ, and then set μ̂(ξ) := c_0 + Â_μ ξ.
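The centered multi-response OLS fit in Equation (11) is a single least-squares solve. A minimal sketch on synthetic data (the ground-truth A_μ, noise level, and sample sizes below are illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
d, p, n = 4, 3, 500

# Synthetic instance of the centered linear model (11): c - c0 = A_mu xi + eps.
A_mu = rng.standard_normal((d, p))
c0 = rng.standard_normal(d)
Xi = rng.standard_normal((n, p))                       # rows are contexts xi_i
C = c0 + Xi @ A_mu.T + 0.01 * rng.standard_normal((n, d))  # noisy costs c_i

# Multi-response OLS on the centered responses y_i = c_i - c0:
# solves min_W ||Xi W - (C - c0)||_F^2, then transposes so mu_hat = c0 + A_hat xi.
W, *_ = np.linalg.lstsq(Xi, C - c0, rcond=None)
A_hat = W.T

mu_hat = lambda xi: c0 + A_hat @ xi                    # fitted conditional mean
err = np.linalg.norm(A_hat - A_mu)
print(err < 0.1)                                       # True: A_hat is accurate
```

The fitted μ̂ is then evaluated on fresh discovery contexts to form the pseudo-costs of Algorithm 4.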
We quantify the regression accuracy via the mean-squared prediction error ε_μ² := E_ξ ∥μ̂(ξ) − μ(ξ)∥₂². For the sharpened Stage I analysis below, we will use a stronger high-probability control: under bounded-design OLS, Appendix D.3 (Lemma 45) provides simultaneous bounds on ∥Â_μ − A_μ∥_F, on the uniform prediction radius sup_{∥ξ∥₂≤1} ∥μ̂(ξ) − μ(ξ)∥₂, and hence on ε_μ.

A lifted compressed predictor. Let Ŵ := span(D̂) and let Û ∈ R^{d×t} be an orthonormal basis of Ŵ, where t := dim(Ŵ). Define the lifted compressed conditional-mean predictor

μ̃(ξ) := lift_Û( Û^⊤ (μ̂(ξ) − c_0) ) = c_0 + L_Û Û^⊤ (μ̂(ξ) − c_0).   (12)

Under the centered linear model (11), we have μ̂(ξ) − c_0 = Â_μ ξ by the definition of μ̂, hence

μ̃(ξ) = c_0 + L_Û (Û^⊤ Â_μ) ξ.   (13)

Therefore μ̃ belongs to the compressed affine-linear class H_{Û,t} := {f_B(ξ) = c_0 + L_Û B ξ : B ∈ R^{t×p}}.

A cone-boundary margin condition. The oracle x⋆(·) is locally constant on the interior of each optimality cone of X and may change discontinuously on boundaries between cones. Define the cone-boundary set

B_X := {c ∈ R^d : ∃ x ≠ x′ ∈ X∠ s.t. x, x′ ∈ arg min_{x∈X} c^⊤ x}.

Our transfer argument from regression error to decision error relies on the standard margin assumption:

Assumption 28 (Margin condition). There exist constants C_marg > 0 and α > 0 such that, for all η > 0,

P_ξ[ dist(μ(ξ), B_X) ≤ η ] ≤ C_marg η^α.

Moreover, almost surely over the realized Stage I samples, the Stage I regression predictor satisfies P_ξ[μ̂(ξ) ∈ C] = 1 and P_ξ[μ̃(ξ) ∈ B_X] = 0.

Stage I representation error. The next theorem bounds the additional decision loss introduced by learning the representation from contextual samples.
Under the bounded-design OLS conditions from Appendix D.3, the decision error decomposes into (i) a certificate error term due to running Algorithm 2 on finitely many pseudo-costs, and (ii) a regression-to-decision term controlled by a uniform prediction radius and the cone-boundary margin.

Theorem 29 (Stage I representation error under bounded-design OLS). Suppose Stage I runs Algorithm 4 with discovery sample size n_disc and returns (D̂, T), where T is the compression subsequence produced by Algorithm 2. Under Assumption 28, suppose in addition that the bounded-design OLS conditions of Lemma 45 hold. Fix δ_μ, δ ∈ (0, 1) and define

r_{μ,δ_μ} := C_reg · σ √κ · √( d (p + log(4d/δ_μ)) / n_μ ).

Then, with probability at least 1 − δ_μ − δ over the regression sample and the i.i.d. discovery contexts,

P_ξ[ x⋆(μ̃(ξ)) ≠ x⋆(μ(ξ)) ] ≤ (4/n_disc)(6|T| + log(e/δ)) + C_marg r_{μ,δ_μ}^α.   (14)

In particular, since |T| ≤ |D̂| ≤ d⋆,

(4/n_disc)(6|T| + log(e/δ)) ≤ (4/n_disc)(6 d⋆ + log(e/δ)).

Moreover, letting f⋆(ξ) := μ(ξ) be a Bayes-optimal SPO predictor and letting f̂⋆ ∈ arg min_{f∈H_{Û,t}} R_SPO(f) be the best predictor restricted to the learned representation, we have the representation error bound

0 ≤ R_SPO(f̂⋆) − R_SPO(f⋆) ≤ ω_X(C) · [ (4/n_disc)(6 d⋆ + log(e/δ)) + C_marg r_{μ,δ_μ}^α ].   (15)

Proof idea. The certificate term in (14) follows by applying Theorem 25 to the pseudo-cost sample {ĉ_j = μ̂(ξ_j^disc)}_{j=1}^{n_disc}. The regression-to-decision term uses the tail-form transfer bound proved in Appendix D.4 together with the uniform prediction-radius bound (27) from Lemma 45: on the OLS high-probability event, we take η = r_{μ,δ_μ} so that the regression tail vanishes and only the cone-boundary margin term remains. The complete proof is in Appendix D.4.

Stage I representation error rate under OLS.
In Algorithm 4, we use $n_\mu$ contextual samples for regression and $n_{\mathrm{disc}}$ contextual samples for discovery. Under the centered linear model (11), Theorem 29 yields an explicit bound for the representation term in (14): up to logarithmic factors, the additional Stage I representation error scales as $\widetilde O\big(n_\mu^{-\alpha/2} + n_{\mathrm{disc}}^{-1}\big)$. Let $n_{\mathrm I} := n_\mu + n_{\mathrm{disc}}$ denote the total sample size used in Stage I. Under a constant-fraction split (e.g., $n_\mu = \lfloor n_{\mathrm I}/2 \rfloor$ and $n_{\mathrm{disc}} = n_{\mathrm I} - n_\mu$), we have $n_\mu = \Theta(n_{\mathrm I})$ and $n_{\mathrm{disc}} = \Theta(n_{\mathrm I})$; hence this term simplifies to $\widetilde O\big(n_{\mathrm I}^{-1} + n_{\mathrm I}^{-\alpha/2}\big) = \widetilde O\big(n_{\mathrm I}^{-\min\{1,\, \alpha/2\}}\big)$. Therefore, if $\alpha > 1$ and Stages I and II use comparable sample sizes (e.g., $n_{\mathrm I} = \Theta(n_{\mathrm{train}})$), the additional Stage I representation error is lower-order than the $n_{\mathrm{train}}^{-1/2}$ generalization term in Stage II. Consequently, the overall statistical rate is governed by Stage II, whose dominant term depends on the intrinsic dimension $d^\star$ rather than the ambient dimension $d$; see Theorem 27. This improves sample efficiency for CLO. Moreover, our decision-sufficient representation learning framework reduces the number of trainable parameters in Stage II from $dp$ to $d^\star p$ as an additional advantage.

7.4 Numerical experiment

We provide a small synthetic shortest-path CLO experiment to illustrate the sample-efficiency gains suggested by the intrinsic $d^\star$ dependence in Theorem 27. We consider a monotone shortest-path instance on a $5 \times 5$ grid ($g = 5$), so the cost dimension is $d = 2g(g-1) = 40$. The feasible polytope has $|\mathcal X^\angle| = \binom{2(g-1)}{g-1} = 70$ extreme points, each corresponding to a monotone path. Contexts are drawn i.i.d. as $\xi \sim \mathcal N(0, I_p)$ with $p = 5$. We take $\mathcal C = \{c \in \mathbb{R}^d : \|c - c_0\|_2 \le 1\}$, where $c_0$ assigns cost 10 on a fixed low-cost corridor and 100 elsewhere. This forces all shortest paths to remain within the corridor, and by enumeration on the $5 \times 5$ grid, the resulting intrinsic dimension is $d^\star = 7$.
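The instance sizes quoted above follow from standard counts on the monotone grid; a quick sketch, assuming the usual identification of monotone paths with lattice paths:

```python
# Sanity-check the instance sizes for the g x g monotone grid: the number of
# (rightward/downward) arcs is 2g(g-1), and monotone s-t paths correspond to
# lattice paths with g-1 steps in each direction, counted by a binomial coefficient.
from math import comb

g = 5
d = 2 * g * (g - 1)                   # arcs: g rows of (g-1) plus g columns of (g-1)
n_extreme = comb(2 * (g - 1), g - 1)  # monotone paths = C(2(g-1), g-1)
print(d, n_extreme)  # 40 70
```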
We compare (i) full-$d$ SPO+: a linear predictor $\hat c(\xi) \in \mathbb{R}^d$ trained by SGD on the SPO+ surrogate, and (ii) ours (learn $\hat{\mathcal W}$ then SPO+): first learn a subspace $\hat{\mathcal W}$ online from observed contexts and costs, and then train a reduced predictor in the learned subspace. We use $n_{\mathrm{train}} = 300$ labeled context–cost pairs for Stage II and an independent test set of size $n_{\mathrm{test}} = 2000$, repeating over 10 random trials and reporting mean $\pm$ 90% confidence intervals. Stage I: Figure 1 reports the learned dimension $t = \dim(\hat{\mathcal W})$. Stage II: Figure 2 plots the test SPO risk versus the number of labeled samples used to train the predictor. Consistent with our theory, restricting training to the learned subspace improves performance at a fixed sample size, and $\hat{\mathcal W}$ quickly stabilizes near the true intrinsic dimension $d^\star = 7$.

Figure 1: Stage I. Learned dimension $t = \dim(\hat{\mathcal W})$ (mean over 10 trials).

Figure 2: Stage II. SPO risk vs. number of labeled samples (mean $\pm$ 90% CIs over 10 trials).

8 Concluding Remarks and Open Problems

Our current hardness results for computing the intrinsic decision-relevant dimension $d^\star$ (and hence constructing minimum-size global SDDs) rely on highly degenerate hard instances. It therefore remains open whether our hardness results persist under the nondegeneracy assumption on $\mathcal X$. Second, our framework focuses on the noiseless setting, where each linear measurement $q_i^\top c$ is observed exactly. Extending our algorithms to noisy observations and characterizing the resulting sample complexity are open directions. Intuitively, such an extension would likely require an additional margin condition on $\mathbb{P}_c$ that controls how often $c$ lies close to the boundary of an optimality cone. Finally, we restrict attention to linear optimization in this paper.
Extending the decision-sufficient representation framework to broader problem classes, such as mixed-integer or convex programs, is an important direction for future work, potentially via approximate notions of sufficiency.

References

Maria-Florina Balcan. Data-driven algorithm design. In Tim Roughgarden, editor, Beyond the Worst-Case Analysis of Algorithms, pages 626–645. Cambridge University Press, 2021.

Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 65–72, 2006.

Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. Journal of Computer and System Sciences, 75(1):78–89, 2009.

Maria-Florina Balcan, Vaishnavh Nagarajan, Ellen Vitercik, and Colin White. Learning-theoretic foundations of algorithm configuration for combinatorial partitioning problems. In Proceedings of the 30th Conference on Learning Theory (COLT), volume 65 of Proceedings of Machine Learning Research, pages 213–274. PMLR, 2017.

Maria-Florina Balcan, Travis Dick, and Ellen Vitercik. Dispersion for data-driven algorithm design, online learning, and private optimization. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 603–614. IEEE Computer Society, 2018.

Maria-Florina Balcan, Tuomas Sandholm, and Ellen Vitercik. Refined bounds for algorithm configuration: The knife-edge of dual class approximability. In Proceedings of the 37th International Conference on Machine Learning (ICML), volume 119 of Proceedings of Machine Learning Research, pages 580–590. PMLR, 2020.

Maria-Florina Balcan, Dan DeBlasio, Travis Dick, Carl Kingsford, Tuomas Sandholm, and Ellen Vitercik. How much data is sufficient to learn high-performing algorithms? Generalization guarantees for data-driven algorithm design. Journal of the ACM, 71(5):32:1–32:58, 2024a.
Maria-Florina Balcan, Travis Dick, Tuomas Sandholm, and Ellen Vitercik. Learning to branch: Generalization guarantees and limits of data-independent discretization. Journal of the ACM, 71(2):13:1–13:73, 2024b.

Peter Bartlett, Piotr Indyk, and Tal Wagner. Generalization bounds for data-driven numerical linear algebra. In Po-Ling Loh and Maxim Raginsky, editors, Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 2013–2040. PMLR, 2022.

Omar Bennouna, Amine Bennouna, Saurabh Amin, and Asuman Ozdaglar. What data enables optimal decisions? An exact characterization for linear optimization. In Advances in Neural Information Processing Systems, 2025a. NeurIPS 2025.

Omar Bennouna, Jiawei Zhang, Saurabh Amin, and Asuman E. Ozdaglar. Contextual optimization under model misspecification: A tractable and generalizable approach. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 3749–3775. PMLR, 2025b.

Omar Bennouna, Amine Bennouna, Saurabh Amin, and Asuman Ozdaglar. Data informativeness in linear optimization under uncertainty, 2026.

Yonatan Bilu and Nathan Linial. Are stable instances easy? Combinatorics, Probability and Computing, 21(5):643–660, 2012.

Avrim Blum and Joel Spencer. Coloring random and semi-random $k$-colorable graphs. Journal of Algorithms, 19(2):204–234, 1995.

Olivier Bousquet, Steve Hanneke, Shay Moran, and Nikita Zhivotovskiy. Proper learning, Helly number, and an optimal SVM bound. In Jacob Abernethy and Shivani Agarwal, editors, Proceedings of the 33rd Conference on Learning Theory (COLT), volume 125 of Proceedings of Machine Learning Research, pages 582–609. PMLR, 2020.

Marco C. Campi and Simone Garatti. Compression, generalization and learning. Journal of Machine Learning Research, 24(339):1–74, 2023.

David A.
Cohn, Zoubin Ghahramani, and Michael I. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145, 1996.

Sanjoy Dasgupta. Two faces of active learning. Theoretical Computer Science, 412(19):1767–1781, 2011.

Othman El Balghiti, Adam N. Elmachtoub, Paul Grigas, and Ambuj Tewari. Generalization bounds in the predict-then-optimize framework. Mathematics of Operations Research, 48(4):2043–2065, 2023.

Adam N. Elmachtoub and Paul Grigas. Smart “predict, then optimize”. Management Science, 68(1):9–26, 2022.

Sally Floyd and Manfred K. Warmuth. Sample compression, learnability, and the Vapnik–Chervonenkis dimension. Machine Learning, 21(3):269–304, 1995.

Robert M. Freund and James B. Orlin. On the complexity of four polyhedral set containment problems. Mathematical Programming, 33(2):139–145, 1985.

Aditya Gangrade, Tianrui Chen, and Venkatesh Saligrama. Safe linear bandits over unknown polytopes. In Shipra Agrawal and Aaron Roth, editors, Proceedings of the 37th Conference on Learning Theory (COLT), volume 247 of Proceedings of Machine Learning Research, pages 1755–1795. PMLR, 2024.

Thore Graepel, Ralf Herbrich, and John Shawe-Taylor. PAC-Bayesian compression bounds on the prediction error of learning algorithms for classification. Machine Learning, 59(1-2):55–76, 2005.

Peter Gritzmann and Victor Klee. Computational complexity of inner and outer j-radii of polytopes in finite-dimensional normed spaces. Mathematical Programming, 59(2):163–213, 1993.

Andrew Guillory and Jeff A. Bilmes. Average-case active learning with costs. In Algorithmic Learning Theory, 20th International Conference, ALT 2009, Proceedings, volume 5809 of Lecture Notes in Computer Science, pages 141–155. Springer, 2009.

Andrew Guillory and Jeff A. Bilmes. Interactive submodular set cover.
In Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21–24, 2010, Haifa, Israel, pages 415–422, 2010.

Rishi Gupta and Tim Roughgarden. A PAC approach to application-specific algorithm selection. SIAM Journal on Computing, 46(3):992–1017, 2017.

Rishi Gupta and Tim Roughgarden. Data-driven algorithm design. Communications of the ACM, 63(6):87–94, 2020.

Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131–309, 2014.

Steve Hanneke and Aryeh Kontorovich. A sharp lower bound for agnostic learning with sample compression schemes. In Proceedings of the 30th International Conference on Algorithmic Learning Theory, volume 98 of Proceedings of Machine Learning Research, pages 489–505. PMLR, 2019.

Steve Hanneke and Aryeh Kontorovich. Stable sample compression schemes: New applications and an optimal SVM margin bound. In Vitaly Feldman, Katrina Ligett, and Sivan Sabato, editors, Proceedings of the 32nd International Conference on Algorithmic Learning Theory, volume 132 of Proceedings of Machine Learning Research, pages 697–721. PMLR, 2021.

Yichun Hu, Nathan Kallus, Xiaojie Mao, and Yanchen Wu. Contextual linear optimization with bandit feedback. In Advances in Neural Information Processing Systems, volume 37, 2024.

William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in Modern Analysis and Probability, volume 26 of Contemporary Mathematics, pages 189–206. American Mathematical Society, 1984.

Kai Kellner and Thorsten Theobald. Sum of squares certificates for containment of H-polytopes in V-polytopes. SIAM Journal on Discrete Mathematics, 30(2):763–776, 2016.

S. R. Kulkarni, S. K. Mitter, and J. N. Tsitsiklis. Active learning using arbitrary binary valued queries. Machine Learning, 11(1):23–35, 1993.

Eva Ley and Maximilian Merkert.
Solution methods for partial inverse combinatorial optimization problems in which weights can only be increased. Journal of Global Optimization, 93(1):263–298, 2025.

Nick Littlestone and Manfred K. Warmuth. Relating data compression and learnability. Technical report, University of California, Santa Cruz, 1986.

Heyuan Liu and Paul Grigas. Online contextual decision-making with a smart predict-then-optimize method, 2022.

Mo Liu, Paul Grigas, Heyuan Liu, and Zuo-Jun Max Shen. Active learning for contextual linear optimization: A margin-based approach, 2023. arXiv:2305.06584v2 (Jan 2025).

Zakaria Mhammedi. Online convex optimization with a separation oracle. In Nika Haghtalab and Ankur Moitra, editors, Proceedings of the 38th Conference on Learning Theory (COLT), volume 291 of Proceedings of Machine Learning Research, pages 4033–4077. PMLR, 2025.

Shay Moran and Amir Yehudayoff. Sample compression schemes for VC classes. Journal of the ACM, 63(3):21:1–21:10, 2016.

Dario Paccagnan, Daniel Marks, Marco C. Campi, and Simone Garatti. Pick-to-learn for systems and control: Data-driven synthesis with state-of-the-art safety guarantees, 2025.

Pierre-Louis Poirion, Bruno F. Lourenço, and Akiko Takeda. Random projections of linear and semidefinite problems with linear inequalities. Linear Algebra and its Applications, 664:24–60, 2023.

Tim Roughgarden. Beyond worst-case analysis. Communications of the ACM, 62(3):88–96, 2019.

Tim Roughgarden, editor. Beyond the Worst-Case Analysis of Algorithms. Cambridge University Press, 2021. ISBN 9781108637435.

Utsav Sadana, Abhilash Chenreddy, Erick Delage, Alexandre Forel, Emma Frejinger, and Thibaut Vidal. A survey of contextual optimization methods for decision-making under uncertainty. European Journal of Operational Research, 320(2):271–289, 2025.

Shinsaku Sakaue and Taihei Oki.
Generalization bound and learning methods for data-driven projections in linear programming. In Advances in Neural Information Processing Systems, volume 37, 2024.

Noah Schutte, Grigorii Veviurko, Krzysztof Postek, and Neil Yorke-Smith. Sufficient decision proxies for decision-focused learning, 2025.

Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014. ISBN 9781107057135.

Daniel A. Spielman and Shang-Hua Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM, 51(3):385–463, 2004.

Matus Telgarsky. Stochastic linear optimization never overfits with quadratically-bounded losses on general data. In Po-Ling Loh and Maxim Raginsky, editors, Proceedings of the 35th Conference on Learning Theory (COLT), volume 178 of Proceedings of Machine Learning Research, pages 5453–5488. PMLR, 2022.

Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.

Leslie G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.

Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018.

Ky Khac Vu, Pierre-Louis Poirion, and Leo Liberti. Random projections for linear programming. Mathematics of Operations Research, 43(4):1051–1071, 2018.

David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1–157, 2014.

A Formal statement for hardness and other proofs for Section 4

A.1 Proof of Theorem 5

Theorem 30 (Formal version of Theorem 5). Fix a coordinate index $r \in [n]$.
The following decision problem is NP-hard: given a bounded polytope $\mathcal X \subseteq \mathbb{R}^n$ and a polyhedral uncertainty set $\mathcal C \subseteq \mathbb{R}^n$ specified in $H$-representation, decide whether

$\dim \mathrm{dir}(\mathcal X^\star(\mathcal C)) > \dim \mathrm{dir}(\mathcal X_r^\star(\mathcal C)), \quad (16)$

where $\mathcal X_r := \{x \in \mathcal X : x_r = 0\}$ and $\mathcal X_r^\star(\mathcal C)$ is defined as in Equation (3) with $\mathcal X$ replaced by $\mathcal X_r$. Consequently, computing $\dim \mathrm{dir}(\mathcal X^\star(\mathcal C))$ is NP-hard under polynomial-time Turing reductions. The hardness persists even when $\mathcal X$ is the $s$–$t$ unit-flow polytope of a directed acyclic graph and $\mathcal C$ is a budgeted set of arc-length increases of the form $\mathcal C_{\mathrm{cl}} = \{d + w : w \ge 0,\ \kappa^\top w \le B\}$ or $\mathcal C_{\mathrm{op}} = \{d + w : w > 0,\ \kappa^\top w < B + \eta\}$, for any fixed rational $\eta \in (0,1)$.

Proof. We prove NP-hardness via the 3-SAT $\Leftrightarrow$ PISPP-W+ construction of Ley and Merkert (2025, Theorem 3.1). Given a 3-SAT instance $\varphi$, their reduction produces a partial inverse shortest path instance with only weight increases (PISPP-W+), consisting of a directed acyclic graph $G = (V, A)$ with source $s$ and sink $t$, initial arc-lengths $d \in \mathbb{Q}_{\ge 0}^A$, modification costs $\kappa \in \mathbb{Q}_{> 0}^A$, a budget $B \in \mathbb{Q}_{> 0}$, and a single required arc $r \in A$. The equivalent PISPP-W+ decision question is whether there exists a modification vector $w \in \mathbb{Q}_{\ge 0}^A$ with $\kappa^\top w \le B$ such that some shortest $s$–$t$ path with respect to the modified arc-lengths $d + w$ contains $r$.

We proceed with the proof in three steps. (1) Restating the hard instance and key properties: recall the explicit 3-SAT $\to$ PISPP-W+ construction and the structural properties we will use. (2) Openifying the uncertainty set: show that passing from the closed budgeted set $\mathcal C_{\mathrm{cl}}$ to the slightly relaxed open set $\mathcal C_{\mathrm{op}}$ does not change the answer to the PISPP-W+ decision question on these instances. (3) Dimension comparison: encode $s$–$t$ paths as extreme points of a flow polytope and reduce the PISPP-W+ decision question to the strict inequality (16).

Step 1: Hard instance and structural properties.
We first restate the explicit 3-SAT $\to$ PISPP-W+ construction in the notation needed here. Let $\varphi$ have variables $x_1, \ldots, x_n$ and clauses $B_1, \ldots, B_m$. Write each clause as $B_j = \{\ell_{j1}, \ell_{j2}, \ell_{j3}\}$, where each literal $\ell_{jk} \in \{x_i, \bar x_i : i \in [n]\}$ (if a clause has fewer than three literals, duplicate one). For each $\ell_{jk}$ we create a clause–literal vertex $b_{jk}$ labeled by $\ell_{jk}$.

Vertices. Let $V$ consist of $\{s_0, \ldots, s_n\} \cup \{t_0, \ldots, t_m\} \cup \{x_i, \bar x_i : i \in [n]\} \cup \{b_{jk} : j \in [m],\ k \in [3]\}$. Set the source $s := s_0$ and the sink $t := t_m$.

Arcs. We build a DAG $G = (V, A)$ with a variable layer (from $s_0$ to $s_n$) and a clause layer (from $t_0$ to $t_m$), connected by one required arc and additional shortcut arcs.

• Variable gadgets: for each $i \in [n]$, add the four arcs $(s_{i-1}, x_i)$, $(x_i, s_i)$, $(s_{i-1}, \bar x_i)$, $(\bar x_i, s_i)$. Thus, any $s_0$–$s_n$ path chooses exactly one of $\{x_i, \bar x_i\}$ for each $i$.

• Clause gadgets: for each $j \in [m]$ and $k \in [3]$, add $(t_{j-1}, b_{jk})$ and $(b_{jk}, t_j)$. Thus, any $t_0$–$t_m$ path chooses exactly one literal vertex $b_{jk}$ per clause $j$.

• Required arc: add $r := (s_n, t_0)$, which is the unique required arc ($R = \{r\}$).

• Shortcut arcs: for each clause–literal vertex $b_{jk}$ labeled by a literal on variable $x_i$, add exactly one arc from the opposite variable vertex into $b_{jk}$:

$\ell_{jk} = x_i \Rightarrow (\bar x_i, b_{jk}) \in A, \qquad \ell_{jk} = \bar x_i \Rightarrow (x_i, b_{jk}) \in A.$

Equivalently, the tail of the shortcut arc is the variable vertex that makes $\ell_{jk}$ false.

Initial arc-lengths and modification costs. Set the initial lengths $d$ by $d(a) = 1$ for every non-shortcut arc $a$, and $d(a) = 2(n - i + j)$ for a shortcut arc $a = (\cdot, b_{jk})$ with $\ell_{jk} \in \{x_i, \bar x_i\}$. Let the budget be $B := n + 2m$ and set modification costs $\kappa$ by $\kappa_a = 1$ for every non-shortcut arc $a$ and $\kappa_a = B + 1$ for every shortcut arc $a$.

Lemma 31 (Structural properties).
The above instance satisfies the following properties (cf. Observations 1–4 and the claim in the proof of (Ley and Merkert, 2025, Theorem 3.1)):

(i) Degree constraints. Each $x_i$ and $\bar x_i$ has exactly one ingoing arc (from $s_{i-1}$), and each $b_{jk}$ has exactly one outgoing arc (to $t_j$).

(ii) Path structure (“$r$ or one shortcut”). Every $s$–$t$ path contains either the required arc $r$ or exactly one shortcut arc (but not both).

(iii) Unit gap under $d$. Every $s$–$t$ path using $r$ has length $2n + 2m + 1$ under $d$, while every $s$–$t$ path using a shortcut arc has length $2n + 2m$ under $d$. In particular (with $w = 0$), all shortest $s$–$t$ paths avoid $r$.

(iv) Encoding of assignments + “false literal $\Rightarrow$ available shortcut”. If an $s$–$t$ path $P$ uses $r$, then it visits exactly one of $\{x_i, \bar x_i\}$ for every $i \in [n]$ and exactly one $b_{jk}$ in every clause gadget $j \in [m]$. Define the induced truth assignment by $x_i^P = 1$ iff $x_i \in P$. If $P$ visits a literal vertex $b_{jk}$ that is false under $x^P$, then the unique shortcut arc entering $b_{jk}$ has its tail on $P$.

(v) 3-SAT $\Leftrightarrow$ PISPP-W+. The mapping $\varphi \mapsto (G, d, \kappa, B, r)$ is computable in polynomial time and satisfies the following equivalence: the formula $\varphi$ is satisfiable if and only if there exists a modification vector $w \in \mathbb{Q}_{\ge 0}^A$ with $\kappa^\top w \le B$ such that some shortest $s$–$t$ path with respect to the modified arc-lengths $d + w$ contains the required arc $r$. Equivalently, if $\varphi$ is unsatisfiable then for every $w \in \mathbb{Q}_{\ge 0}^A$ with $\kappa^\top w \le B$, every shortest $s$–$t$ path under $d + w$ avoids $r$.

Proof. (i) is immediate from the arc construction.

(ii) The only arcs that can enter the clause layer are $r = (s_n, t_0)$ and the shortcut arcs into some $b_{jk}$. Once the path enters the clause layer, it cannot return to the variable layer (there are no such arcs), and each $b_{jk}$ has only the outgoing arc $(b_{jk}, t_j)$; hence no $s$–$t$ path can use two shortcuts, and no path can use both $r$ and a shortcut.
(iii) Any $s$–$t$ path using $r$ traverses $2n$ unit-length arcs in the variable gadgets, then $r$ (length 1), then $2m$ unit-length arcs in the clause gadgets, totaling $2n + 2m + 1$. A path using a shortcut from variable $i$ into clause $j$ has prefix length $2(i-1) + 1$ up to the tail variable vertex, then shortcut length $2(n - i + j)$, then suffix length $1 + 2(m - j)$ in the clause layer, totaling

$\big(2(i-1) + 1\big) + 2(n - i + j) + \big(1 + 2(m - j)\big) = 2n + 2m.$

(iv) The first statement follows since each variable gadget offers exactly two disjoint choices and each clause gadget offers exactly three disjoint choices. For the last statement, if $b_{jk}$ is false under $x^P$, then $P$ must have visited the opposite variable vertex ($\bar x_i$ if $\ell_{jk} = x_i$, and $x_i$ if $\ell_{jk} = \bar x_i$), which is exactly the tail of the unique shortcut arc into $b_{jk}$.

(v) is exactly the claim of Theorem 3.1 in Ley and Merkert (2025), which shows NP-completeness of PISPP-W+. We provide a proof sketch here: ($\Rightarrow$) Let $x^*$ satisfy $\varphi$. W.l.o.g. reorder literals in each clause so that $b_{j1}$ is true. Build the $s$–$t$ path $P$ that uses $r$ by following $x^*$ in the variable gadgets and $b_{j1}$ in each clause gadget. Set $w = 1$ on the unique entry arc into the unchosen variable vertex (one per variable) and set $w = 1$ on $(b_{j2}, t_j)$ and $(b_{j3}, t_j)$ (two per clause); set $w = 0$ on all remaining arcs (in particular on shortcut arcs). Then $\kappa^\top w = n + 2m = B$, and any shortcut path must traverse at least one penalized arc, so its length increases by $\ge 1$, closing the initial gap $d(Q) = d(P) - 1$ and implying $P$ is (one of) the shortest paths. ($\Leftarrow$) Suppose $\kappa^\top w \le B$ and some shortest path $P$ under $d + w$ uses $r$; let $x^P$ be the assignment induced by $P$. If some clause is false under $x^P$, then the visited literal vertex $b_{jk}$ is false, and by (iv) the entering shortcut arc $e$ has its tail on $P$.
Replacing the corresponding subpath $R$ of $P$ by $e$ gives a shortcut path $Q$ with $d(Q) = d(P) - 1$ (by (iii)). Thus

$(d + w)(Q) - (d + w)(P) = -1 + w(e) - w(R) \le -1 + w(e),$

so shortestness of $P$ forces $w(e) \ge 1$. Since $e$ is a shortcut arc with $\kappa_e = B + 1$, this implies $\kappa^\top w \ge \kappa_e w(e) > B$, a contradiction.

Step 2: Openifying the uncertainty set does not change the answer. This step is only needed to ensure hardness persists for an open uncertainty set. We consider two uncertainty sets of admissible arc-length vectors:

$\mathcal C_{\mathrm{cl}} := \{c = d + w : w \in \mathbb{R}_{\ge 0}^A,\ \kappa^\top w \le B\}$ and $\mathcal C_{\mathrm{op}} := \{c = d + w : w \in \mathbb{R}_{> 0}^A,\ \kappa^\top w < B + \eta\},$

where we fix any rational constant $\eta \in (0, 1)$ (e.g., $\eta = \tfrac12$). The lemma below shows that for the hard instances above, replacing $\mathcal C_{\mathrm{cl}}$ by $\mathcal C_{\mathrm{op}}$ does not change the answer to the underlying question.

Lemma 32. Fix any $\eta \in (0,1)$. For the PISPP-W+ instance produced by Ley and Merkert (2025) from $\varphi$, the formula $\varphi$ is satisfiable if and only if there exists $c \in \mathcal C_{\mathrm{op}}$ such that some shortest $s$–$t$ path with respect to $c$ contains $r$.

Proof. ($\Rightarrow$) Suppose $\varphi$ is satisfiable. By Lemma 31 (v), there exists $w_0 \in \mathbb{Q}_{\ge 0}^A$ with $\kappa^\top w_0 \le B$ such that for $c_0 := d + w_0$ there is a shortest $s$–$t$ path using $r$. Since $G$ is a DAG, fix any topological ordering $\rho : V \to \{0, 1, \ldots, |V| - 1\}$. For (a rational) $\varepsilon > 0$, define a perturbation $\delta \in \mathbb{Q}^A$ by

$\delta(u, v) := \varepsilon\big(\rho(v) - \rho(u)\big) \quad \forall (u, v) \in A.$

Then $\delta > 0$ componentwise. Moreover, for any $s$–$t$ path $P = (v_0 = s, v_1, \ldots, v_k = t)$, the perturbation telescopes:

$\sum_{i=0}^{k-1} \delta(v_i, v_{i+1}) = \varepsilon \sum_{i=0}^{k-1} \big(\rho(v_{i+1}) - \rho(v_i)\big) = \varepsilon\big(\rho(t) - \rho(s)\big),$

which is a constant independent of $P$. Hence, adding $\delta$ shifts the length of every $s$–$t$ path by the same constant, so the set of shortest $s$–$t$ paths is unchanged when we replace $c_0$ by $c_0 + \delta$. Finally, choose $\varepsilon$ small enough so that $\kappa^\top \delta < \eta$.
For instance, since $\rho(v) - \rho(u) \le |V| - 1$ for every arc,

$\kappa^\top \delta = \sum_{a \in A} \kappa_a \delta_a \le \varepsilon (|V| - 1) \sum_{a \in A} \kappa_a,$

so it suffices to take $\varepsilon := \eta \big/ \big(2(|V| - 1) \sum_{a \in A} \kappa_a\big)$. With this choice, $w := w_0 + \delta$ satisfies $w > 0$ and $\kappa^\top w < B + \eta$, i.e., $c := d + w \in \mathcal C_{\mathrm{op}}$, and there still exists a shortest path using $r$.

($\Leftarrow$) We prove the contrapositive. Suppose $\varphi$ is not satisfiable, and assume that there exists a shortest $s$–$t$ path $P$ with respect to $c := d + w \in \mathcal C_{\mathrm{op}}$ that contains $r$. By Lemma 31 (ii), $P$ contains no shortcut arc. Define the induced assignment $x^P$ as in Lemma 31 (iv). Since $\varphi$ is unsatisfiable, there exists a clause index $j$ that is not satisfied by $x^P$. By Lemma 31 (iv), $P$ visits exactly one literal vertex $b_{jk}$ in clause gadget $j$; because clause $j$ is unsatisfied, this visited literal vertex $b_{jk}$ is false under $x^P$. Therefore Lemma 31 (iv) yields that the unique shortcut arc $e$ entering $b_{jk}$ has its tail on $P$. Let $R$ denote the (nonempty) subpath of $P$ from the tail of $e$ to the vertex $b_{jk}$, and define a new $s$–$t$ path $Q$ by following $P$ up to the tail of $e$, then traversing $e$, and then following $P$ from $b_{jk}$ to $t$. Then $Q$ uses a shortcut arc, so by Lemma 31 (iii) we have $d(Q) = d(P) - 1$. Since $P$ and $Q$ coincide outside of $R$ and $e$, we obtain

$(d + w)(Q) - (d + w)(P) = \big(d(Q) - d(P)\big) + \big(w(Q) - w(P)\big) = -1 + w(e) - w(R).$

Because $R$ contains at least one arc and $w > 0$ componentwise, we have $w(R) > 0$. Thus if $P$ is shortest we must have $0 \le (d + w)(Q) - (d + w)(P)$, implying $w(e) > 1$. As $e$ is a shortcut arc, $\kappa_e = B + 1$, hence

$\kappa^\top w \ge \kappa_e w(e) > (B + 1) \cdot 1 = B + 1.$

In particular, for any $\eta \in (0, 1)$ we have $\kappa^\top w > B + \eta$, contradicting the assumption $\kappa^\top w < B + \eta$ required for $d + w \in \mathcal C_{\mathrm{op}}$. Therefore, no such $w > 0$ can make a shortest $s$–$t$ path use $r$.

Step 3: Reduction from PISPP-W+ to a comparison of decision-relevant dimensions.
We now translate the PISPP-W+ instance into a linear optimization problem over the $s$–$t$ unit-flow polytope. The coordinate indexed by the required arc $r$ will play the role of the distinguished coordinate in (16), and we will compare (the dimensions of) reachable optimal directions with and without imposing $x_r = 0$.

Given the instance $(G, d, \kappa, B, r)$, let $p := |A|$ and index coordinates of $\mathbb{R}^p$ by arcs. Define the $s$–$t$ unit-flow polytope

$\mathcal X := \Big\{ x \in \mathbb{R}_{\ge 0}^p : \sum_{a \in \delta^+(v)} x_a - \sum_{a \in \delta^-(v)} x_a = \begin{cases} 1 & v = s, \\ -1 & v = t, \\ 0 & \text{otherwise} \end{cases} \Big\}.$

Because $G$ is acyclic, the extreme points of $\mathcal X$ are exactly the incidence vectors of $s$–$t$ paths. Let $\mathcal X_r := \{x \in \mathcal X : x_r = 0\}$ be the face of flows that avoid the required arc. Note that $\mathcal X_r$ is a face of $\mathcal X$, hence

$\mathcal X_r^\angle = \mathcal X^\angle \cap \mathcal X_r = \{x \in \mathcal X^\angle : x_r = 0\}. \quad (17)$

For $\mathcal C \in \{\mathcal C_{\mathrm{cl}}, \mathcal C_{\mathrm{op}}\}$, define the reachable optimal sets $\mathcal X^\star(\mathcal C)$ and $\mathcal X_r^\star(\mathcal C)$ as in Equation (3). The next lemma is the key structural property for the dimension comparison: it identifies the reachable optimal extreme points of the restricted face $\mathcal X_r$ and shows that all of them are also reachable for $\mathcal X$.

Lemma 33. For $\mathcal C \in \{\mathcal C_{\mathrm{cl}}, \mathcal C_{\mathrm{op}}\}$ constructed above, we have $\mathcal X_r^\star(\mathcal C) = \mathcal X_r^\angle$ and $\mathcal X_r^\angle \subseteq \mathcal X^\star(\mathcal C)$.

Proof. For $\mathcal C = \mathcal C_{\mathrm{cl}}$, let $c_0 := d \in \mathcal C$ (take $w = 0$). For $\mathcal C = \mathcal C_{\mathrm{op}}$, let $\delta$ be the topological perturbation from the proof of Lemma 32 and set $c_0 := d + \delta \in \mathcal C_{\mathrm{op}}$. In either case, by the unit-gap property under $d$ and by the telescoping of $\delta$, every $s$–$t$ path that avoids $r$ is shortest under $c_0$, and every path that uses $r$ is strictly longer. Combined with (17), we have $\mathcal X^\star(c_0) = \{x \in \mathcal X^\angle : x_r = 0\} = \mathcal X_r^\angle$, and in particular $\mathcal X_r^\angle \subseteq \mathcal X^\star(\mathcal C)$. Moreover, for this same $c_0$, every extreme point of $\mathcal X_r$ (i.e., every $s$–$t$ path avoiding $r$) is optimal for $\min\{(c_0)^\top x : x \in \mathcal X_r\}$, so $\mathcal X_r^\angle \subseteq \mathcal X_r^\star(\mathcal C)$. The reverse inclusion $\mathcal X_r^\star(\mathcal C) \subseteq \mathcal X_r^\angle$ holds by definition, hence $\mathcal X_r^\star(\mathcal C) = \mathcal X_r^\angle$.
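The telescoping property of the perturbation $\delta$, used in the proofs of Lemmas 32 and 33, is easy to verify numerically; the toy DAG and topological order below are illustrative assumptions:

```python
# Every s-t path in a DAG collects the same total perturbation
# delta(u, v) = eps * (rho(v) - rho(u)), namely eps * (rho(t) - rho(s)),
# so adding delta shifts all s-t path lengths by one common constant.
adj = {"s": ["a", "b"], "b": ["a", "t"], "a": ["t"], "t": []}
rho = {"s": 0, "b": 1, "a": 2, "t": 3}   # a topological order of the toy DAG
eps = 1                                   # integer scale keeps arithmetic exact

def paths(u, cur):
    """Enumerate all vertex sequences from u to the sink t."""
    if u == "t":
        yield list(cur)
        return
    for v in adj[u]:
        yield from paths(v, cur + [v])

totals = {sum(eps * (rho[v] - rho[u]) for u, v in zip(p, p[1:]))
          for p in paths("s", ["s"])}
print(totals)  # a single value: eps * (rho(t) - rho(s))
```

All three $s$–$t$ paths of the toy graph pick up the same perturbation total, matching the telescoping identity.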
We are now ready to complete the reduction. To prove Theorem 30, it suffices to establish the following claim.

Claim 34. The formula $\varphi$ is satisfiable if and only if $\dim \mathrm{dir}(\mathcal X^\star(\mathcal C)) > \dim \mathrm{dir}(\mathcal X_r^\star(\mathcal C))$.

Proof. By Lemma 33, we always have $\mathcal X_r^\star(\mathcal C) \subseteq \mathcal X^\star(\mathcal C)$, hence

$\mathrm{dir}(\mathcal X_r^\star(\mathcal C)) \subseteq \mathrm{dir}(\mathcal X^\star(\mathcal C)). \quad (18)$

($\Rightarrow$) Assume $\varphi$ is satisfiable. We claim that there exists $c^+ \in \mathcal C$ such that some shortest $s$–$t$ path w.r.t. $c^+$ contains $r$. If $\mathcal C = \mathcal C_{\mathrm{cl}}$, this follows from Lemma 31 (v) (equivalently, Ley and Merkert (2025, Theorem 3.1)); if $\mathcal C = \mathcal C_{\mathrm{op}}$, it follows from Lemma 32. Let $x^+ \in \mathcal X^\star(c^+) \cap \mathcal X^\angle \subseteq \mathcal X^\star(\mathcal C)$ be the incidence vector of such a shortest path, so $x_r^+ = 1$. By Lemma 33, we also have $\mathcal X_r^\angle \subseteq \mathcal X^\star(\mathcal C)$; pick any $x^0 \in \mathcal X_r^\angle$, so $x_r^0 = 0$. Then $v := x^+ - x^0 \in \mathrm{dir}(\mathcal X^\star(\mathcal C))$ and satisfies $v_r = 1$. On the other hand, every $u \in \mathrm{dir}(\mathcal X_r^\star(\mathcal C))$ satisfies $u_r = 0$, and therefore $v \notin \mathrm{dir}(\mathcal X_r^\star(\mathcal C))$. Combining this with (18) yields $\mathrm{dir}(\mathcal X_r^\star(\mathcal C)) \subsetneq \mathrm{dir}(\mathcal X^\star(\mathcal C))$, and thus $\dim \mathrm{dir}(\mathcal X^\star(\mathcal C)) > \dim \mathrm{dir}(\mathcal X_r^\star(\mathcal C))$.

($\Leftarrow$) We prove the contrapositive. Assume $\varphi$ is not satisfiable. Then, for $\mathcal C = \mathcal C_{\mathrm{cl}}$ the corresponding PISPP-W+ instance is infeasible by Ley and Merkert (2025), and for $\mathcal C = \mathcal C_{\mathrm{op}}$ it is infeasible by Lemma 32. Equivalently, for every $c \in \mathcal C$, every extreme optimal solution of $\min\{c^\top x : x \in \mathcal X\}$ avoids the arc $r$, i.e., satisfies $x_r = 0$. Since $\mathcal X_r$ is a face of $\mathcal X$, we have $\mathcal X_r^\angle = \{x \in \mathcal X^\angle : x_r = 0\}$, and therefore

$\mathcal X^\star(c) \cap \mathcal X^\angle \subseteq \mathcal X_r^\angle \quad \forall c \in \mathcal C.$

Taking the union over $c \in \mathcal C$ gives $\mathcal X^\star(\mathcal C) \subseteq \mathcal X_r^\angle$. On the other hand, Lemma 33 yields $\mathcal X_r^\angle \subseteq \mathcal X^\star(\mathcal C)$ and $\mathcal X_r^\star(\mathcal C) = \mathcal X_r^\angle$. Hence $\mathcal X^\star(\mathcal C) = \mathcal X_r^\star(\mathcal C)$, and thus $\mathrm{dir}(\mathcal X^\star(\mathcal C)) = \mathrm{dir}(\mathcal X_r^\star(\mathcal C))$. This completes the reduction. Therefore deciding whether (16) holds is NP-hard.
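The unit-gap bookkeeping of Lemma 31 (iii), which drives the reduction just completed, can be checked mechanically over all positions of the shortcut:

```python
# Verify the two length identities in Lemma 31 (iii) for several (n, m):
# a path through r has length 2n + 2m + 1, and a shortcut path from variable i
# into clause j has length (2(i-1)+1) + 2(n-i+j) + (1+2(m-j)) = 2n + 2m.
checked = 0
for n, m in [(2, 1), (3, 2), (5, 4)]:
    assert 2 * n + 1 + 2 * m == 2 * n + 2 * m + 1          # path using r
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prefix = 2 * (i - 1) + 1        # variable layer up to the tail vertex
            shortcut = 2 * (n - i + j)      # the single shortcut arc
            suffix = 1 + 2 * (m - j)        # remaining clause layer
            assert prefix + shortcut + suffix == 2 * n + 2 * m
            checked += 1
print(checked, "shortcut positions verified")
```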
Finally, to deduce the hardness of computing $\dim \mathrm{dir}(\mathcal X^\star(\mathcal C))$ as a function problem, note that if we could compute $\dim \mathrm{dir}(\mathcal X^\star(\mathcal C))$ in polynomial time, we could compute both sides of (16) (one call on $(\mathcal X, \mathcal C)$ and one call on $(\mathcal X_r, \mathcal C)$) and decide the inequality, implying NP-hardness under polynomial-time Turing reductions.

A.2 Proof of Theorem 9

Theorem 35 (Formal version of Theorem 9). The following decision problem is coNP-hard: given a bounded polytope $\mathcal X \subseteq \mathbb{R}^n$, a polyhedral uncertainty set $\mathcal C \subseteq \mathbb{R}^n$ specified in $H$-representation, a dataset $\mathcal D$, and a cost vector $c \in \mathcal C$, decide whether $\mathcal D$ is pointwise sufficient at $c$ in the sense of Definition 6. Consequently, computing the size of a minimum pointwise SDD (and the corresponding search problem of finding one) is coNP-hard.

The H-in-V polytope containment problem. An instance consists of two polytopes $P, Q \subseteq \mathbb{R}^d$, where $P = \{z \in \mathbb{R}^d : Hz \le h\}$ and $Q = \mathrm{conv}\{v_1, \ldots, v_M\}$, i.e., $P$ is given in $H$-representation and $Q$ is given in $V$-representation. The decision problem asks whether $P \subseteq Q$. This problem is coNP-complete (Freund and Orlin, 1985). Moreover, the coNP-hardness persists under strong structural restrictions; in particular, it is already coNP-complete to decide whether the standard cube is contained in an affine image of a cross polytope (Gritzmann and Klee, 1993). A convenient modern reference is Kellner and Theobald (2016, Proposition 2.1). In our reduction, we work with the following convenient hard family: given $Q = \mathrm{conv}\{v_1, \ldots, v_M\} \subseteq \mathbb{R}^d$ that is full-dimensional and satisfies $0 \in \mathrm{int}(Q)$, decide whether the hypercube $P = [-1, 1]^d$ is contained in $Q$.

Construction of a standard-form LP instance. Let $P = [-1, 1]^d$ and $Q = \mathrm{conv}\{v_1, \ldots, v_M\}$ be such a hard instance. Set $n_0 := d + 1$ and define the homogenized vectors $\bar v_i := (v_i, 1) \in \mathbb{R}^{n_0}$.
Let V̄ ∈ R^{M×n_0} be the matrix whose i-th row is v̄_i⊤, and set β := V̄·1 ∈ R^M, where 1 denotes the all-ones vector. We introduce variables w, r ∈ R^{n_0} and s ∈ R^M, and write z := (w, r, s) ∈ R^n with n := 2n_0 + M. Define the standard-form polytope

X := { z ∈ R^n : Az = b, z ≥ 0 },  where A := [ I_{n_0}  I_{n_0}  0 ; V̄  0  −I_M ] and b := (2·1, β).  (19)

The matrix A has full row rank, and X is nonempty and bounded (indeed, w + r = 2·1 implies 0 ≤ w ≤ 2·1 and 0 ≤ r ≤ 2·1, while s = V̄w − β is then bounded as well). Let z_0 := (1, 1, 0) ∈ X. We consider the empty dataset D = ∅ and the cost vector c_0 := (e_{n_0}, 0, 0) ∈ R^n, where e_{n_0} ∈ R^{n_0} is the last standard basis vector. Finally, define the polyhedral uncertainty set

C := { ((y, 1), 0, 0) ∈ R^n : y ∈ P }.

Note that c_0 ∈ C since 0 ∈ [−1, 1]^d. We claim that D = ∅ is pointwise sufficient at c_0 if and only if P ⊆ Q.

Step 1: the optimality cone at z_0. For a standard-form polytope X = {z : Az = b, z ≥ 0} and an extreme point z_0, the optimality cone can be written via the KKT conditions as

Λ(z_0) = { c ∈ R^n : ∃ y, ρ s.t. c = A⊤y + ρ, ρ ≥ 0, ρ_j = 0 whenever (z_0)_j > 0 }.  (20)

Here z_0 = (1, 1, 0) has strictly positive components on the w- and r-coordinates and zeros on the s-coordinates. Writing y = (α, µ) ∈ R^{n_0} × R^M, we have A⊤y = (α + V̄⊤µ, α, −µ). Therefore, for costs of the form ((ỹ, t̃), 0, 0) we obtain the characterization

((ỹ, t̃), 0, 0) ∈ Λ(z_0) ⟺ (ỹ, t̃) ∈ cone{v̄_1, ..., v̄_M}.  (21)

Step 2: z_0 is the unique minimizer for c_0. Let u := (x, t) := w − 1 ∈ R^d × R. From (19), feasibility implies w ∈ [0, 2]^{n_0} and

s_i = v̄_i⊤w − β_i = v̄_i⊤(w − 1) = v_i⊤x + t ≥ 0 for all i ∈ [M].

Thus the projection z ↦ u = w − 1 identifies X with the bounded polytope

X̃ := { (x, t) ∈ R^d × R : v_i⊤x + t ≥ 0 ∀ i ∈ [M], −1 ≤ (x, t) ≤ 1 }.
Moreover, minimizing c_0⊤z over X is equivalent (up to an additive constant) to minimizing t over X̃, since c_0⊤z = e_{n_0}⊤w = t + 1. We now show that (0, 0) is the unique minimizer of min{t : (x, t) ∈ X̃}. First, we claim that every feasible point satisfies t ≥ 0. Indeed, if t < 0 then v_i⊤x ≥ −t > 0 for all i, hence q⊤x > 0 for all q ∈ Q = conv{v_1, ..., v_M}. But 0 ∈ int(Q) implies that for any x ≠ 0 there exists ε > 0 such that −εx/∥x∥ ∈ Q, which yields (−εx/∥x∥)⊤x < 0, a contradiction. Thus t ≥ 0. Therefore the minimum value of t equals 0 (since (0, 0) ∈ X̃). When t = 0, feasibility requires v_i⊤x ≥ 0 for all i, which again forces x = 0 by the same argument using 0 ∈ int(Q). Hence (x, t) = (0, 0) is the unique minimizer in X̃, and consequently z_0 = (1, 1, 0) is the unique minimizer in X: X⋆(c_0) = {z_0}.

Step 3: pointwise sufficiency reduces to cone containment. Since D = ∅, the data-consistent fiber equals C. Pointwise sufficiency at c_0 requires a decision z⋆ ∈ X that is optimal for all costs in C (and in particular for c_0). By Step 2, this forces z⋆ = z_0. Therefore,

D = ∅ is pointwise sufficient at c_0 ⟺ z_0 ∈ X⋆(c) ∀ c ∈ C ⟺ C ⊆ Λ(z_0).

Step 4: C ⊆ Λ(z_0) iff P ⊆ Q. By (21), for any y ∈ R^d we have

((y, 1), 0, 0) ∈ Λ(z_0) ⟺ (y, 1) ∈ cone{v̄_1, ..., v̄_M} ⟺ y ∈ Q,

where the last equivalence uses that (y, 1) = Σ_{i=1}^M α_i(v_i, 1) with α_i ≥ 0 holds if and only if Σ_{i=1}^M α_i = 1 and y = Σ_{i=1}^M α_i v_i. Applying this pointwise over y ∈ P = [−1, 1]^d yields C ⊆ Λ(z_0) ⟺ P ⊆ Q. This completes a polynomial-time reduction from H-in-V containment to checking pointwise sufficiency, proving coNP-hardness.
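Step 4 turns sufficiency checking into cube-in-Q containment. As a tiny numerical illustration (an assumption of this sketch: Q is the cross-polytope conv{±e_1, ..., ±e_d}, for which membership reduces to an ℓ1-norm test), one can certify non-containment by exhibiting a cube corner outside Q:

```python
import numpy as np

def in_cross_polytope(y):
    """y ∈ Q = conv{±e_1, ..., ±e_d} iff ||y||_1 <= 1 (cross-polytope special case)."""
    return bool(np.abs(y).sum() <= 1.0)

# By Step 4, ((y, 1), 0, 0) ∈ Λ(z0) iff y ∈ Q, so C ⊆ Λ(z0) iff every point of
# the cube P = [-1, 1]^d lies in Q. Since P is the convex hull of its corners
# and Q is convex, testing the 2^d corners suffices -- cheap only for small d;
# the coNP-hardness concerns general V-represented Q and large d.
d = 2
corners = [np.array([sx, sy], dtype=float) for sx in (-1, 1) for sy in (-1, 1)]
P_in_Q = all(in_cross_polytope(y) for y in corners)
print(P_in_Q)  # False: the corner (1, 1) has ||y||_1 = 2 > 1
```

For a general Q given by vertices, each membership test y ∈ Q is itself a small LP, but the number of cube corners grows as 2^d, which is consistent with the containment problem being coNP-complete.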
Finally, since our reduction uses D = ∅, deciding whether the minimum pointwise SDD size equals 0 is already coNP-hard; hence, computing the minimum size (and producing a minimum pointwise SDD) is coNP-hard.

A.3 Proof of Theorem 10

Theorem 36 (Formal version of Theorem 10). The following decision problem is coNP-hard: given a bounded polytope X ⊆ R^n and a polyhedral uncertainty set C ⊆ R^n specified in H-representation, decide whether the empty dataset D = ∅ is a global SDD for (X, C) in the sense of Definition 1. The hardness persists even when C is an open, full-dimensional polyhedron. Consequently, computing the size of a minimum global SDD (and the corresponding search problem of finding one) is coNP-hard.

Proof. We reduce from the same restricted H-in-V polytope containment problem used in the proof of Theorem 9: given a full-dimensional polytope Q = conv{v_1, ..., v_M} ⊆ R^d with 0 ∈ int(Q), decide whether P = [−1, 1]^d ⊆ Q, which is coNP-complete. We reuse the standard-form polytope X from (19). In particular, X = {z = (w, r, s) ∈ R^n : Az = b, z ≥ 0} with n = 2(d + 1) + M, and it contains the distinguished vertex z_0 = (1, 1, 0).

A full-dimensional open uncertainty set. Write costs as c = (c_w, c_r, c_s) ∈ R^{n_0} × R^{n_0} × R^M with n_0 = d + 1. Define the effective cost on w by the linear map

T(c) := c_w − c_r + V̄⊤c_s ∈ R^{n_0}.  (22)

Indeed, for any feasible z = (w, r, s) ∈ X we have r = 2·1 − w and s = V̄w − β, so

c⊤z = c_w⊤w + c_r⊤(2·1 − w) + c_s⊤(V̄w − β) = T(c)⊤w + (2c_r⊤1 − c_s⊤β),  (23)

where the second term is constant over X. Therefore argmin_{z∈X} c⊤z depends on c only through T(c). Let B_op ⊆ R^{n_0} denote the open polyhedron

B_op := { (ỹ, t̃) ∈ R^d × R : 1/2 < t̃ < 3/2, −t̃ < ỹ_j < t̃ ∀ j ∈ [d] }.

We define the uncertainty set as the preimage C_op := {c ∈ R^n : T(c) ∈ B_op}.
Since T is linear and B_op is an open polyhedron, C_op is also an open polyhedron. Moreover, it is full-dimensional because it is open and nonempty (e.g., ((0, 1), 0, 0) ∈ C_op). We claim that

D = ∅ is a global SDD for (X, C_op) ⟺ P ⊆ Q.  (25)

Since deciding P ⊆ Q is coNP-hard, this proves the theorem.

Step 1: Containment is equivalent to a cone inclusion. Recall that cone{v̄_1, ..., v̄_M} = {(tz, t) : t ≥ 0, z ∈ Q}. In particular, for any t̃ > 0 we have (ỹ, t̃) ∈ cone{v̄_1, ..., v̄_M} ⟺ ỹ/t̃ ∈ Q. Therefore,

B_op ⊆ cone{v̄_1, ..., v̄_M} ⟺ (−1, 1)^d ⊆ Q.

Because Q is closed, (−1, 1)^d ⊆ Q holds if and only if [−1, 1]^d ⊆ Q, i.e., P ⊆ Q.

Step 2: If P ⊆ Q, then D = ∅ is a global SDD. Assume P ⊆ Q. By Step 1 we have B_op ⊆ cone{v̄_1, ..., v̄_M}, and since B_op is open this implies B_op ⊆ int cone{v̄_1, ..., v̄_M}. Next, note that w uniquely determines (r, s) via the equalities w + r = 2·1 and V̄w − s = β. Thus minimizing c⊤z over X is equivalent to minimizing T(c)⊤w over the projected feasible set W := {w ∈ R^{n_0} : ∃ (r, s) ≥ 0 s.t. (w, r, s) ∈ X}. Equivalently, with the change of variables u := w − 1 = (x, t), this projected set corresponds to the bounded polytope X̃ = {(x, t) ∈ R^d × R : v_i⊤x + t ≥ 0 ∀ i ∈ [M], −1 ≤ (x, t) ≤ 1}. At u_0 := (0, 0) ∈ X̃, the box constraints are slack and the active constraints are v_i⊤x + t ≥ 0, i.e., −v̄_i⊤u ≤ 0. Hence Λ(u_0) = cone{v̄_1, ..., v̄_M}. If (ỹ, t̃) ∈ int(Λ(u_0)), then u_0 is the unique minimizer for this cost: for any u ∈ X̃ \ {u_0} the direction u − u_0 is a nonzero feasible direction at u_0, so (ỹ, t̃)⊤(u − u_0) > 0 and hence (ỹ, t̃)⊤u > (ỹ, t̃)⊤u_0. Now fix any c ∈ C_op. Then T(c) ∈ B_op ⊆ int(Λ(u_0)), so the unique minimizer of min{T(c)⊤u : u ∈ X̃} is u_0.
By (23), the minimizer set of min_{z∈X} c⊤z is therefore the singleton {z_0}. Thus X⋆(c) = {z_0} for all c ∈ C_op, and the constant rule X̂(∅) := {z_0} makes D = ∅ a global SDD.

Step 3: If P ⊄ Q, then no zero-query global SDD exists. Now assume P ⊄ Q. By Step 1 there exists (ỹ, t̃) ∈ B_op such that (ỹ, t̃) ∉ cone{v̄_1, ..., v̄_M}. Let c_1 := ((ỹ, t̃), 0, 0) ∈ C_op, so T(c_1) = (ỹ, t̃). Since (ỹ, t̃) ∉ Λ(u_0), we have u_0 ∉ X̃⋆((ỹ, t̃)), and thus z_0 ∉ X⋆(c_1). On the other hand, c_0 := ((0, 1), 0, 0) ∈ C_op, and by the argument in the proof of Theorem 9 (Step 2 there), we have X⋆(c_0) = {z_0}. Therefore X⋆(c_1) ≠ X⋆(c_0), i.e., the optimal solution set is not constant over c ∈ C_op. Since |D| = 0, any candidate decoder X̂ : R^{|D|} → P(X) is necessarily constant. It cannot match two distinct optimal solution sets, so no such map can satisfy Definition 1. This establishes (25).

Consequence for minimum-size global SDDs. Finally, if one could compute the size of a minimum global SDD (or output such a dataset) in polynomial time, then one could decide whether the optimum size equals 0, which is exactly the coNP-hard decision problem in (25).

B Proofs and technical details for Section 5

B.1 Closed-form FI for ellipsoids

Recall the face-intersection subproblem

FI(δ; Q, s) := min{ δ⊤c : c ∈ C, Q⊤c = s }.

Proposition 37. Let C = {c ∈ R^d : (c − c̄)⊤Σ^{−1}(c − c̄) ≤ R²} with Σ ≻ 0. Fix Q ∈ R^{d×k} with rank(Q) = k and s ∈ R^k, and assume C(Q, s) := {c ∈ C : Q⊤c = s} ≠ ∅. Define

c_⊥ := c̄ + ΣQ(Q⊤ΣQ)^{−1}(s − Q⊤c̄),  M_⊥ := Σ − ΣQ(Q⊤ΣQ)^{−1}Q⊤Σ ⪰ 0,

and ρ := √(R² − (c_⊥ − c̄)⊤Σ^{−1}(c_⊥ − c̄)). Then for any δ ∈ R^d,

min_{c ∈ C(Q,s)} δ⊤c = δ⊤c_⊥ − ρ·√(δ⊤M_⊥δ).

If δ⊤M_⊥δ > 0, a minimizer is c_out(δ) = c_⊥ − ρ M_⊥δ / √(δ⊤M_⊥δ).
If δ⊤M_⊥δ = 0, then δ⊤c is constant over C(Q, s).

Proof. Let Σ^{1/2} be the symmetric square root of Σ (so Σ = Σ^{1/2}Σ^{1/2}). Change variables z := Σ^{−1/2}(c − c̄), i.e., c = c̄ + Σ^{1/2}z. Then the ellipsoid constraint becomes ∥z∥₂ ≤ R, and the equality constraint becomes

Q⊤c = s ⟺ Q⊤(c̄ + Σ^{1/2}z) = s ⟺ (Σ^{1/2}Q)⊤z = s − Q⊤c̄.

Define Q̃ := Σ^{1/2}Q, s̃ := s − Q⊤c̄, and δ̃ := Σ^{1/2}δ. Up to the constant δ⊤c̄, the FI problem is equivalent to min{δ̃⊤z : ∥z∥₂ ≤ R, Q̃⊤z = s̃}. Let z_0 := Q̃(Q̃⊤Q̃)^{−1}s̃, the unique minimum-ℓ₂-norm solution of Q̃⊤z = s̃. Every feasible z can be written uniquely as z = z_0 + v with v ∈ ker(Q̃⊤). Since z_0 ∈ span(Q̃) and ker(Q̃⊤) = span(Q̃)^⊥, we have the orthogonal decomposition ∥z∥₂² = ∥z_0∥₂² + ∥v∥₂². Thus feasibility is equivalent to ∥z_0∥₂ ≤ R (which holds since the fiber is nonempty) and ∥v∥₂ ≤ ρ_z := √(R² − ∥z_0∥₂²). Let P := I − Q̃(Q̃⊤Q̃)^{−1}Q̃⊤ denote the orthogonal projector onto ker(Q̃⊤). Then for v ∈ ker(Q̃⊤), δ̃⊤v = (Pδ̃)⊤v. Therefore,

min{ δ̃⊤v : v ∈ ker(Q̃⊤), ∥v∥₂ ≤ ρ_z } = min{ (Pδ̃)⊤v : ∥v∥₂ ≤ ρ_z } = −ρ_z∥Pδ̃∥₂,

with minimizer v⋆ = −ρ_z Pδ̃/∥Pδ̃∥₂ when Pδ̃ ≠ 0 (and any feasible v when Pδ̃ = 0). Hence, the optimal value in z-space is δ̃⊤z_0 − ρ_z∥Pδ̃∥₂, and the optimal c is c = c̄ + Σ^{1/2}(z_0 + v⋆). It remains to express everything back in the original variables. First, note that Q̃⊤Q̃ = Q⊤ΣQ and

Σ^{1/2}z_0 = Σ^{1/2}Q̃(Q̃⊤Q̃)^{−1}s̃ = ΣQ(Q⊤ΣQ)^{−1}(s − Q⊤c̄).

Thus c_⊥ = c̄ + Σ^{1/2}z_0 = c̄ + ΣQ(Q⊤ΣQ)^{−1}(s − Q⊤c̄). Second, ∥z_0∥₂² = z_0⊤z_0 = (c_⊥ − c̄)⊤Σ^{−1}(c_⊥ − c̄), so ρ_z = ρ as defined in the statement. Finally,

∥Pδ̃∥₂² = δ̃⊤Pδ̃ = δ⊤(Σ − ΣQ(Q⊤ΣQ)^{−1}Q⊤Σ)δ = δ⊤M_⊥δ.
Also,

Σ^{1/2}Pδ̃ = (Σ − ΣQ(Q⊤ΣQ)^{−1}Q⊤Σ)δ = M_⊥δ.

Substituting these identities yields the claimed closed-form expressions for the minimum value and the minimizer.

B.2 Proof of Property 21

Proof. We bound the computational work in one iteration of Algorithm 1. Each iteration performs the following optimization subroutines.

(i) One LP over X. Line 5 solves min{(c_in)⊤x : Ax = b, x ≥ 0}. Since X is given in standard form, this LP can be solved in time polynomial in the bit complexity of the input.

(ii) d − m face-intersection subproblems over C. For each j ∈ N (so |N| = d − m), line 6 of Algorithm 1 solves min{δ_j⊤c′ : c′ ∈ C, Q_k⊤c′ = s_k}. If C is a polytope given in H-representation, C = {c : Gc ≤ h}, then this is the LP min{δ_j⊤c′ : Gc′ ≤ h, Q_k⊤c′ = s_k}, whose size is polynomial in the input (in particular, it has dimension d and k ≤ d⋆ equality constraints). Hence, each FI call can be solved in polynomial time. If instead C is an ellipsoid, Proposition 37 gives a closed-form expression for the optimal value and a minimizer, which can be computed in polynomial time (it amounts to solving a k × k linear system and basic matrix–vector operations).

Thus, each iteration runs in polynomial time and makes at most d − m FI calls. Finally, by Theorem 20, Algorithm 1 executes at most d⋆ + 1 ≤ d + 1 iterations and makes at most d⋆ oracle queries of the form q⊤c. Combining the per-iteration bound with the iteration bound yields the claimed overall polynomial running time.

C Proofs and technical details for Section 6

Proof of Theorem 26. We give an explicit construction in which the intrinsic dimension d⋆ and the ambient dimension can be chosen independently.

Feasible region. Fix integers d ≥ d⋆ ≥ 2.
Let m := d, set A := [I_d  I_d] ∈ R^{d×2d} and b := 1 ∈ R^d, and define the lifted polytope

X := {z = (x, s) ∈ R^{2d} : Az = b, z ≥ 0} = {(x, s) : x + s = 1, x ≥ 0, s ≥ 0},

which is an extended formulation of the hypercube [0, 1]^d obtained by introducing slack variables. Then X is bounded and nondegenerate: at every extreme point, for each j ∈ [d] exactly one of x_j, s_j equals 1 and the other equals 0, so every extreme point has exactly m = d strictly positive components.

Prior set and a rare-types distribution. Let µ ∈ R^d be defined coordinatewise by

µ_j = 0.99 for j ∈ {1, ..., d⋆},  µ_j = 10 for j ∈ {d⋆ + 1, ..., d},

and define the lifted center µ̄ := (µ, 0) ∈ R^{2d}. Let the convex prior set be the lifted radius-1 Euclidean ball

C := {(c, 0) ∈ R^{2d} : ∥c − µ∥₂ ≤ 1}.

For each i ∈ {1, ..., d⋆} define the lifted costs and query directions

c^(i) := (µ − e_i, 0) ∈ C,  δ_i := (−e_i, e_i) ∈ R^{2d}.

Fix ε ∈ (0, 1/4) and let k := d⋆ − 1. Define a distribution P_c supported on {c^(1), ..., c^(d⋆)} ⊆ C by

P(c = c^(1)) = 1 − 2ε,  P(c = c^(i)) = 2ε/k for i = 2, ..., d⋆.

We call c^(1) the common type and {c^(2), ..., c^(d⋆)} the rare types.

Step 1: Prove that dim dir(X⋆(C)) = d⋆.

Lemma 38. For every (c, 0) ∈ C, every minimizer z⋆ = (x⋆, s⋆) ∈ argmin_{(x,s)∈X} (c, 0)⊤(x, s) satisfies x⋆_j = 0 and s⋆_j = 1 for all j > d⋆. Moreover, X⋆(C) ⊇ {(0, 1), (e_1, 1 − e_1), ..., (e_{d⋆}, 1 − e_{d⋆})}. Consequently, dim dir(X⋆(C)) = d⋆.

Proof. Fix (c, 0) ∈ C and any feasible (x, s) ∈ X. Since x + s = 1, we have (c, 0)⊤(x, s) = c⊤x. Thus, the LP objective depends only on x ∈ [0, 1]^d. For each j > d⋆ we have c_j ≥ µ_j − 1 = 9 > 0, hence any minimizer must set x⋆_j = 0 (and therefore s⋆_j = 1) for all j > d⋆.
Next, µ̄ = (µ, 0) ∈ C and µ has strictly positive coordinates, so the unique minimizer for µ̄ is (x, s) = (0, 1). For each i ≤ d⋆, the cost c^(i) = (µ − e_i, 0) has c^(i)_i = −0.01 < 0 and all other coordinates c^(i)_j > 0, so the unique minimizer is (x, s) = (e_i, 1 − e_i). This proves the stated inclusion for X⋆(C). Therefore dir(X⋆(C)) contains span{(e_i, −e_i) : i = 1, ..., d⋆}, so dim dir(X⋆(C)) ≥ d⋆. On the other hand, we already showed that every optimizer has x_j = 0 for j > d⋆; hence every difference of reachable optima lies in span{(e_i, −e_i) : i = 1, ..., d⋆}, giving dim dir(X⋆(C)) ≤ d⋆.

Step 2: Without querying δ_i, type i cannot be certified.

Lemma 39. Fix any i ∈ {2, ..., d⋆} and let D ⊆ R^{2d} be any dataset that does not contain δ_i. If D ⊆ {δ_1, ..., δ_{d⋆}}, then µ̄ ∈ C(D, s(c^(i); D)). Consequently, D is not pointwise sufficient at c^(i).

Proof. Assume D ⊆ {δ_1, ..., δ_{d⋆}} and δ_i ∉ D. Every query in D is of the form δ_j = (−e_j, e_j) with j ≠ i. For such j,

δ_j⊤µ̄ = (−e_j, e_j)⊤(µ, 0) = −µ_j = −(µ − e_i)_j = (−e_j, e_j)⊤(µ − e_i, 0) = δ_j⊤c^(i).

Hence µ̄ is consistent with the same measurements as c^(i), i.e., µ̄ ∈ C(D, s(c^(i); D)). But by Lemma 38, x⋆(µ̄) = 0 (so the unique optimizer is (0, 1)) while x⋆(c^(i)) = e_i (unique optimizer (e_i, 1 − e_i)). Thus, the fiber contains two costs with different unique minimizers, so no single decision can be optimal for all costs in the fiber. Therefore D is not pointwise sufficient at c^(i).

Step 2′: On type i, the pointwise routine adds only δ_i.

Lemma 40. Fix i ∈ {1, ..., d⋆}. Run Algorithm 1 on c = c^(i) = (µ − e_i, 0) with initialization D_init ⊆ {δ_1, ..., δ_{d⋆}} such that δ_i ∉ D_init. Then the call performs exactly one augmentation and returns D = D_init ∪ {δ_i}.
In particular, during this call, the algorithm cannot add any δ_j with j ≠ i.

Proof. Let Q_k be the matrix whose columns are the directions in D_init, and set s_k := Q_k⊤c^(i). Since each δ_j = (−e_j, e_j), the fiber C_k := {c′ ∈ C : Q_k⊤c′ = s_k} fixes the coordinates c′_j = (µ − e_i)_j = µ_j for every δ_j ∈ D_init and leaves coordinate i unconstrained; all c′ ∈ C_k have the form (c̃, 0) with c̃ ∈ R^d.

(a) The first LP solve yields the vertex (e_i, 1 − e_i) and its cone. With c_in = c^(i) = (µ − e_i, 0), we have (µ − e_i)_i < 0 and (µ − e_i)_j > 0 for all j ≠ i, so the LP over X has the unique minimizer z⋆ = (x⋆, s⋆) = (e_i, 1 − e_i). For the matrix A = [I_d  I_d], this vertex corresponds to the unique feasible basis

B(i) := {x_i} ∪ {s_j : j ≠ i},  with nonbasic set N(i) := {s_i} ∪ {x_j : j ≠ i}.

For j ≠ i, increasing the nonbasic variable x_j from 0 decreases s_j from 1 to 0, hence δ(B(i), x_j) = (e_j, −e_j). Increasing the nonbasic variable s_i from 0 decreases x_i from 1 to 0, hence δ(B(i), s_i) = (−e_i, e_i) = δ_i. Therefore, the optimality cone (4) takes the explicit form

Λ(B(i)) = { (u, v) ∈ R^{2d} : u_j − v_j ≥ 0 ∀ j ≠ i, −u_i + v_i ≥ 0 }.

Since every c′ ∈ C_k has v = 0, we have C_k ⊆ {(u, 0) : u ∈ R^d} and

Λ(B(i)) ∩ {(u, 0)} = { (u, 0) : u_j ≥ 0 ∀ j ≠ i, u_i ≤ 0 }.

(b) Among facets of Λ(B(i)), only the facet for δ_i can be hit from within C_k. Because coordinate i is free in C_k, the point c̃ := (µ + e_i, 0) belongs to C_k (and to C) but has c̃_i > 0, hence c̃ ∉ Λ(B(i)) ∩ {(u, 0)}. Therefore C_k ⊄ Λ(B(i)), the containment test fails, and Algorithm 1 enters the ELSE branch, producing some witness c_out ∈ C_k \ Λ(B(i)) and considering the segment c_α := (1 − α)c_in + αc_out. We claim that for every j ≠ i,

C ∩ Λ(B(i)) ∩ {(u, v) : u_j − v_j = 0} = ∅.
Indeed, take any c = (u, v) ∈ C ∩ Λ(B(i)). Since c ∈ C we have v = 0 and ∥u − µ∥₂ ≤ 1. Moreover, c ∈ Λ(B(i)) implies u_i ≤ 0. Since µ_i = 0.99 > 0,

(u_i − µ_i)² ≥ (0 − µ_i)² = µ_i².

Thus Σ_{t≠i} (u_t − µ_t)² ≤ 1 − (u_i − µ_i)² ≤ 1 − µ_i², and hence for every j ≠ i,

|u_j − µ_j| ≤ √(1 − µ_i²)  ⇒  u_j ≥ µ_j − √(1 − µ_i²).

With µ_j ∈ {0.99, 10} and √(1 − µ_i²) = √(1 − 0.99²) < 0.15, we get u_j > 0 for all j ≠ i. Therefore u_j − v_j = u_j > 0 for all j ≠ i, proving the claim.

Now let α⋆ be the first parameter value at which the segment leaves relint(Λ(B(i))). Then c_{α⋆} ∈ C_k ∩ Λ(B(i)) lies on a facet hyperplane of Λ(B(i)). By the claim above, it cannot lie on any facet u_j − v_j = 0 with j ≠ i; hence, it must lie on the facet −u_i + v_i = 0, whose normal is exactly δ(B(i), s_i) = δ_i. Therefore, the facet-hit rule appends q_{k+1} = δ_i.

(c) After adding δ_i, the fiber becomes a singleton and the routine terminates. Appending δ_i = (−e_i, e_i) adds the constraint δ_i⊤(c′, 0) = δ_i⊤(µ − e_i, 0), i.e., −u_i = 1 − µ_i and hence u_i = µ_i − 1. Then (u_i − µ_i)² = 1, and the radius-1 constraint ∥u − µ∥₂² ≤ 1 forces Σ_{t≠i}(u_t − µ_t)² = 0; hence u = µ − e_i and c′ = c^(i). Therefore the updated fiber is the singleton {c^(i)}, the containment test succeeds, and Algorithm 1 terminates after exactly one augmentation, returning D = D_init ∪ {δ_i}.

Step 3: A coupon-collector lower bound for Algorithm 2. Let I ⊆ {2, ..., d⋆} be the set of rare indices that appear at least once among the n i.i.d. samples. By Lemma 40, Algorithm 2 learns δ_i if and only if i ∈ I. By Lemma 39, the final dataset D_n fails on every rare type i ∉ I. Therefore

R(D_n) ≥ Σ_{i ∈ {2,...,d⋆} \ I} P(c = c^(i)) = (2ε/k) · |{2, ..., d⋆} \ I|.

Let N be the number of rare samples among c_1, ..., c_n:

N = Σ_{t=1}^n 1{c_t ≠ c^(1)} ∼ Binomial(n, 2ε).
Since |I| ≤ N, the event N < k/2 implies |{2, ..., d⋆} \ I| ≥ k/2, and hence R(D_n) ≥ ε. It remains to lower bound P(N < k/2). If n ≤ k/(8ε), then E[N] = 2εn ≤ k/4, and Markov's inequality gives

P(N ≥ k/2) ≤ E[N]/(k/2) ≤ (k/4)/(k/2) = 1/2.

Hence P(N < k/2) ≥ 1/2, and on this event we have R(D_n) ≥ ε. This concludes the proof of Theorem 26.

D Proofs and technical details for Section 7

D.1 Ellipsoidal lifting

Throughout this appendix, we assume the shifted ellipsoidal prior

C := {c ∈ R^d : (c − c_0)⊤Σ^{−1}(c − c_0) ≤ 1},  Σ ≻ 0, c_0 ∈ R^d.

For any orthonormal basis U ∈ R^{d×t} we define the lifting matrix L_U := ΣU(U⊤ΣU)^{−1}, and the associated canonical lifting map lift_U : R^t → R^d by lift_U(s) := c_0 + L_U s.

Lemma 41. For any s ∈ R^t, lift_U(s) is the unique solution to

min_{c ∈ R^d} (1/2)(c − c_0)⊤Σ^{−1}(c − c_0)  s.t.  U⊤(c − c_0) = s.

In particular, it satisfies U⊤(lift_U(s) − c_0) = s and (lift_U(s) − c_0)⊤Σ^{−1}(lift_U(s) − c_0) = s⊤(U⊤ΣU)^{−1}s. Consequently, lift_U(s) ∈ C whenever s ∈ U⊤(C − c_0). When Σ = I, we have lift_U(s) = c_0 + Us.

Proof. Let z := c − c_0. The problem becomes min_{z ∈ R^d} (1/2)z⊤Σ^{−1}z s.t. U⊤z = s, which is exactly the centered-ellipsoid case after the change of variables. The Lagrangian is L(z, λ) = (1/2)z⊤Σ^{−1}z − λ⊤(U⊤z − s). Stationarity gives Σ^{−1}z − Uλ = 0, hence z = ΣUλ. Imposing the constraint yields s = U⊤z = U⊤ΣUλ, so λ = (U⊤ΣU)^{−1}s and z = ΣU(U⊤ΣU)^{−1}s = L_U s. The claimed identities follow by direct substitution.

D.2 SPO generalization in the decision-sufficient subspace (Proof of Theorem 27)

We now prove the Stage II generalization bound in Theorem 27.
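Before the proof, the identities of Lemma 41 are easy to confirm numerically; the sketch below uses an arbitrary fixed Σ ≻ 0 and orthonormal U (all concrete values are illustrative assumptions of this sketch, not quantities from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)
d, t = 5, 2
c0 = rng.normal(size=d)

# A random positive-definite Sigma and an orthonormal basis U of a t-dim subspace.
G = rng.normal(size=(d, d))
Sigma = G @ G.T + d * np.eye(d)
U, _ = np.linalg.qr(rng.normal(size=(d, t)))

# Lifting matrix L_U = Sigma U (U^T Sigma U)^{-1} and lift_U(s) = c0 + L_U s.
L_U = Sigma @ U @ np.linalg.inv(U.T @ Sigma @ U)
s = rng.normal(size=t)
c_lift = c0 + L_U @ s

# Identity 1: U^T (lift_U(s) - c0) = s.
assert np.allclose(U.T @ (c_lift - c0), s)

# Identity 2: the Sigma^{-1}-weighted norm equals s^T (U^T Sigma U)^{-1} s.
lhs = (c_lift - c0) @ np.linalg.solve(Sigma, c_lift - c0)
rhs = s @ np.linalg.solve(U.T @ Sigma @ U, s)
assert np.allclose(lhs, rhs)

# Minimality: any other point on the fiber U^T(c - c0) = s has a weighted
# norm at least as large. Perturb within ker(U^T).
v = rng.normal(size=d)
v -= U @ (U.T @ v)              # project v onto ker(U^T)
c_other = c_lift + v
assert np.allclose(U.T @ (c_other - c0), s)
assert (c_other - c0) @ np.linalg.solve(Sigma, c_other - c0) >= lhs - 1e-9
```

The minimality check works because the cross term vanishes: Σ^{−1}L_U s = U(U⊤ΣU)^{−1}s lies in span(U), which is orthogonal to the kernel perturbation v.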
The proof has three ingredients: (i) we bound the complexity of the induced decision class x⋆ ∘ H_{U⋆,d⋆} via its Natarajan dimension, (ii) we plug this bound into the Natarajan-dimension generalization theorem for the SPO loss due to El Balghiti et al. (2023), and (iii) we show that under a globally sufficient dataset, projecting to W⋆ and lifting back is decision-preserving, so restricting to the compressed class incurs no approximation error.

Step 1: Natarajan dimension of the compressed decision class. We begin by translating compressed affine predictors into a multiclass linear prediction problem over the label set X∠. This is the only place where the intrinsic dimension d⋆ enters the analysis.

Lemma 42 (Natarajan dimension bound). Let X ⊆ R^d be a bounded polytope with extreme points X∠. Let H^aff_{U⋆,d⋆} be the class of compressed affine predictors

H^aff_{U⋆,d⋆} := { f_{B,b}(ξ) := b·c_0 + L_{U⋆}Bξ : B ∈ R^{d⋆×p}, b ∈ R },

and let F^aff_{U⋆,d⋆} := x⋆ ∘ H^aff_{U⋆,d⋆} be the induced decision class. Then the Natarajan dimension satisfies d_N(F^aff_{U⋆,d⋆}) ≤ d⋆p + 1.

Proof. For any x ∈ X∠,

f_{B,b}(ξ)⊤x = b·c_0⊤x + (L_{U⋆}Bξ)⊤x = b·c_0⊤x + ⟨B, (L_{U⋆}⊤x)ξ⊤⟩.

Define the feature map and weight vector

Ψ^aff(ξ, x) := ( vec((L_{U⋆}⊤x)ξ⊤), c_0⊤x ) ∈ R^{d⋆p+1},  w_{B,b} := ( vec(B), b ) ∈ R^{d⋆p+1}.

Then F^aff_{U⋆,d⋆} is a subset of the multiclass linear hypothesis class

H_{Ψ^aff} := { ξ ↦ argmin_{x∈X∠} ⟨w, Ψ^aff(ξ, x)⟩ : w ∈ R^{d⋆p+1} },

because each f_{B,b} induces the decision rule x⋆(f_{B,b}(ξ)) = argmin_{x∈X∠} ⟨w_{B,b}, Ψ^aff(ξ, x)⟩. Finally, by Shalev-Shwartz and Ben-David (2014, Theorem 29.7), the Natarajan dimension of H_{Ψ^aff} is at most d⋆p + 1, and the same bound holds for its subset F^aff_{U⋆,d⋆}.

Step 2: Uniform SPO generalization in the compressed class. We now combine Lemma 42 with the SPO generalization bound of El Balghiti et al. (2023).
This yields a uniform convergence guarantee over H_{U⋆,d⋆} with leading complexity term scaling as d⋆p.

Lemma 43. For any δ ∈ (0, 1), with probability at least 1 − δ over an i.i.d. sample S = {(ξ_i, c_i)}_{i=1}^n, we have

R_SPO(f) ≤ R̂_SPO(f) + 2ω_X(C)·√( 2(d⋆p + 1) log(n|X∠|²) / n ) + ω_X(C)·√( log(1/δ) / (2n) ),

simultaneously for all f ∈ H_{U⋆,d⋆}, where ω_X(C) := sup_{c∈C} ( max_{x∈X} c⊤x − min_{x∈X} c⊤x ).

Proof. This follows from the Natarajan-dimension generalization bound for the SPO loss in El Balghiti et al. (2023). The SPO loss is uniformly bounded by ω_X(C) when c ∈ C. Using Lemma 42 and the inclusion H_{U⋆,d⋆} ⊆ H^aff_{U⋆,d⋆} (take b = 1), we have d_N(x⋆ ∘ H_{U⋆,d⋆}) ≤ d_N(F^aff_{U⋆,d⋆}) ≤ d⋆p + 1, which yields the stated bound.

Step 3: Global sufficiency implies lossless compression. To prove the first statement in Theorem 27, we show that if D is globally sufficient on C, then compressing any cost vector to W⋆ and lifting it back to C leaves the oracle decision unchanged.

Lemma 44. Assume D is globally sufficient on C, let W⋆ := span(D), and let U⋆ ∈ R^{d×d⋆} be an orthonormal basis of W⋆. Recall that L_{U⋆} := ΣU⋆(U⋆⊤ΣU⋆)^{−1} and lift_{U⋆}(s) := c_0 + L_{U⋆}s. Fix a deterministic tie-breaking rule so that x⋆(·) is single-valued. Then for any ĉ ∈ C, letting c̃ := lift_{U⋆}(U⋆⊤(ĉ − c_0)), we have x⋆(ĉ) = x⋆(c̃). Consequently, for any predictor f : Ξ → C, the compressed predictor f̂(ξ) := lift_{U⋆}(U⋆⊤(f(ξ) − c_0)) induces the same decisions and satisfies ℓ_SPO(f(ξ), c) = ℓ_SPO(f̂(ξ), c) for all c ∈ C.

Proof. Fix ĉ ∈ C and set s := U⋆⊤(ĉ − c_0) and c̃ := lift_{U⋆}(s) = c_0 + L_{U⋆}s. By the definition of L_{U⋆},

U⋆⊤(c̃ − c_0) = U⋆⊤L_{U⋆}s = U⋆⊤ΣU⋆(U⋆⊤ΣU⋆)^{−1}s = s = U⋆⊤(ĉ − c_0).  (⋆)

Now take any q ∈ D ⊆ W⋆ = span(U⋆) and write q = U⋆a for some a ∈ R^{d⋆}.
Then, using (⋆),

q⊤c̃ − q⊤ĉ = a⊤U⋆⊤(c̃ − ĉ) = a⊤( U⋆⊤(c̃ − c_0) − U⋆⊤(ĉ − c_0) ) = 0.

Hence s(c̃; D) = s(ĉ; D). Moreover, since ĉ ∈ C we have s ∈ U⋆⊤(C − c_0), and thus Lemma 41 implies c̃ ∈ C. Global sufficiency of D on C then yields argmin_{x∈X} ĉ⊤x = argmin_{x∈X} c̃⊤x, and the deterministic tie-breaking rule gives x⋆(ĉ) = x⋆(c̃). The predictor claim follows by applying the above argument pointwise to ĉ = f(ξ) ∈ C.

With Lemmas 43 and 44 in hand, we can now complete the proof of Theorem 27.

Proof of Theorem 27. Part (1) follows from Lemma 44 by taking f = f⋆ ∈ H(C): the compressed predictor f̂⋆(ξ) = lift_{U⋆}(U⋆⊤(f⋆(ξ) − c_0)) induces the same decisions as f⋆, hence has the same population SPO risk, and is a risk minimizer in H_{U⋆,d⋆}(C). For part (2), apply Lemma 43, which gives a uniform generalization bound over the larger class H_{U⋆,d⋆} (and therefore also over its subset H_{U⋆,d⋆}(C)).

D.3 A concrete bound for ordinary least squares

Suppose ∥ξ∥₂ ≤ 1 almost surely and the population design covariance satisfies Σ_ξ := E[ξξ⊤] ⪰ κI_p for some κ > 0. (More generally, if ∥ξ∥₂ ≤ C_ξ almost surely, the same proof yields the same bound up to an additional multiplicative factor C_ξ.) Assume the regression noise ϵ := c − µ(ξ) is conditionally mean-zero and σ-subgaussian in every direction, i.e., for all λ ∈ R and all u ∈ R^d with ∥u∥₂ = 1,

E[ exp(λu⊤ϵ) | ξ ] ≤ exp(λ²σ²/2).

Under the centered linear model (11), let Â_µ be the (multi-response) OLS estimator based on n_µ i.i.d. samples {(ξ_i, c_i)}_{i=1}^{n_µ}, obtained by regressing y_i := c_i − c_0 on ξ_i, and define

µ̂(ξ) := c_0 + Â_µξ,  ε_µ² := E_ξ ∥µ̂(ξ) − µ(ξ)∥₂² = E_ξ ∥(Â_µ − A_µ)ξ∥₂².

Lemma 45. Fix δ_µ ∈ (0, 1) and assume n_µ ≥ (8/κ) log(2p/δ_µ).
Then, with probability at least 1 − δ_µ over the regression sample, the OLS estimator is well-defined and

∥Â_µ − A_µ∥_F ≤ C_reg · (σ/√κ) · √( d(p + log(4d/δ_µ)) / n_µ ),  where one may take C_reg = 4√2.  (26)

Consequently,

sup_{∥ξ∥₂≤1} ∥µ̂(ξ) − µ(ξ)∥₂ ≤ C_reg · (σ/√κ) · √( d(p + log(4d/δ_µ)) / n_µ ),  (27)

and, in particular,

ε_µ ≤ C_reg · (σ/√κ) · √( d(p + log(4d/δ_µ)) / n_µ ).  (28)

Proof. Write the regression sample in centered form y_i = c_i − c_0 = A_µξ_i + ϵ_i, where E[ϵ_i | ξ_i] = 0 and ϵ_i is σ-subgaussian in every direction. Let Σ̂ := (1/n_µ) Σ_{i=1}^{n_µ} ξ_iξ_i⊤, and let X ∈ R^{n_µ×p} be the design matrix with rows ξ_i⊤, so that X⊤X = n_µΣ̂. The proof follows a standard non-asymptotic OLS argument: we first lower bound the minimum eigenvalue of the empirical Gram matrix via a matrix Chernoff bound (Tropp, 2012), and then control the self-normalized noise term ∥(X⊤X)^{−1/2}X⊤ϵ^(j)∥₂ using subgaussian concentration together with an ε-net (sphere-covering) argument (Vershynin, 2018).

Step 1: a lower bound on λ_min(Σ̂). Set Y_i := ξ_iξ_i⊤, so each Y_i ⪰ 0 and λ_max(Y_i) = ∥ξ_i∥₂² ≤ 1 a.s. Moreover, E[Y_i] = E[ξξ⊤] = Σ_ξ ⪰ κI_p. Applying the matrix Chernoff bound (Tropp, 2012, Theorem 1.1) with δ = 1/2 yields

P( λ_min(Σ_{i=1}^{n_µ} Y_i) ≤ (1/2)·λ_min(Σ_{i=1}^{n_µ} E[Y_i]) ) ≤ p·[ e^{−1/2} / (1/2)^{1/2} ]^{n_µκ} ≤ p·e^{−n_µκ/8}.

Under the assumed sample-size condition, p·e^{−n_µκ/8} ≤ δ_µ/2; hence with probability at least 1 − δ_µ/2,

λ_min(X⊤X) = n_µλ_min(Σ̂) ≥ n_µκ/2.

In particular, X⊤X is invertible, and the OLS estimator is well-defined on this event.

Step 2: row-wise OLS coefficient error. Let β_j⊤ denote the j-th row of A_µ and β̂_j⊤ the j-th row of Â_µ. Then the scalar response model for coordinate j is y_{i,j} = β_j⊤ξ_i + ϵ_{i,j}, and the OLS error satisfies

β̂_j − β_j = (X⊤X)^{−1}X⊤ϵ^(j),  where ϵ^(j) := (ϵ_{1,j}, ..., ϵ_{n_µ,j}) ∈ R^{n_µ}.
Define the normalized noise vector g_j := (X⊤X)^{−1/2}X⊤ϵ^(j) ∈ R^p. Conditioned on X, for any u ∈ R^p with ∥u∥₂ = 1,

u⊤g_j = u⊤(X⊤X)^{−1/2}X⊤ϵ^(j) = a⊤ϵ^(j),  where a := X(X⊤X)^{−1/2}u ∈ R^{n_µ}.

Note that ∥a∥₂² = u⊤(X⊤X)^{−1/2}X⊤X(X⊤X)^{−1/2}u = ∥u∥₂² = 1. Since {ϵ_{i,j}}_{i=1}^{n_µ} are independent and each is σ-subgaussian, we have for all λ ∈ R,

E[ exp(λu⊤g_j) | X ] = Π_{i=1}^{n_µ} E[exp(λa_iϵ_{i,j}) | X] ≤ Π_{i=1}^{n_µ} exp(λ²σ²a_i²/2) = exp(λ²σ²/2).

Thus, conditional on X, u⊤g_j is σ-subgaussian for every unit vector u. Let N be a 1/2-net of the Euclidean unit sphere in R^p with |N| ≤ 5^p (Vershynin, 2018). For any fixed u ∈ N and any t > 0, the subgaussian tail bound gives

P( |u⊤g_j| ≥ σ√(2t) | X ) ≤ 2e^{−t}.

Taking a union bound over N and choosing t := p log 5 + log(2/δ_j) gives

P( max_{u∈N} |u⊤g_j| ≥ σ√(2t) | X ) ≤ |N|·2e^{−t} ≤ δ_j.

On the complementary event, the standard net argument yields ∥g_j∥₂ ≤ 2 max_{u∈N} |u⊤g_j|, so with conditional probability at least 1 − δ_j,

∥g_j∥₂ ≤ 2σ√(2t) = 2σ√( 2(p log 5 + log(2/δ_j)) ) ≤ 4σ√( p + log(2/δ_j) ),

where we used log 5 ≤ 2 and log(2/δ_j) ≥ 0. Set δ_j := δ_µ/(2d), so log(2/δ_j) = log(4d/δ_µ). Then, with probability at least 1 − δ_µ/2 over the noise (and conditional on X), the above bound holds simultaneously for all j = 1, ..., d by a union bound.

Step 3: combine and translate to prediction error. On the intersection of the events from Steps 1 and 2 (which has probability at least 1 − δ_µ), for each j,

∥β̂_j − β_j∥₂ = ∥(X⊤X)^{−1/2}g_j∥₂ ≤ ∥g_j∥₂ / √(λ_min(X⊤X)) ≤ 4σ√(p + log(4d/δ_µ)) / √(n_µκ/2) = (4√2 σ/√κ)·√( (p + log(4d/δ_µ)) / n_µ ).

Therefore,

∥Â_µ − A_µ∥_F² = Σ_{j=1}^d ∥β̂_j − β_j∥₂² ≤ (32σ²/κ) · d(p + log(4d/δ_µ)) / n_µ.

Taking square roots yields (26).
Moreover, for every $\xi$ with $\|\xi\|_2 \le 1$,
\[
\|\hat\mu(\xi) - \mu(\xi)\|_2 = \|(\hat A_\mu - A_\mu)\xi\|_2 \le \|\hat A_\mu - A_\mu\|_F \|\xi\|_2 \le \|\hat A_\mu - A_\mu\|_F,
\]
which implies (27). Finally, since $\|\xi\|_2 \le 1$ a.s.,
\[
\varepsilon_\mu^2 = \mathbb{E}_\xi \|(\hat A_\mu - A_\mu)\xi\|_2^2 \le \|\hat A_\mu - A_\mu\|_F^2,
\]
which implies (28).

D.4 Stage-I representation error bound (Proof of Theorem 29)

We first relate the regression error of $\hat\mu$ to the probability that the induced plug-in decision disagrees with the Bayes rule. Fix $\eta > 0$. On the event that $\mu(\xi)$ lies at distance greater than $\eta$ from the cone boundary $\mathcal{B}_{\mathcal{X}}$, the optimal extreme point is constant throughout the ball $B(\mu(\xi), \eta)$. Hence, if the regression estimate is $\eta$-accurate and the learned dataset $\hat{\mathcal{D}}$ is pointwise sufficient at $\hat\mu(\xi)$, then the lifted predictor $\tilde\mu(\xi)$ induces the same unique decision as $\mu(\xi)$. The lemma below formalizes this decomposition.

Lemma 46. Assume $c \in \mathcal{C}$ almost surely, and let $\tilde\mu$ be defined in (12). Under Assumption 28, define for $\eta > 0$
\[
\tau_\mu(\eta) := \mathbb{P}_\xi\big[\|\hat\mu(\xi) - \mu(\xi)\|_2 > \eta\big].
\]
Then, for any $\eta > 0$,
\[
\mathbb{P}_\xi\big[x^\star(\tilde\mu(\xi)) \neq x^\star(\mu(\xi))\big] \le \mathbb{P}_\xi\big[\hat{\mathcal{D}} \text{ is not pointwise sufficient at } \hat\mu(\xi)\big] + \tau_\mu(\eta) + C_{\mathrm{marg}}\, \eta^\alpha. \tag{29}
\]

Proof. Define the events
\[
A(\xi) := \{\hat{\mathcal{D}} \text{ is pointwise sufficient at } \hat\mu(\xi)\}, \qquad B(\xi) := \{\|\hat\mu(\xi) - \mu(\xi)\|_2 \le \eta\},
\]
\[
C(\xi) := \{\mathrm{dist}(\mu(\xi), \mathcal{B}_{\mathcal{X}}) > \eta\}, \qquad D(\xi) := \{\tilde\mu(\xi) \notin \mathcal{B}_{\mathcal{X}}\}.
\]
Note that $\mathbb{P}_\xi[D(\xi)^c] = 0$ by Assumption 28.

Step 1: $\tilde\mu(\xi)$ is fiber-equivalent to $\hat\mu(\xi)$ under $\hat{\mathcal{D}}$. Since $\hat\mu(\xi) \in \mathcal{C}$ a.s. and $\hat U$ spans $\hat{\mathcal{D}}$, we have $\hat U^\top(\hat\mu(\xi) - c_0) \in \hat U^\top(\mathcal{C} - c_0)$, and thus Lemma 41 implies $\tilde\mu(\xi) \in \mathcal{C}$. Moreover, using the definition of $\tilde\mu$ and the lifting operator $\mathrm{lift}_{\hat U}(s) = c_0 + L_{\hat U} s$ (see (8)), we have
\[
\hat U^\top(\tilde\mu(\xi) - c_0) = \hat U^\top L_{\hat U} \hat U^\top(\hat\mu(\xi) - c_0) = \hat U^\top \Sigma \hat U \big(\hat U^\top \Sigma \hat U\big)^{-1} \hat U^\top(\hat\mu(\xi) - c_0) = \hat U^\top(\hat\mu(\xi) - c_0).
\]
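The cancellation $\hat U^\top L_{\hat U} = I$ used above can be checked directly. The sketch below assumes the lifting operator has the form $L_{\hat U} = \Sigma\,\hat U(\hat U^\top \Sigma \hat U)^{-1}$ suggested by the display in Step 1; the covariance, basis, and dimensions are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)
d, t = 10, 3  # hypothetical ambient / compressed dimensions

# Random PSD covariance Sigma and orthonormal basis U_hat with t columns
M = rng.normal(size=(d, d))
Sigma = M @ M.T + np.eye(d)
U = np.linalg.qr(rng.normal(size=(d, t)))[0]

# Lifting operator consistent with Step 1: L_U = Sigma U (U^T Sigma U)^{-1}
L = Sigma @ U @ np.linalg.inv(U.T @ Sigma @ U)

c0 = rng.normal(size=d)
mu_hat = rng.normal(size=d)                # stand-in for mu_hat(xi)
mu_tilde = c0 + L @ (U.T @ (mu_hat - c0))  # lifted predictor

# Fiber equivalence: U^T (mu_tilde - c0) == U^T (mu_hat - c0),
# hence q^T mu_tilde == q^T mu_hat for every q = U a in span(U).
assert np.allclose(U.T @ (mu_tilde - c0), U.T @ (mu_hat - c0))
a = rng.normal(size=t)
q = U @ a
assert np.isclose(q @ mu_tilde, q @ mu_hat)
print("fiber equivalence holds")
```

The second assertion is exactly the computation carried out in the next display of the proof: any $q \in \mathrm{span}(\hat U)$ cannot distinguish $\tilde\mu(\xi)$ from $\hat\mu(\xi)$.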
Therefore, for any $q \in \hat{\mathcal{D}} \subseteq \mathrm{span}(\hat U)$ we can write $q = \hat U a$ for some $a \in \mathbb{R}^t$, and thus
\[
q^\top \tilde\mu(\xi) - q^\top \hat\mu(\xi) = a^\top \hat U^\top\big(\tilde\mu(\xi) - \hat\mu(\xi)\big) = a^\top\Big(\hat U^\top(\tilde\mu(\xi) - c_0) - \hat U^\top(\hat\mu(\xi) - c_0)\Big) = 0.
\]
In particular, $\tilde\mu(\xi)$ and $\hat\mu(\xi)$ lie in the same fiber $\mathcal{C}_{\hat{\mathcal{D}}, s}(\hat\mu(\xi); \hat{\mathcal{D}})$.

Step 2: on $A(\xi) \cap B(\xi) \cap C(\xi) \cap D(\xi)$, the oracle decisions coincide. On $A(\xi)$, pointwise sufficiency at $\hat\mu(\xi)$ means that there exists some decision $x_{\mathrm{ps}}(\xi) \in \mathcal{X}$ such that
\[
x_{\mathrm{ps}}(\xi) \in \mathcal{X}^\star(c') \quad \forall\, c' \in \mathcal{C}_{\hat{\mathcal{D}}, s}(\hat\mu(\xi); \hat{\mathcal{D}}).
\]
Since $\tilde\mu(\xi)$ is in the same fiber, we have $x_{\mathrm{ps}}(\xi) \in \mathcal{X}^\star(\tilde\mu(\xi))$ and also $x_{\mathrm{ps}}(\xi) \in \mathcal{X}^\star(\hat\mu(\xi))$. Next, on $B(\xi) \cap C(\xi)$ we have $\hat\mu(\xi) \in B(\mu(\xi), \eta)$ while $B(\mu(\xi), \eta)$ is contained in the interior of a single normal cone. Hence the optimal extreme point is unique throughout this ball, and in particular $\mathcal{X}^\star(\hat\mu(\xi)) = \{x^\star(\hat\mu(\xi))\}$ and $x^\star(\hat\mu(\xi)) = x^\star(\mu(\xi))$. Finally, on $D(\xi)$ we have $\tilde\mu(\xi) \notin \mathcal{B}_{\mathcal{X}}$, so $\mathcal{X}^\star(\tilde\mu(\xi)) = \{x^\star(\tilde\mu(\xi))\}$ is also a singleton. Since $x_{\mathrm{ps}}(\xi)$ is optimal for both $\hat\mu(\xi)$ and $\tilde\mu(\xi)$, uniqueness forces
\[
x^\star(\tilde\mu(\xi)) = x_{\mathrm{ps}}(\xi) = x^\star(\hat\mu(\xi)) = x^\star(\mu(\xi)).
\]

Step 3: conclude by a union bound. Thus, on $A(\xi) \cap B(\xi) \cap C(\xi) \cap D(\xi)$ we have $x^\star(\tilde\mu(\xi)) = x^\star(\mu(\xi))$, and therefore
\[
\mathbb{P}_\xi\big[x^\star(\tilde\mu(\xi)) \neq x^\star(\mu(\xi))\big] \le \mathbb{P}_\xi[A(\xi)^c] + \mathbb{P}_\xi[B(\xi)^c] + \mathbb{P}_\xi[C(\xi)^c] + \mathbb{P}_\xi[D(\xi)^c].
\]
The first term is the pointwise-sufficiency failure probability at $\hat\mu(\xi)$. The second term is exactly $\tau_\mu(\eta)$. The third term is controlled by Assumption 28, giving $\mathbb{P}[C(\xi)^c] \le C_{\mathrm{marg}}\, \eta^\alpha$. Finally, $\mathbb{P}[D(\xi)^c] = 0$ by Assumption 28, which yields (29).
Combining the tail-form transfer bound (29), the certificate guarantee of Theorem 25 for Algorithm 2, and the bounded-design OLS control of Lemma 45 yields the main finite-sample bound on the representation-induced error of Stage I.

Proof of Theorem 29. On the regression event from Lemma 45, equation (27) implies that $\|\hat\mu(\xi) - \mu(\xi)\|_2 \le r_{\mu,\delta_\mu}$ for every $\xi$ with $\|\xi\|_2 \le 1$. Since $\|\xi\|_2 \le 1$ almost surely, this gives
\[
\tau_\mu\big(r_{\mu,\delta_\mu}\big) = \mathbb{P}_\xi\big[\|\hat\mu(\xi) - \mu(\xi)\|_2 > r_{\mu,\delta_\mu}\big] = 0.
\]
Applying (29) with $\eta = r_{\mu,\delta_\mu}$ therefore yields
\[
\mathbb{P}_\xi\big[x^\star(\tilde\mu(\xi)) \neq x^\star(\mu(\xi))\big] \le \mathbb{P}_\xi\big[\hat{\mathcal{D}} \text{ is not pointwise sufficient at } \hat\mu(\xi)\big] + C_{\mathrm{marg}}\, r_{\mu,\delta_\mu}^\alpha.
\]
On the same regression event, Theorem 25 applied to the pseudo-cost sample $\{\hat c_j\}_{j=1}^{n_{\mathrm{disc}}}$ gives, with probability at least $1 - \delta$ over the discovery contexts,
\[
\mathbb{P}_\xi\big[\hat{\mathcal{D}} \text{ is not pointwise sufficient at } \hat\mu(\xi)\big] \le \frac{4}{n_{\mathrm{disc}}}\big(6|T| + \log(e/\delta)\big).
\]
Combining the two displays and taking a union bound over the regression and discovery samples proves (14).

For the SPO misspecification bound, note that for any predictor $f$,
\[
R_{\mathrm{SPO}}(f) - R_{\mathrm{SPO}}(f^\star) = \mathbb{E}_\xi\Big[\mu(\xi)^\top x^\star(f(\xi)) - \mu(\xi)^\top x^\star(\mu(\xi))\Big] \ge 0,
\]
and for each $\xi$ the bracketed difference is at most $\omega_{\mathcal{X}}(\mathcal{C})$ and is zero whenever the two decisions coincide. Hence
\[
R_{\mathrm{SPO}}(f) - R_{\mathrm{SPO}}(f^\star) \le \omega_{\mathcal{X}}(\mathcal{C})\, \mathbb{P}_\xi\big[x^\star(f(\xi)) \neq x^\star(\mu(\xi))\big].
\]
Apply this to the specific candidate $f = \tilde\mu$. By (13) we have $\tilde\mu \in \mathcal{H}_{\hat U, t}$, so the optimality of $\hat f^\star$ in $\mathcal{H}_{\hat U, t}$ yields $R_{\mathrm{SPO}}(\hat f^\star) \le R_{\mathrm{SPO}}(\tilde\mu)$. Combining these facts with (14) gives (15).
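The transfer inequality $R_{\mathrm{SPO}}(f) - R_{\mathrm{SPO}}(f^\star) \le \omega_{\mathcal{X}}(\mathcal{C})\,\mathbb{P}_\xi[x^\star(f(\xi)) \neq x^\star(\mu(\xi))]$ can be illustrated on a toy instance. Everything in the sketch below is hypothetical: the decision set is the vertex set of the unit square, costs lie in the unit ball, the perturbed predictor is synthetic, and $\omega_{\mathcal{X}}(\mathcal{C})$ is taken as the worst-case cost gap $\sqrt{2}$ for this instance.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy decision set: vertices of the unit square (a 0/1 polytope)
V = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

def x_star(c):
    """Oracle decision: cost-minimizing vertex (lowest index breaks ties)."""
    return V[np.argmin(V @ c)]

n = 20000
mu = rng.normal(size=(n, 2))
mu /= np.maximum(np.linalg.norm(mu, axis=1, keepdims=True), 1.0)  # ||mu||_2 <= 1
f = mu + 0.3 * rng.normal(size=(n, 2))  # hypothetical perturbed predictor

gaps = np.empty(n)
disagree = np.empty(n, dtype=bool)
for i in range(n):
    xm, xf = x_star(mu[i]), x_star(f[i])
    gaps[i] = mu[i] @ (xf - xm)          # per-context SPO excess, always >= 0
    disagree[i] = not np.array_equal(xf, xm)

# For this instance, omega_X(C) = max_{||c||<=1} max_{x,x'} c^T(x - x') = diam(X) = sqrt(2)
omega = np.sqrt(2)
lhs, rhs = gaps.mean(), omega * disagree.mean()
print(lhs, rhs, lhs <= rhs)
```

The inequality in fact holds samplewise here: each per-context excess is zero when the decisions agree and at most $\|\mu(\xi)\|_2\,\mathrm{diam}(\mathcal{X}) \le \sqrt{2}$ when they disagree, so the Monte Carlo averages necessarily satisfy it.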